[GE users] Long delay when submitting large jobs (resend)

christian reissmann Christian.Reissmann at Sun.COM
Tue Jan 18 09:52:36 GMT 2005


Resending my last answer. We are observing unreliable mail delivery,
so some of you may not have received the first copy.

christian reissmann wrote:
> Hi Craig,
> 
> there are some known bugs in the communication library:
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1389
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1400
> 
> They may be the reason for your problems.
> 
> BTW: You can use the qping -info command to check how many messages
> are in the commlib read/write buffers, and the response time of the
> communication library:
> 
> o The qping answer is sent by the communication library threads.
> 
> o A GDI message is processed by the qmaster threads.
> 
> 
> 
> Example:
> 
>  %qping -info gridware $SGE_QMASTER_PORT qmaster 1
> 01/18/2005 09:49:12:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               01/17/2005 22:47:02 (1105998422)
> run time [s]:             39730
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 8
> status:                   0
> info:                     EDT: R (0.63) | TET: R (1.64) | MT: R (0.24) |
> SIGT: R (39729.65) | ok
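> If you want to watch these values over time, the buffer and client
> counts can be pulled out of the qping output with a short script.
> A sketch (fed here with sample output copied from above; on a live
> system pipe the real qping output into the same awk):

```shell
# Extract the buffer counts and client count from `qping -info` output.
# The sample text stands in for real output; on a live system use:
#   qping -info <qmaster-host> $SGE_QMASTER_PORT qmaster 1 | awk ...
sample='messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8'
printf '%s\n' "$sample" | awk -F': *' '/buffer|clients/ {print $1 "=" $2}'
```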
> 
> 
> The maximum number of connected clients depends on the file
> descriptor limit and will be less than that limit, because the
> commlib reserves some file descriptors for the application (qmaster)
> itself.
> 
> The number of file descriptors used for communication is logged into the
> qmaster messages file
> at qmaster startup:
> 
> "qmaster will use max. 1004 file descriptors for communication"
> 
> If the number of execds exceeds the number of usable file
> descriptors, you should raise the file descriptor limit on your
> qmaster host. Once the maximum file descriptor limit is reached, the
> commlib starts closing connections to the execds and reopens them
> when necessary.
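> A quick way to see the limit the commlib will run into is the
> shell's ulimit builtin on the qmaster host (a sketch; how to raise
> the limit permanently depends on your OS and installation):

```shell
# Show the per-process file descriptor limits that bound how many
# execd connections qmaster can keep open at once.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "soft fd limit: $soft  hard fd limit: $hard"
# To raise it, bump the limit in the shell that starts sge_qmaster,
# e.g. `ulimit -n 4096` before launching it (assumption: where your
# startup script lives depends on the installation).
```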
> 
> 
> Best Regards,
> 
> Christian
> 
> 
> 
> 
> Craig Tierney wrote:
> 
>>On Mon, 2005-01-17 at 18:54, Ron Chen wrote:
>>
>>
>>>Before you upgraded to SGE 6.0, did you see similar
>>>problems with SGE 5.3?
>>>
>>
>>
>>We have been running SGE 6.0u1 with NFS spooling for a couple of
>>months.  We are happy with it and all the new features.  The big
>>problem is job transition.  Since Stephan suggested that BDB
>>performs better, we tried that.
>>
>>
>>
>>
>>>If SGE 5.3 works fine, may be it's related to the new
>>>threaded communication library (SGE 5.3 uses commd).
>>
>>
>>I think that while jobs transition from 'qw' to 't', the code that
>>handles GDI communication is not running in a separate thread, so it
>>blocks during job transition.  The bigger the job, the longer
>>the wait.
>>
>>I am tweaking max_unheard to a much larger number to see if that
>>helps.  That seemed like one of the loops in the code that could
>>generate a lot of IO and hold things up.
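>>For reference, the current max_unheard value can be checked with
>>qconf (a sketch; assumes qconf is on the PATH and the SGE
>>environment has been sourced):

```shell
# Print the current max_unheard value from the global cluster
# configuration, guarded so the snippet degrades cleanly on a host
# without SGE installed.
if command -v qconf >/dev/null 2>&1; then
  qconf -sconf | grep max_unheard
else
  echo "qconf not found - source the SGE environment first"
fi
```

>>Changing it is done with `qconf -mconf`, which opens the global
>>configuration in your editor.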
>>
>>Craig
>>
>>
>>
>>
>>
>>>(pure guessing)
>>>
>>>-Ron
>>>
>>>
>>>--- Craig Tierney <ctierney at hpti.com> wrote:
>>>
>>>
>>>>I reinstalled SGE temporarily using BDB to see if that would
>>>>improve startup times.  It took about 75 seconds for a 512
>>>>processor job to transition from 'qw' to 't'.
>>>>
>>>>The server was running BDB and it was installed on the local
>>>>disk.  The $SGE_ROOT/default directory was still on NFS because
>>>>the other nodes do not have disks.  However, the IO to the
>>>>filesystem from the clients is small.
>>>>
>>>>I ran strace on one of the sge_qmaster processes to try and see
>>>>what is going on.  I can't pick out exactly what is happening,
>>>>but I did see that /etc/hosts was mmap'ed once for each node.  I
>>>>know that /etc/hosts should be cached, but I don't see why
>>>>gethostbyname (or whichever function it is) needs to be called
>>>>directly for each host.  The file shouldn't be changing during a
>>>>job startup.
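>>>>Counting those accesses in an strace log is easy to script.  A
>>>>sketch, fed here with sample trace lines (in practice capture a
>>>>real log with `strace -f -o qmaster.trace -p <qmaster pid>`):

```shell
# Count how many times /etc/hosts shows up in an strace log of
# sge_qmaster.  The sample lines stand in for a real trace file.
log='open("/etc/hosts", O_RDONLY) = 4
mmap(NULL, 4096, PROT_READ, MAP_SHARED, 4, 0) = 0x2aaaa000
open("/etc/hosts", O_RDONLY) = 4'
printf '%s\n' "$log" | grep -c '/etc/hosts'
```

>>>>(Running nscd on the qmaster host is one general way to cache
>>>>gethostbyname results so /etc/hosts is not re-read per node;
>>>>whether that helps in this particular case is an assumption.)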
>>>>
>>>>There were many other mmap/munmap calls as well as calls to
>>>>gettimeofday.  However, I couldn't correlate them to exactly
>>>>what it was doing.
>>>>
>>>>When qmaster starts up a job, does it talk to each host, one by
>>>>one, setting up the job information?  The scheduler actually
>>>>picks the nodes used, correct?  If qmaster is talking to each
>>>>node, is it done serially or are multiple requests sent out
>>>>simultaneously?
>>>>
>>>>Thanks,
>>>>Craig
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>The execd should also spool locally. What is the reason for not
>>>>>doing it?
>>>>>
>>>>>
>>>>>> 
>>>>>>
>>>>>>
>>>>>>>5) any staging activity between master and compute nodes?
>>>>>>
>>>>>>No.
>>>>>>
>>>>>>I don't care if my job takes 10 minutes to start.  That isn't
>>>>>>the problem.  It is that the batch system hangs during this
>>>>>>time.  That it should not do.  It is not dependent on the type
>>>>>>of job, just the number of cpus (nodes) used.
>>>>>>
>>>>>>Thanks,
>>>>>>Craig
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 
>>>>>>
>>>>>>
>>>>>>>regards
>>>>>>>
>>>>>>>
>>>>>>>   
>>>>>>
>>>>>>It has nothing to do with the binary.  This is the time
>>>>>>before the job script is actually launched.  I don't even
>>>>>>think this time covers the prolog/epilog execution.  My
>>>>>>prolog/epilog can run long (touches all nodes in parallel),
>>>>>>but the batch system shouldn't be waiting on that.
>>>>>>
>>>>>>Craig
>>>>>>
>>>>>>
>>>>>>
>>>>>> 
>>>>>>
>>>>>>
>>>>>>>On Fri, 14 Jan 2005 11:29:58 -0700, Craig Tierney
>>>>>>><ctierney at hpti.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>I have been running SGE 6.0u1 for a few months now on a new
>>>>>>>>system.  I have noticed very long delays, or even SGE hangs,
>>>>>>>>when starting large jobs.  I just tried this on the latest
>>>>>>>>CVS source and the problem persists.
>>>>>>>>
>>>>>>>>It appears that the hang occurs while the job is moved from
>>>>>>>>'qw' to 't'.  In general the system does continue to operate
>>>>>>>>normally.  However the delays can be large, 30-60 seconds.
>>>>>>>>'Hang' is defined as: system commands like qsub and qstat
>>>>>>>>will delay until the job has finished migrating to the 't'
>>>>>>>>status.  Sometimes the delays are long enough to get GDI
>>>>>>>>failures.  Since qmaster is threaded, I wonder why I get the
>>>>>>>>hangs.
>>>>>>>>
>>>>>>>>I have tried debugging the situation.  Either the hang is in
>>>>>>>>qmaster, or sge_schedd is not printing enough information.
>>>>>>>>Here is some of the text from the sge_schedd debug output
>>>>>>>>for a 256 cpu job using a cluster queue.
>>>>>>>>
>>>>>>>>79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129 R=slots U=2.000000
>>>>>>>>79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130 R=slots U=2.000000
>>>>>>>>79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
>>>>>>>>79350   7886 16384     Found NOW assignment
>>>>>>>>79351   7886 16384     reresolve port timeout in 536
>>>>>>>>79352   7886 16384     returning cached port value: 536
>>>>>>>>scheduler tries to schedule job 179999.1 twice
>>>>
>>>>
>>>>>>>>79353   7886 16384        added 0 ticket orders for queued jobs
>>>>>>>>79354   7886 16384     SENDING 10 ORDERS TO QMASTER
>>>>>>>>79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
>>>>>>>>79356   7886 16384     reresolve port timeout in 536
>>>>>>>>79357   7886 16384     returning cached port value: 536
>>>>>>>>79358   7886 16384     ec_get retrieving events - will do max 3 fetches
>>>>
>>>>
>>>>>>>>The hang happens after line 79352.  In this instance the
>>>>>>>>message indicates the scheduler tried twice.  Other times, I
>>>>>>>>get a timeout at this point.  In either case, the output
>>>>>>>>pauses in the same manner that a call to qsub or qstat
>>>>>>>>would.
>>>>>>>>
>>>>>>>>I have followed the optimization procedures listed on the
>>>>>>>>website and they didn't seem to help (might have missed some
>>>>>>>>though).
>>>>>>>>
>>>>>>>>I don't have any information from sge_qmaster.  I tried
>>>>>>>>several different SGE_DEBUG_LEVEL settings, but sge_qmaster
>>>>>>>>would always stop providing information after daemonizing.
>>>>
>>>>
>>>>>>>>System configuration:
>>>>>>>>
>>>>>>>>Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon);
>>>>>>>>clients (execd) run on SuSE 9.1 x86_64 (3.2 GHz EM64T).
>>>>>>>>SGE is configured to use old style spooling over NFS.
>>>>>>>>
>>>>>>>>I can provide more info, I just don't know where to go from
>>>>>>>>here.
>>>>
>>>>
>>>>>>>>Thanks,
>>>>
>>>=== message truncated ===
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
> 
> 

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



