[GE users] Long delay when submitting large jobs

christian reissmann Christian.Reissmann at Sun.COM
Tue Jan 18 09:03:53 GMT 2005


Hi Craig,

there are some bugs in the communication lib:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1389

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1400

They may be the reason for your problems.

BTW: You can use the qping -info command to check how many messages
are in the commlib read/write buffers, as well as the response time of
the communication library:

o The qping answer is sent by the communication library threads.

o A GDI message is processed by the qmaster threads.



Example:

 %qping -info gridware $SGE_QMASTER_PORT qmaster 1
01/18/2005 09:49:12:
SIRM version:             0.1
SIRM message id:          1
start time:               01/17/2005 22:47:02 (1105998422)
run time [s]:             39730
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8
status:                   0
info:                     EDT: R (0.63) | TET: R (1.64) | MT: R (0.24) |
SIGT: R (39729.65) | ok
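If you want to watch those buffer counters from a script, the fields can be parsed out of the qping output with awk. The snippet below is a minimal sketch that parses a canned copy of the sample output above (the value 12 is made up for illustration); in real use the text would come from running qping -info against your qmaster:

```shell
# Sketch: extract "messages in read buffer" from qping -info output.
# In practice the sample text would come from:
#   qping -info <qmaster-host> $SGE_QMASTER_PORT qmaster 1
sample='SIRM version:             0.1
SIRM message id:          1
run time [s]:             39730
messages in read buffer:  12
messages in write buffer: 0
nr. of connected clients: 8'

# Split each line on "colon plus spaces" and pick the read-buffer value.
read_buf=$(printf '%s\n' "$sample" | awk -F': *' '/messages in read buffer/ {print $2}')

# A persistently growing read buffer suggests the qmaster threads are
# not keeping up with the communication library threads.
if [ "$read_buf" -gt 0 ]; then
    echo "WARN: $read_buf messages queued in commlib read buffer"
fi
```

Run periodically (e.g. from cron), this gives a quick signal for the kind of GDI stalls discussed below.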


The maximum number of connected clients depends on the file descriptor
limit and should stay below that limit, because the commlib reserves
some file descriptors for the application (qmaster).

The number of file descriptors used for communication is logged in the
qmaster messages file at qmaster startup:

"qmaster will use max. 1004 file descriptors for communication"

If the number of execds exceeds the number of usable file descriptors,
it is better to raise the file descriptor limit on your qmaster host.
If the maximum file descriptor limit is reached, the commlib starts
closing connections to the execds and reopens them when necessary.
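A rough sizing check can be scripted. The numbers below are illustrative assumptions (one descriptor per execd plus a reserve of 20 for qmaster itself); the real reserve is commlib-internal, so treat this as a sketch, not the exact accounting:

```shell
# Rough sizing sketch with assumed numbers: one descriptor per connected
# execd, plus a reserve the commlib keeps back for qmaster (the actual
# reserve is commlib-internal; 20 here is only an illustration).
execds=600
reserve=20
needed=$((execds + reserve))

soft=$(ulimit -Sn)    # current soft fd limit of this shell
echo "need ~$needed descriptors, soft limit is $soft"
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$needed" ]; then
    echo "raise the limit (e.g. 'ulimit -n $needed') before starting sge_qmaster"
fi
```

The limit must be raised in the environment that starts sge_qmaster, since child processes inherit it from there.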


Best Regards,

Christian




Craig Tierney wrote:
> On Mon, 2005-01-17 at 18:54, Ron Chen wrote:
> 
>>Before you upgraded to SGE 6.0, did you see similar
>>problems with SGE 5.3?
>>
> 
> 
> We have been running SGE 6.0u1 with NFS spool for a couple of months.
> We are happy with it and all the new features.  The
> big problem is job transition.  Since Stephan suggested
> that BDB performs better, we tried that.  
> 
> 
> 
>>If SGE 5.3 works fine, may be it's related to the new
>>threaded communication library (SGE 5.3 uses commd).
> 
> 
> I think that while jobs transition from 'qw' to 't',
> the code path that handles GDI communications isn't threaded,
> and it blocks during job transition.  The bigger the job, the longer
> the wait.
> 
> I am tweaking max_unheard to a much larger number to see
> if that helps.  That seemed like one of the loops in the
> code that could generate a lot of IO and hold things up.
> 
> Craig
> 
> 
> 
> 
>>(pure guessing)
>>
>> -Ron
>>
>>
>>--- Craig Tierney <ctierney at hpti.com> wrote:
>>
>>>I reinstalled SGE temporarily using BDB to see if
>>>that
>>>would improve startup times.  It took about 75
>>>seconds
>>>for a 512 processor job to transition from 'qw' to
>>>'t'.
>>>
>>>The server was running BDB and it was installed on
>>>the local disk.
>>>The $SGE_ROOT/default was still on NFS because the
>>>other nodes
>>>do not have disks.  However, the IO to the
>>>filesystem from the clients
>>>is small.
>>>
>>>I ran strace on one of the sge_qmaster processes to
>>>try and
>>>see what is going on.  I can't pick out exactly what
>>>is going
>>>on, but I did see that /etc/hosts was mmap'ed once
>>>for each
>>>node.  I know that /etc/hosts should be cached, but
>>>I don't see
>>>why gethostbyname (or whichever function it is)
>>>needs to be called
>>>directly for each host.  The file shouldn't be
>>>changing during a
>>>job startup.
>>>
>>>There were many other mmap/munmap calls as well as
>>>calls to
>>>gettimeofday.  However, I couldn't correlate it to
>>>exactly what
>>>it was doing.
>>>
>>>When qmaster starts up a job, does it talk to each
>>>host, one by
>>>one, setting up the job information?  The scheduler
>>>actually picks
>>>the nodes used, correct?  If qmaster is talking to
>>>each node,
>>>is it done serially or are multiple requests sent
>>>out simultaneously?
>>>
>>>Thanks,
>>>Craig
>>>
>>>
>>>
>>>
>>>
>>>
>>>>The execd should also spool locally. What is the
>>>
>>>reason for not doing
>>>
>>>>it?
>>>>
>>>>>  
>>>>>
>>>>>>5) any staging activity between master and
>>>
>>>compute nodes?
>>>
>>>>>>    
>>>>>
>>>>>No.
>>>>>
>>>>>I don't care if my job takes 10 minutes to
>>>
>>>start.  That isn't
>>>
>>>>>the problem.  It is that the batch system hangs
>>>
>>>during this time.
>>>
>>>>>That it should not do.  It is not dependent on
>>>
>>>the type of job, 
>>>
>>>>>just the number of cpus (nodes) used.
>>>>>
>>>>>Thanks,
>>>>>Craig
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>>>regards
>>>>>>
>>>>>>
>>>>>>    
>>>>>
>>>>>It has nothing to do with the binary.  This is
>>>
>>>the time
>>>
>>>>>before the job script is actually launched.  I
>>>
>>>don't even
>>>
>>>>>think this time covers the prolog/epilog
>>>
>>>execution.  My
>>>
>>>>>prolog/epilog can run long (touches all nodes in
>>>
>>>parallel), but
>>>
>>>>>the batch system shouldn't be waiting on that.
>>>>>
>>>>>Craig
>>>>>
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>>>On Fri, 14 Jan 2005 11:29:58 -0700, Craig
>>>
>>>Tierney <ctierney at hpti.com> wrote:
>>>
>>>>>>    
>>>>>>
>>>>>>>I have been running SGE6.0u1 for a few
>>>
>>>months now on a new system.
>>>
>>>>>>>I have noticed very long delays, or even SGE
>>>
>>>hangs, when starting
>>>
>>>>>>>large jobs.  I just tried this on the latest
>>>
>>>CVS source and
>>>
>>>>>>>the problem persists.
>>>>>>>
>>>>>>>It appears that it hangs while the job is
>>>
>>>moved from 'qw' to 't'.
>>>
>>>>>>>In general the system does continue to
>>>
>>>operate normally.  However
>>>
>>>>>>>the delays can be large, 30-60 seconds. 
>>>
>>>'Hang' is defined as
>>>
>>>>>>>system commands like qsub and qstat will
>>>
>>>delay until the job
>>>
>>>>>>>has finished migrating to the 't' status. 
>>>
>>>Sometimes the delays
>>>
>>>>>>>are long enough to get GDI failures.  Since
>>>
>>>qmaster is threaded,
>>>
>>>>>>>I wonder why I get the hangs.
>>>>>>>
>>>>>>>I have tried debugging the situation. 
>>>
>>>Either the hang is in qmaster,
>>>
>>>>>>>or sge_schedd is not printing enough
>>>
>>>information.
>>>
>>>>>>>Here is some of the text from the sge_schedd
>>>
>>>debug for a 256 cpu job
>>>
>>>>>>>using a cluster queue.
>>>>>>>
>>>>>>> 79347   7886 16384     J=179999.1
>>>
>>>T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129
>>>R=slots U=2.000000
>>>
>>>>>>> 79348   7886 16384     J=179999.1
>>>
>>>T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130
>>>R=slots U=2.000000
>>>
>>>>>>> 79349   7886 16384     J=179999.1
>>>
>>>T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131
>>>R=slots U=2.000000
>>>
>>>>>>> 79350   7886 16384     Found NOW assignment
>>>>>>> 79351   7886 16384     reresolve port
>>>
>>>timeout in 536
>>>
>>>>>>> 79352   7886 16384     returning cached
>>>
>>>port value: 536
>>>
>>>>>>>scheduler tries to schedule job 179999.1
>>>
>>>twice
>>>
>>>>>>> 79353   7886 16384        added 0 ticket
>>>
>>>orders for queued jobs
>>>
>>>>>>> 79354   7886 16384     SENDING 10 ORDERS TO
>>>
>>>QMASTER
>>>
>>>>>>> 79355   7886 16384     RESETTING BUSY STATE
>>>
>>>OF EVENT CLIENT
>>>
>>>>>>> 79356   7886 16384     reresolve port
>>>
>>>timeout in 536
>>>
>>>>>>> 79357   7886 16384     returning cached
>>>
>>>port value: 536
>>>
>>>>>>> 79358   7886 16384     ec_get retrieving
>>>
>>>events - will do max 3 fetches
>>>
>>>>>>>The hang happens after line 79352.  In this
>>>
>>>instance the message
>>>
>>>>>>>indicates the scheduler tried twice.  Other
>>>
>>>times, I get a timeout
>>>
>>>>>>>at this point.  In either case, the output
>>>
>>>pauses in the same
>>>
>>>>>>>manner that a call to qsub or qstat would.
>>>>>>>
>>>>>>>I have followed the optimization procedures
>>>
>>>listed on the website
>>>
>>>>>>>and they didn't seem to help (might have
>>>
>>>missed some though).
>>>
>>>>>>>I don't have any information from
>>>
>>>sge_qmaster.  I tried several
>>>
>>>>>>>different SGE_DEBUG_LEVEL settings, but
>>>
>>>sge_qmaster would always
>>>
>>>>>>>stop providing information after
>>>
>>>daemonizing.
>>>
>>>>>>>System configuration:
>>>>>>>
>>>>>>>Qmaster runs on Fedora Core 2, x86, (2.2 Ghz
>>>
>>>Xeon)
>>>
>>>>>>>clients (execd) run on Suse 9.1 x86_64, (3.2
>>>
>>>Ghz EM64T)
>>>
>>>>>>>SGE is configured to use old style spooling
>>>
>>>over NFS
>>>
>>>>>>>I can provide more info, I just don't know
>>>
>>>where to go from here.
>>>
>>>>>>>Thanks,
>>>
>>=== message truncated ===
>>
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
> 
> 
> 
> 

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222
http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0





