[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Tue Jan 18 17:00:26 GMT 2005


On Tue, 2005-01-18 at 02:03, Christian Reissmann wrote:
> Hi Craig,
> 
> there are some bugs in communication lib:
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1389
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1400
> 
> It may be the reason for your problems.

I am running the latest CVS tree, so I don't think that
id=1389 is the problem.  I am not seeing any errors related
to file descriptors.

Everything works; it is just blocking when it shouldn't be.
> 
> BTW: You can use the qping -info command to check how many messages
> are in the commlib read/write buffers and the response time of the
> communication library:
> 
> o The qping answer is sent by the communication library threads.
> 
> o A gdi message is processed by the qmaster threads.
> 

I will give this a try.

Craig

> 
> 
> Example:
> 
>  %qping -info gridware $SGE_QMASTER_PORT qmaster 1
> 01/18/2005 09:49:12:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               01/17/2005 22:47:02 (1105998422)
> run time [s]:             39730
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 8
> status:                   0
> info:                     EDT: R (0.63) | TET: R (1.64) | MT: R (0.24) | SIGT: R (39729.65) | ok
> 
> 
> The maximum nr. of connected clients depends on the file descriptor
> limit and should be less than the file descriptor limit, because the
> commlib reserves some file descriptors for the application (qmaster).
> 
> The number of file descriptors used for communication is logged in
> the qmaster messages file at qmaster startup:
> 
> "qmaster will use max. 1004 file descriptors for communication"
> 
> If the number of execds exceeds the number of usable file
> descriptors, it is better to raise the file descriptor limit on your
> qmaster host.  If the maximum file descriptor limit is reached, the
> commlib starts closing connections to the execds and reopens them
> when necessary.
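
For reference, here is a quick way to check both sides of this on the
qmaster host (a minimal sketch; the messages path assumes the default
"default" cell under $SGE_ROOT):

    # fd limits of the shell that starts sge_qmaster
    ulimit -Hn
    ulimit -Sn

    # what qmaster reserved at startup
    grep "file descriptors for communication" \
        $SGE_ROOT/default/spool/qmaster/messages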
> 
> 
> Best Regards,
> 
> Christian
> 
> 
> 
> 
> Craig Tierney wrote:
> > On Mon, 2005-01-17 at 18:54, Ron Chen wrote:
> > 
> >>Before you upgraded to SGE 6.0, did you see similar
> >>problems with SGE 5.3?
> >>
> > 
> > 
> > We have been running SGE 6.0u1 with NFS spool for a couple of
> > months.  We are happy with it and all the new features.  The big
> > problem is job transition.  Since Stephan suggested that BDB
> > performs better, we tried that.
> > 
> > 
> > 
> >>If SGE 5.3 works fine, maybe it's related to the new
> >>threaded communication library (SGE 5.3 uses commd).
> > 
> > 
> > I think that while jobs transition from 'qw' to 't', the code that
> > deals with GDI communications isn't threaded, and it blocks during
> > job transition.  The bigger the job, the longer the wait.
> > 
> > I am tweaking max_unheard to a much larger number to see if that
> > helps.  That seemed like one of the loops in the code that could
> > generate a lot of IO and hold things up.
> > 
> > Craig
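
(For anyone who wants to try the same tweak: max_unheard lives in the
global cluster configuration, so something like the following should
work.  The 00:30:00 value is only an illustration, not a
recommendation.)

    # show the current value
    qconf -sconf | grep max_unheard

    # open the global configuration in $EDITOR and raise it, e.g.
    #   max_unheard   00:30:00
    qconf -mconf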
> > 
> > 
> > 
> > 
> >>(pure guessing)
> >>
> >> -Ron
> >>
> >>
> >>--- Craig Tierney <ctierney at hpti.com> wrote:
> >>
> >>>I reinstalled SGE temporarily using BDB to see if that would
> >>>improve startup times.  It took about 75 seconds for a 512
> >>>processor job to transition from 'qw' to 't'.
> >>>
> >>>The server was running BDB and it was installed on the local
> >>>disk.  The $SGE_ROOT/default was still on NFS because the other
> >>>nodes do not have disks.  However, the IO to the filesystem from
> >>>the clients is small.
> >>>
> >>>I ran strace on one of the sge_qmaster processes to try and see
> >>>what is going on.  I can't pick out exactly what is going on, but
> >>>I did see that /etc/hosts was mmap'ed once for each node.  I know
> >>>that /etc/hosts should be cached, but I don't see why
> >>>gethostbyname (or whichever function it is) needs to be called
> >>>directly for each host.  The file shouldn't be changing during a
> >>>job startup.
> >>>
> >>>There were many other mmap/munmap calls as well as calls to
> >>>gettimeofday.  However, I couldn't correlate it to exactly what
> >>>it was doing.
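
If someone wants to reproduce that observation, here is a rough
sketch (the pgrep usage is an assumption; substitute your qmaster PID
however you like):

    # trace the running qmaster while a large job starts
    strace -f -tt -e trace=open,mmap,munmap \
        -p $(pgrep -o sge_qmaster) -o /tmp/qmaster.trace

    # let the job go from 'qw' to 't', hit Ctrl-C, then count lookups
    grep -c '/etc/hosts' /tmp/qmaster.trace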
> >>>
> >>>When qmaster starts up a job, does it talk to each host, one by
> >>>one, setting up the job information?  The scheduler actually
> >>>picks the nodes used, correct?  If qmaster is talking to each
> >>>node, is it done serially or are multiple requests sent out
> >>>simultaneously?
> >>>
> >>>Thanks,
> >>>Craig
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>The execd should also spool locally. What is the reason for not
> >>>>doing it?
> >>>>
> >>>>>>5) any staging activity between master and compute nodes?
> >>>>>
> >>>>>No.
> >>>>>
> >>>>>I don't care if my job takes 10 minutes to start.  That isn't
> >>>>>the problem.  It is that the batch system hangs during this
> >>>>>time.  That it should not do.  It is not dependent on the type
> >>>>>of job, just the number of cpus (nodes) used.
> >>>>>
> >>>>>Thanks,
> >>>>>Craig
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>  
> >>>>>
> >>>>>>regards
> >>>>>>
> >>>>>>
> >>>>>>    
> >>>>>
> >>>>>It has nothing to do with the binary.  This is the time before
> >>>>>the job script is actually launched.  I don't even think this
> >>>>>time covers the prolog/epilog execution.  My prolog/epilog can
> >>>>>run long (touches all nodes in parallel), but the batch system
> >>>>>shouldn't be waiting on that.
> >>>>>
> >>>>>Craig
> >>>>>
> >>>>>
> >>>>>
> >>>>>  
> >>>>>
> >>>>>>On Fri, 14 Jan 2005 11:29:58 -0700, Craig Tierney
> >>>>>><ctierney at hpti.com> wrote:
> >>>
> >>>>>>    
> >>>>>>
> >>>>>>>I have been running SGE6.0u1 for a few months now on a new
> >>>>>>>system.  I have noticed very long delays, or even SGE hangs,
> >>>>>>>when starting large jobs.  I just tried this on the latest
> >>>>>>>CVS source and the problem persists.
> >>>>>>>
> >>>>>>>It appears that the hang happens while the job is moved from
> >>>>>>>'qw' to 't'.  In general the system does continue to operate
> >>>>>>>normally.  However, the delays can be large, 30-60 seconds.
> >>>>>>>'Hang' is defined as: system commands like qsub and qstat
> >>>>>>>will delay until the job has finished migrating to the 't'
> >>>>>>>status.  Sometimes the delays are long enough to get GDI
> >>>>>>>failures.  Since qmaster is threaded, I wonder why I get the
> >>>>>>>hangs.
> >>>>>>>
> >>>>>>>I have tried debugging the situation.  Either the hang is in
> >>>>>>>qmaster, or sge_schedd is not printing enough information.
> >>>>>>>Here is some of the text from the sge_schedd debug for a 256
> >>>>>>>cpu job using a cluster queue.
> >>>>>>>
> >>>>>>> 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129 R=slots U=2.000000
> >>>>>>> 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130 R=slots U=2.000000
> >>>>>>> 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
> >>>>>>> 79350   7886 16384     Found NOW assignment
> >>>>>>> 79351   7886 16384     reresolve port timeout in 536
> >>>>>>> 79352   7886 16384     returning cached port value: 536
> >>>>>>>scheduler tries to schedule job 179999.1 twice
> >>>>>>> 79353   7886 16384        added 0 ticket orders for queued jobs
> >>>>>>> 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
> >>>>>>> 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
> >>>>>>> 79356   7886 16384     reresolve port timeout in 536
> >>>>>>> 79357   7886 16384     returning cached port value: 536
> >>>>>>> 79358   7886 16384     ec_get retrieving events - will do max 3 fetches
> >>>
> >>>>>>>The hang happens after line 79352.  In this instance the
> >>>>>>>message indicates the scheduler tried twice.  Other times, I
> >>>>>>>get a timeout at this point.  In either case, the output
> >>>>>>>pauses in the same manner that a call to qsub or qstat would.
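
A simple way to put a number on that pause, purely as a sketch: run
the loop below in a second shell while a large job starts, and watch
the timings jump during the 'qw' -> 't' transition.

    # each qstat is a GDI request; it stalls while qmaster is blocked
    while true; do time qstat > /dev/null; sleep 1; done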
> >>>>>>>
> >>>>>>>I have followed the optimization procedures listed on the
> >>>>>>>website and they didn't seem to help (might have missed some
> >>>>>>>though).
> >>>>>>>
> >>>>>>>I don't have any information from sge_qmaster.  I tried
> >>>>>>>several different SGE_DEBUG_LEVEL settings, but sge_qmaster
> >>>>>>>would always stop providing information after daemonizing.
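
The debug output stops because it goes to the controlling terminal,
which is lost once the daemon detaches.  One possible workaround,
sketched under the assumption that this build honors the SGE_ND
("no daemonize") environment variable and ships the usual dl.sh
helper:

    # pick a debug level and keep qmaster in the foreground
    . $SGE_ROOT/util/dl.sh
    dl 1
    SGE_ND=1 $SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_qmaster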
> >>>
> >>>>>>>System configuration:
> >>>>>>>
> >>>>>>>Qmaster runs on Fedora Core 2, x86 (2.2 GHz Xeon);
> >>>>>>>clients (execd) run on SuSE 9.1, x86_64 (3.2 GHz EM64T).
> >>>>>>>SGE is configured to use old style spooling over NFS.
> >>>>>>>
> >>>>>>>I can provide more info, I just don't know where to go from
> >>>>>>>here.
> >>>
> >>>>>>>Thanks,
> >>>
> >>=== message truncated ===
> >>
> >>
> >>
> >>
> >>
> > 
> > 
> > 
> > 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



