[GE users] Long delay when submitting large jobs

Craig Tierney ctierney at hpti.com
Tue Jan 18 01:24:57 GMT 2005


On Mon, 2005-01-17 at 05:40, Stephan Grell - Sun Germany - SSG -
Software Engineer wrote:
> 

> Is there a reason for not using local BDB spooling? During job start
> a lot of objects are modified, and they are all spooled....

I reinstalled SGE temporarily using BDB to see if that
would improve startup times.  It took about 75 seconds
for a 512 processor job to transition from 'qw' to 't'.

The server was running BDB, and it was installed on the local disk.
The $SGE_ROOT/default directory was still on NFS because the other
nodes do not have disks.  However, the I/O to that filesystem from
the clients is small.
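I don't know the qmaster spooling code, but as a back-of-the-envelope
illustration of why spooling every modified object with its own
synchronous commit would scale badly with job size, here is a toy
sketch using sqlite3 as a stand-in for the BDB spool (the schema and
object count are made up):

```python
import os
import sqlite3
import tempfile
import time

# Hypothetical stand-in for the spool database; not SGE's actual schema.
path = os.path.join(tempfile.mkdtemp(), "spool.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, state TEXT)")

# One commit per modified object: the cost grows with the number of
# objects touched at job start (one synchronous write each).
t0 = time.time()
for i in range(512):
    db.execute("INSERT INTO objects VALUES (?, ?)", (i, "t"))
    db.commit()
per_object = time.time() - t0

# Batching all updates for one job start into a single transaction
# amortizes the synchronous-write cost.
db.execute("DELETE FROM objects")
db.commit()
t0 = time.time()
for i in range(512):
    db.execute("INSERT INTO objects VALUES (?, ?)", (i, "t"))
db.commit()
batched = time.time() - t0
```

On a real local disk the per-object variant pays one sync per object,
so the gap would only widen as the job gets larger.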

I ran strace on one of the sge_qmaster processes to try to
see what is going on.  I can't pick out exactly what is happening,
but I did see that /etc/hosts was mmap'ed once for each
node.  I know that /etc/hosts should be cached, so I don't see
why gethostbyname (or whichever function it is) needs to be called
directly for each host.  The file shouldn't be changing during a
job startup.
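For what it's worth, the pattern I would have expected is
resolve-once-and-cache rather than one lookup per node.  A minimal
sketch of that idea (my own illustration, not qmaster code):

```python
import functools
import socket

@functools.lru_cache(maxsize=None)
def resolve(hostname):
    """Resolve a hostname once and reuse the answer.

    Without a cache like this, every lookup can re-read /etc/hosts,
    which matches the one-mmap-per-node pattern in the strace output.
    """
    return socket.gethostbyname(hostname)
```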

There were many other mmap/munmap calls as well as calls to
gettimeofday.  However, I couldn't correlate it to exactly what
it was doing.

When qmaster starts up a job, does it talk to each host, one by
one, setting up the job information?  The scheduler actually picks
the nodes used, correct?  If qmaster is talking to each node,
is it done serially or are multiple requests sent out simultaneously?
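If the per-host setup really is serial, startup time grows linearly
with node count, whereas overlapping the round trips bounds it by
roughly the slowest host.  A toy comparison (notify_host is a made-up
stand-in for one qmaster-to-execd exchange, not an SGE call):

```python
import concurrent.futures
import time

def notify_host(host):
    # Stand-in for one qmaster -> execd round trip; a real call
    # would block on the network instead of sleeping.
    time.sleep(0.01)
    return host

hosts = ["e%04d" % n for n in range(64)]

# Serial fan-out: total time is roughly 64 x one round trip.
t0 = time.time()
serial = [notify_host(h) for h in hosts]
serial_time = time.time() - t0

# Overlapped fan-out: round trips run concurrently, so total time
# stays close to a couple of round trips regardless of node count.
t0 = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    parallel = list(pool.map(notify_host, hosts))
parallel_time = time.time() - t0
```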

Thanks,
Craig





> 
> The execd should also spool locally. What is the reason for not doing
> it?
> >   
> > > 5) any staging activity between master and compute nodes?
> > >     
> > No.
> > 
> > I don't care if my job takes 10 minutes to start.  That isn't
> > the problem.  It is that the batch system hangs during this time.
> > That it should not do.  It is not dependent on the type of job,
> > just the number of CPUs (nodes) used.
> > 
> > Thanks,
> > Craig
> > 
> > 
> > 
> > 
> >   
> > > regards
> > > 
> > > 
> > >     
> > It has nothing to do with the binary.  This is the time
> > before the job script is actually launched.  I don't even
> > think this time covers the prolog/epilog execution.  My
> > prolog/epilog can run long (touches all nodes in parallel), but
> > the batch system shouldn't be waiting on that.
> > 
> > Craig
> > 
> > 
> > 
> >   
> > > On Fri, 14 Jan 2005 11:29:58 -0700, Craig Tierney <ctierney at hpti.com> wrote:
> > >     
> > > > I have been running SGE6.0u1 for a few months now on a new system.
> > > > I have noticed very long delays, or even SGE hangs, when starting
> > > > large jobs.  I just tried this on the latest CVS source and
> > > > the problem persists.
> > > > 
> > > > It appears that the hang occurs while the job is moved from 'qw'
> > > > to 't'.  In general the system does continue to operate normally.
> > > > However, the delays can be large, 30-60 seconds.  By 'hang' I mean
> > > > that system commands like qsub and qstat stall until the job
> > > > has finished migrating to the 't' state.  Sometimes the delays
> > > > are long enough to cause GDI failures.  Since qmaster is threaded,
> > > > I wonder why I get the hangs.
> > > > 
> > > > I have tried debugging the situation.  Either the hang is in qmaster,
> > > > or sge_schedd is not printing enough information.
> > > > 
> > > > Here is some of the text from the sge_schedd debug for a 256 cpu job
> > > > using a cluster queue.
> > > > 
> > > >  79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129 R=slots U=2.000000
> > > >  79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130 R=slots U=2.000000
> > > >  79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
> > > >  79350   7886 16384     Found NOW assignment
> > > >  79351   7886 16384     reresolve port timeout in 536
> > > >  79352   7886 16384     returning cached port value: 536
> > > > scheduler tries to schedule job 179999.1 twice
> > > >  79353   7886 16384        added 0 ticket orders for queued jobs
> > > >  79354   7886 16384     SENDING 10 ORDERS TO QMASTER
> > > >  79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
> > > >  79356   7886 16384     reresolve port timeout in 536
> > > >  79357   7886 16384     returning cached port value: 536
> > > >  79358   7886 16384     ec_get retrieving events - will do max 3 fetches
> > > > 
> > > > The hang happens after line 79352.  In this instance the message
> > > > indicates the scheduler tried to schedule the job twice.  Other
> > > > times, I get a timeout at this point.  In either case, the output
> > > > pauses in the same manner that a call to qsub or qstat would.
> > > > 
> > > > I have followed the optimization procedures listed on the website
> > > > and they didn't seem to help (might have missed some though).
> > > > 
> > > > I don't have any information from sge_qmaster.  I tried several
> > > > different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
> > > > stop providing information after daemonizing.
> > > > 
> > > > System configuration:
> > > > 
> > > > Qmaster runs on Fedora Core 2, x86, (2.2 Ghz Xeon)
> > > > clients (execd) run on Suse 9.1 x86_64, (3.2 Ghz EM64T)
> > > > SGE is configured to use old style spooling over NFS
> > > > 
> > > > I can provide more info, I just don't know where to go from here.
> > > > 
> > > > Thanks,
> > > > Craig
> > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > > 
> > > > 
> > > >       
> > 

