[GE users] Long delay when submitting large jobs

Stephan Grell stephan.grell at sun.com
Tue Jan 18 08:13:36 GMT 2005


Craig Tierney wrote:

>On Mon, 2005-01-17 at 05:40, Stephan Grell - Sun Germany - SSG -
>Software Engineer wrote:
>>Craig Tierney wrote:
>>>On Fri, 2005-01-14 at 12:16, Hung-sheng Tsao wrote:
>>>>Just trying to understand:
>>>>1) How big is the binary? I assume all binaries are NFS mounted?
>>>This is all before the job script is run, so it has nothing to
>>>do with the binary.
>>>
>>>>2)how big is the input files?
>>>See above.
>>Sorry, but I did not find it in the emails. Could you help me?
>
>Sorry, what I was trying to say is that the problem happens before
>the actual job starts, so I didn't think the binary size mattered.
>The job script could be an issue, but I thought it only ends up on
>the MASTER host, so it wouldn't be affected by the size of the
>binary or the script.
>
>The scripts are on the order of 500 bytes.
Oops, sorry, I did not mean the size of the script but the number of
slaves this parallel job has. I found that in your next email.

>>>>3)I assume the interconnect is gigabit, and the server has one
>>>>gigabit link supporting 128 nodes (256 CPUs) in a flat network?
>>>Each rack (22 nodes) is gigE to a switch, which is the uplinked
>>>to a master switch.
>>>
>>>>4)what is the storage? HW raid array?
>>>NFS currently.
>>In an earlier email you stated that you use classic spooling over
>>NFS. To put it simply: that is bad for performance. :-)
>
>I know, but why does the server block during these operations?
>What if I had 4000 nodes and BDB?  It would still block.
Yes, you are right. I also think this takes way too long, even for
the current implementation. Could you run the scheduler in profiling
mode and post the output to this mailing list? Enable it via
"qconf -msconf" by setting "params profile=1".
You wrote that you have a BDB spooling system as well; it would be
nice to get the output from both systems.
You can also use qping to get more information about the
communication layer: for how long it blocks and which threads are
blocked.
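For reference, a rough sketch of the commands involved (the qmaster
host name and port below are examples; check the settings under
$SGE_ROOT/$SGE_CELL/common for your installation):

```shell
# Enable scheduler profiling: "qconf -msconf" opens the scheduler
# configuration in $EDITOR; add profile=1 to the "params" line.
qconf -msconf

# Verify the change without an editor:
qconf -ssconf | grep params

# Query the qmaster's communication layer status (6444 is the
# common sge_qmaster port; the trailing 1 is the component id):
qping -info qmaster-host 6444 qmaster 1
```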

>
>>Is there a reason for not using local BDB spooling? During job
>>start a lot of objects are modified, and they are all spooled....
>
>Failover.  Is this the problem?  I can work on a new solution that
>uses HA failover instead of the standard mechanism.
>
>Is there a way to convert a NFS spool to a BDB spool?
>
>>The execd should also spool locally. What is the reason for not
>>doing it?
>
>I can try and configure this.  I don't have disks, but I could
>put it on ram disk and just throw it away.
Only the execd message files might be a problem.
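In case it helps, a rough sketch of the ram disk idea (the paths and
host name are examples; execd_spool_dir lives in the host
configuration):

```shell
# On the compute node: a small tmpfs to hold the execd spool
mkdir -p /ram/sge_spool
mount -t tmpfs -o size=64m tmpfs /ram/sge_spool

# Point the execd at it: "qconf -mconf <host>" opens that host's
# configuration in $EDITOR; set execd_spool_dir to /ram/sge_spool,
# then restart the execd on that node.
qconf -mconf e0129
```

The caveat is what I said above: the execd message file would be
lost on every reboot.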

Cheers,
Stephan

>>>>5)any stageing activity between master and compute nodes?
>>>No.
>>>
>>>I don't care if my job takes 10 minutes to start.  That isn't
>>>the problem.  It is that the batch system hangs during this time.
>>>That it should not do.  It is not dependent on the type of job,
>>>just the number of cpus (nodes) used.
>>>
>>>Thanks,
>>>Craig
>>>
>>>>regards
>>>>
>>>It has nothing to do with the binary.  This is the time
>>>before the job script is actually launched.  I don't even
>>>think this time covers the prolog/epilog execution.  My
>>>prolog/epilog can run long (touches all nodes in parallel), but
>>>the batch system shouldn't be waiting on that.
>>>
>>>Craig
>>>
>>>>On Fri, 14 Jan 2005 11:29:58 -0700, Craig Tierney <ctierney at hpti.com> wrote:
>>>>>I have been running SGE6.0u1 for a few months now on a new system.
>>>>>I have noticed very long delays, or even SGE hangs, when starting
>>>>>large jobs.  I just tried this on the latest CVS source and
>>>>>the problem persists.
>>>>>
>>>>>It appears that the hang occurs while the job is moved from 'qw'
>>>>>to 't'.  In general the system does continue to operate normally.
>>>>>However the delays can be large, 30-60 seconds.  'Hang' means
>>>>>that system commands like qsub and qstat stall until the job has
>>>>>finished migrating to the 't' status.  Sometimes the delays are
>>>>>long enough to get GDI failures.  Since qmaster is threaded, I
>>>>>wonder why I get the hangs.
>>>>>
>>>>>I have tried debugging the situation.  Either the hang is in qmaster,
>>>>>or sge_schedd is not printing enough information.
>>>>>
>>>>>Here is some of the text from the sge_schedd debug for a 256 cpu job
>>>>>using a cluster queue.
>>>>>
>>>>> 79347   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0129 R=slots U=2.000000
>>>>> 79348   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0130 R=slots U=2.000000
>>>>> 79349   7886 16384     J=179999.1 T=STARTING S=1105726988 D=43200 L=Q O=qecomp.q at e0131 R=slots U=2.000000
>>>>> 79350   7886 16384     Found NOW assignment
>>>>> 79351   7886 16384     reresolve port timeout in 536
>>>>> 79352   7886 16384     returning cached port value: 536
>>>>>scheduler tries to schedule job 179999.1 twice
>>>>> 79353   7886 16384        added 0 ticket orders for queued jobs
>>>>> 79354   7886 16384     SENDING 10 ORDERS TO QMASTER
>>>>> 79355   7886 16384     RESETTING BUSY STATE OF EVENT CLIENT
>>>>> 79356   7886 16384     reresolve port timeout in 536
>>>>> 79357   7886 16384     returning cached port value: 536
>>>>> 79358   7886 16384     ec_get retrieving events - will do max 3 fetches
>>>>>
>>>>>The hang happens after line 79352.  In this instance the message
>>>>>indicates the scheduler tried twice.  Other times, I get a timeout
>>>>>at this point.  In either case, the output pauses in the same
>>>>>manner that a call to qsub or qstat would.
>>>>>
>>>>>I have followed the optimization procedures listed on the website
>>>>>and they didn't seem to help (might have missed some though).
>>>>>
>>>>>I don't have any information from sge_qmaster.  I tried several
>>>>>different SGE_DEBUG_LEVEL settings, but sge_qmaster would always
>>>>>stop providing information after daemonizing.
>>>>>
>>>>>System configuration:
>>>>>
>>>>>Qmaster runs on Fedora Core 2, x86, (2.2 Ghz Xeon)
>>>>>clients (execd) run on Suse 9.1 x86_64, (3.2 Ghz EM64T)
>>>>>SGE is configured to use old style spooling over NFS
>>>>>
>>>>>I can provide more info, I just don't know where to go from here.
>>>>>
>>>>>Thanks,
>>>>>Craig
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net





