[GE users] Startup times and other issues with 6.0u3

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon Mar 21 10:25:30 GMT 2005


Brian,

Sorry for the late reply. I am a bit shocked to read about a 7-10 minute
startup time. The u4 fix might help, but I have not seen delays longer than
1 minute, even with much bigger jobs.

You could use qping -dump to monitor the traffic between the qmaster and
the execd. It shows, with time stamps, which client sent what and when.
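
To sift through a saved capture, a minimal Python sketch like the one below
could report the largest pauses between consecutive time-stamped lines. It
assumes only that each line contains a wall-clock stamp such as HH:MM:SS; the
actual qping output layout is not shown in this thread, so the pattern and the
names parse_stamp and largest_gaps are illustrative, not part of Grid Engine.

```python
import re

# Assumption: every traffic line carries a time-of-day stamp like 10:25:30,
# optionally with fractional seconds. Adjust the pattern to the real output.
STAMP = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?")


def parse_stamp(line):
    """Return seconds since midnight for the first stamp in line, or None."""
    m = STAMP.search(line)
    if m is None:
        return None
    hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
    fraction = float("0." + m.group(4)) if m.group(4) else 0.0
    return hours * 3600 + minutes * 60 + seconds + fraction


def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line_before, line_after) tuples,
    sorted with the largest gap first. Lines without a stamp are skipped."""
    stamped = []
    for line in lines:
        t = parse_stamp(line)
        if t is not None:
            stamped.append((t, line.rstrip("\n")))

    gaps = []
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        delta = t1 - t0
        if delta < 0:
            # The capture ran past midnight, so wrap around one day.
            delta += 24 * 3600
        gaps.append((delta, before, after))

    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

With the output saved to a file (the name is hypothetical), calling
largest_gaps(open("qping_dump.txt")) would list the pairs of messages with the
longest silence between them, which is where a multi-minute startup stall
should show up.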

Could you post the output captured on an otherwise empty cluster while an MPI
job is starting?

Stephan

Brian R Smith wrote:

>Reuti,
>
>Right, 't' only tells me what nodes have been allocated to run the job.  
>The job does not start until 'r'.  Makes perfect sense.
>However, I can attest to the 7-10 minute wait times.  When 
>tight-integration is turned off, processes start up within a couple of 
>seconds (plus the time it takes for the scheduler to "make its rounds").
>
>Brian
>
>Reuti wrote:
>
>>Hi Brian,
>>
>>the status 't' is *not* a real-time indication of whether the job is generating any
>>CPU load. But I must admit that I have only seen a delay of about 1-2 minutes
>>before it changed to 'r'. Maybe it's related to the PE startup delay in u3.
>>
>>Once the job is started, it may already be working although the status is still
>>'t'. It is more informative to look at the CPU usage on the node with "top" or "ps
>>-e f -o pid,time,command".
>>
>>CU - Reuti
>>
>>Quoting Brian R Smith <brian at cypher.acomp.usf.edu>:
>>
>>>Sean,
>>>
>>>That is exactly what happens: allocation occurs and the job waits in 't'
>>>state for 7-10 minutes.  I've re-enabled "control slaves" because I
>>>figured I could live with this problem until u4 comes out (not that many
>>>people run 42-node, cluster-spanning jobs).  My big concern right now is
>>>with running MM5 under SGE, as there seem to be some problems with
>>>message passing.
>>>
>>>Brian
>>>
>>>Sean Dilda wrote:
>>>
>>>>Ron Chen wrote:
>>>>
>>>>>--- Brian R Smith <brian at cypher.acomp.usf.edu> wrote:
>>>>>
>>>>>>You are absolutely the man.  Setting "control
>>>>>>slaves" to false fixed all of my problems.
>>>>>>
>>>>>No, it is not fixing anything!
>>>>>
>>>>>Setting "control slaves" to false means non-tight integration, so you
>>>>>won't get process control/accounting of the slave MPI
>>>>>tasks.
>>>>>
>>>>>In SGE 6 update 4, the slow start problem was fixed.
>>>>>But the original problem was that starting a 400-node
>>>>>parallel job with tight integration took several tens
>>>>>of seconds or so. In your case it takes 10
>>>>>minutes! So there is still something going on with
>>>>>your configuration.
>>>>>
>>>>I've seen delays on the order of 5 minutes with 30- and 40-CPU jobs
>>>>that I believe are related to the bug that's fixed in u4.  I think the
>>>>people who only saw 10 or 20 second delays were lucky.
>>>>
>>>>Brian, when you say delay, what do you mean?  Is the job allocated 
>>>>nodes, but sitting in 't' state for 10 minutes before it switches to 
>>>>'r'?  If so, then it does sound like the bug that will be fixed when
>>>>u4 comes out.  However, Ron is right.  Turning off control slaves 
>>>>doesn't "fix" it, unless you don't care about tight-integration.
>>>>
>>>>---------------------------------------------------------------------
>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net

