[GE users] Startup times and other issues with 6.0u3

Brian R Smith brian at cypher.acomp.usf.edu
Sat Mar 19 15:51:39 GMT 2005


Reuti,

Right, 't' only tells me what nodes have been allocated to run the job.  
The job does not start until 'r'.  Makes perfect sense.
However, I can attest to the 7-10 minute wait times.  When 
tight-integration is turned off, processes start up within a couple of 
seconds (plus the time it takes for the scheduler to "make its rounds").
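
A rough sketch for putting a number on that wait, in case it helps when comparing against u4 later. It assumes the job state is the fifth whitespace-separated column of plain "qstat" output; job_state and time_to_running are made-up helper names, not SGE tools.

```shell
#!/bin/sh
# Sketch only: time how long a job sits in 't' before it shows 'r'.
# Assumption: the state is the 5th column of plain qstat output.

# Print the state of job $1, reading qstat-style text on stdin.
job_state() {
    awk -v id="$1" '$1 == id { print $5; exit }'
}

# Poll qstat every $2 seconds (default 5) until job $1 shows 'r',
# then print the elapsed seconds. Returns 1 if the job is no longer listed.
time_to_running() {
    start=$(date +%s)
    while :; do
        state=$(qstat | job_state "$1")
        [ "$state" = "r" ] && break
        [ -z "$state" ] && return 1
        sleep "${2:-5}"
    done
    echo $(( $(date +%s) - start ))
}
```

Called as "time_to_running <job-id>" right after submission. While the job still shows 't', Reuti's "ps -e f -o pid,time,command" on the node tells whether it is already using CPU.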

Brian

Reuti wrote:

>Hi Brian,
>
>the status 't' is *not* a real-time indication of whether the job is generating 
>any CPU load. But I must admit that I saw only a delay of about 1-2 minutes 
>before it changed to 'r'. Maybe it's related to the PE startup delay in u3.
>
>Once the job is started, it may already be working although the status is 
>'t'. It is more informative to look at the CPU usage on the node with "top" or 
>"ps -e f -o pid,time,command".
>
>CU - Reuti
>
>Quoting Brian R Smith <brian at cypher.acomp.usf.edu>:
>
>>Sean,
>>
>>That is exactly what happens: allocation occurs and the job waits in 't' 
>>state for 7-10 minutes.  I've re-enabled "control slaves" because I 
>>figured I could live with this problem till u4 comes out (not that many 
>>people run 42-node, cluster-spanning jobs).  My big concern right now is 
>>with running MM5 under SGE, as there seem to be some problems with 
>>message passing.
>>
>>Brian
>>
>>Sean Dilda wrote:
>>
>>>Ron Chen wrote:
>>>
>>>>--- Brian R Smith <brian at cypher.acomp.usf.edu> wrote:
>>>>
>>>>>You are absolutely the man.  Setting "control
>>>>>slaves" to false fixed all of my problems.
>>>>
>>>>No, it is not fixing anything!
>>>>
>>>>Setting "control slaves" to false means non-tight integration, so
>>>>you won't get process control/accounting of the slave MPI
>>>>tasks.
>>>>
>>>>In SGE 6 update 4, the slow start problem was fixed.
>>>>But the original problem was that starting a 400-node
>>>>parallel job with tight integration takes several tens
>>>>of seconds or so. But in your case it takes 10
>>>>minutes! So there is still something going on with
>>>>your configuration.
>>>>
>>>I've seen delays on the order of 5 minutes with 30- and 40-CPU jobs 
>>>that I believe are related to the bug that's fixed in u4.  I think the 
>>>people who only saw 10 or 20 second delays were lucky.
>>>
>>>Brian, when you say delay, what do you mean?  Is the job allocated 
>>>nodes, but sitting in 't' state for 10 minutes before it switches to 
>>>'r'?  If so, then it does sound like the bug that will be fixed when 
>>>u4 comes out.  However, Ron is right.  Turning off control slaves 
>>>doesn't "fix" it, unless you don't care about tight-integration.
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net