[GE users] Startup times and other issues with 6.0u3

Reuti reuti at staff.uni-marburg.de
Sat Mar 19 15:28:05 GMT 2005


Hi Brian,

the status 't' is *not* a real-time indicator of whether the job is generating 
any CPU load. But I must admit that I have only seen a delay of about 1-2 
minutes before it changed to 'r'. Maybe it's related to the PE startup delay in u3.

Once the job is started, it may already be working although the status is 
still 't'. It is more informative to look at the CPU usage on the node with 
"top" or "ps -e f -o pid,time,command".
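
To make the "ps" check less manual, here is a minimal Python sketch that parses
"ps -e -o pid,time,command" output and lists the processes that have already
accumulated CPU time. It assumes the Linux procps TIME format ([DD-]HH:MM:SS).
The function names and the sample process name "mm5.mpp" are hypothetical and
used only for illustration.

```python
import subprocess


def cpu_seconds(field):
    """Convert a ps TIME field ([DD-]HH:MM:SS or MM:SS) to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in field.split(":")]
    while len(parts) < 3:          # pad MM:SS to HH:MM:SS
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def busy_processes(ps_text, needle):
    """Return (pid, cpu_seconds, command) for matching processes with CPU time > 0."""
    rows = []
    for line in ps_text.splitlines()[1:]:   # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, time_field, command = fields
        used = cpu_seconds(time_field)
        if needle in command and used > 0:
            rows.append((int(pid), used, command))
    return rows


def ps_snapshot():
    """Run ps on the local node and return its text output."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid,time,command"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output, as it might look on a compute node.
SAMPLE = """\
  PID     TIME COMMAND
 4711 00:00:00 sge_shepherd-1234 -bg
 4720 00:03:12 mm5.mpp
"""
```

Calling busy_processes(ps_snapshot(), "mm5") on the node would list the
matching processes that have used CPU time, regardless of whether qstat still
shows the job in 't'.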

CU - Reuti

Quoting Brian R Smith <brian at cypher.acomp.usf.edu>:

> Sean,
> 
> That is exactly what happens: allocation occurs and the job waits in 't' 
> state for 7-10 minutes.  I've re-enabled "control slaves" because I 
> figured I could live with this problem till u4 comes out (not that many 
> people run 42-node, cluster-spanning jobs).  My big concern right now is 
> with running MM5 under SGE, as there seem to be some problems with 
> message passing.
> 
> Brian
> 
> Sean Dilda wrote:
> 
> > Ron Chen wrote:
> >
> >> --- Brian R Smith <brian at cypher.acomp.usf.edu> wrote:
> >>
> >>> You are absolutely the man.  Setting "control
> >>> slaves" to false fixed all of my problems.
> >>
> >>
> >>
> >> No, it is not fixing anything!
> >>
> >> Setting "control slaves" to false means non-tight integration, so
> >> you won't get process control/accounting of the slave MPI
> >> tasks.
> >>
> >> In SGE 6 update 4, the slow start problem was fixed.
> >> But the original problem was that starting a 400-node
> >> parallel job with tight integration takes several tens
> >> of seconds or something. But for your case it takes 10
> >> minutes! So there is still something going on with
> >> your configuration.
> >
> >
> > I've seen delays on the order of 5 minutes with 30 and 40-cpu jobs 
> > that I believe are related to the bug that's fixed in u4.  I think the 
> > people who only saw 10 or 20 second delays were lucky.
> >
> > Brian, when you say delay, what do you mean?  Is the job allocated 
> > nodes, but sitting in 't' state for 10 minutes before it switches to 
> > 'r' ?  If so, then it does sound like the bug that will be fixed when 
> > u4 comes out.  However, Ron is right.  Turning off control slaves 
> > doesn't "fix" it, unless you don't care about tight-integration.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
> 






