[GE users] Long delay when submitting large jobs

Reuti reuti at staff.uni-marburg.de
Mon Feb 7 15:43:46 GMT 2005



Sean Dilda wrote:
> Rayson Ho wrote:
> 
>>> It isn't fixed though.  The qmaster is serializing tasks and
>>> blocking on communication to execd's.  If you want a system
>>> to scale, it shouldn't block.
>>
>> Agreed... that's why I put quotes around "fixes" in the other mail.
>>
>> IMO, there are 2 ways to fix it:
>>
>> 1) let 1 thread to start SGE's rshd for tight PE jobs, and the other one to
>> handle qstat, qsub, etc. I know qmaster is threaded, but I don't know how
>> we currently use the threads.
>>
>> 2) the other way is to add a new layer of software, so that tight PE and
>> non-tight PE jobs are started the same way to the qmaster. The new layer is
>> like LSF's PAM or PBS/Torque's mpiexec, which starts the slave parallel
>> tasks on the first execution host.
>>
>> I played with integrating SGE and mpiexec, I sent this mail to the mpiexec
>> list in 2003:
>> http://email.osc.edu/pipermail/mpiexec/2003/000521.html
>>
>> But it still relies on qmaster to start the rshds on the slave nodes. In
>> order to fix the long delay problem, we need to:
>>
>> - skip the code in qmaster to start the rshds for tight PE jobs
>> - let mpiexec get the list of hosts, and start the parallel tasks
>>   using the TM (Task Management) interface.
>
> Perhaps I'm missing something, but isn't the whole point of 
> tight-integration to allow SGE to control the slaves, and thus get 
> accounting information and be able to properly shut them down, etc?  It 
> sounds to me like what you're suggesting would break tight-integration.

IMO control and accounting are two separate points. It is of course best to
have both, but with control alone you can at least shut the jobs down
properly, without leaving any orphaned processes. Whether correct accounting
is necessary depends on your SGE installation. Does accounting work at all
with a daemon-based startup of a parallel environment, where the daemon
forks off and qrsh returns? - Reuti
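
To make the "daemon forks off and the launcher returns" case concrete, here is a minimal POSIX-only sketch in Python. It is not SGE code: the function name launch_detached and the sleep standing in for real work are made up for illustration. It only shows that the launching call can return while the worker keeps running in its own session. Whether a scheduler can still account for such a process depends on how it tracks processes, which this sketch does not model.

```python
import os
import time


def launch_detached(work_seconds=0.5):
    """Start a worker via a double fork and return (worker_pid, worker_sid).

    Hypothetical helper for illustration only; POSIX systems only.
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # First child: start a new session, fork the real worker, then exit.
        try:
            os.close(read_fd)
            os.setsid()
            if os.fork() == 0:
                # Grandchild: report its pid and session id, then "work".
                msg = "%d %d" % (os.getpid(), os.getsid(0))
                os.write(write_fd, msg.encode())
                os.close(write_fd)
                time.sleep(work_seconds)  # stand-in for the real job
        finally:
            os._exit(0)  # never fall back into the caller's code
    # Launcher: its direct child exits almost immediately.
    os.close(write_fd)
    os.waitpid(child, 0)
    pid_text, sid_text = os.read(read_fd, 64).decode().split()
    os.close(read_fd)
    return int(pid_text), int(sid_text)


if __name__ == "__main__":
    pid, sid = launch_detached()
    # The launcher has returned, yet the worker is still alive elsewhere.
    print("worker pid:", pid)
    print("worker session:", sid, "launcher session:", os.getsid(0))
```

The launcher only ever waits for its direct child, so from its point of view the job is finished as soon as that child exits, even though the worker is still running.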

