[GE users] Long delay when submitting large jobs
reuti at staff.uni-marburg.de
Mon Feb 7 15:43:46 GMT 2005
Sean Dilda wrote:
> Rayson Ho wrote:
>>> It isn't fixed though. The qmaster is serializing tasks and
>>> blocking on communication to execd's. If you want a system
>>> to scale, it shouldn't block.
>> Agreed... that's why I put quotes around "fixes" in the other mail.
>> IMO, there are 2 ways to fix it:
>> 1) let 1 thread start SGE's rshd for tight PE jobs, and the other one
>> handle qstat, qsub, etc. I know qmaster is threaded, but I don't know
>> how we currently use the threads.
>> 2) the other way is to add a new layer of software, so that tight PE
>> and non-tight PE jobs are started the same way from the qmaster's
>> point of view. The new layer is like LSF's PAM or PBS/Torque's
>> mpiexec, which starts the slave parallel tasks on the first execution
>> host.
>> I played with integrating SGE and mpiexec, I sent this mail to the
>> list in 2003:
>> But it still relies on qmaster to start the rshds on the slave nodes. In
>> order to fix the long delay problem, we need to:
>> - skip the code in qmaster to start the rshds for tight PE jobs
>> - let mpiexec get the list of hosts, and start the parallel tasks
>> using the TM (Task Management) interface.
> Perhaps I'm missing something, but isn't the whole point of
> tight-integration to allow SGE to control the slaves, and thus get
> accounting information and be able to properly shut them down, etc? It
> sounds to me like what you're suggesting would break tight-integration.
IMO control and accounting are two separate points. Best is of course to
have both, but with control alone you can at least shut down the jobs
properly, without leaving any orphaned processes. Whether correct
accounting is necessary depends on your SGE installation. Does accounting
work with any daemon-based startup of a parallel environment, when the
daemon forks off and qrsh returns? - Reuti
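
To illustrate the accounting question, here is a minimal sketch in plain
POSIX terms. It is not SGE code and makes no claim about how SGE collects
usage. It only assumes that a starter sees resource usage through
wait()/getrusage(), and shows that a worker which detaches itself (its
intermediate parent exits without waiting, much like qrsh returning after
the daemon has forked off) is no longer charged to that starter. The
names burn_cpu and cpu_charged_to_parent are made up for this example.

```python
import os
import resource
import time


def burn_cpu(seconds):
    """Busy-loop until this process has used about `seconds` of CPU time."""
    start = time.process_time()
    while time.process_time() - start < seconds:
        pass


def cpu_charged_to_parent(daemonize, work=0.2):
    """Fork a worker that burns `work` seconds of CPU and return the
    children's CPU time the calling process can see after waiting."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    pid = os.fork()
    if pid == 0:
        if daemonize and os.fork() > 0:
            # Intermediate process: exits at once without waiting for the
            # worker, like a starter command returning after the daemon
            # has forked off. The worker is then reparented away from us.
            os._exit(0)
        burn_cpu(work)  # the worker itself (child, or detached grandchild)
        os._exit(0)
    os.waitpid(pid, 0)  # reap the direct child only
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)


if __name__ == "__main__":
    print("worker waited for: %.2f s" % cpu_charged_to_parent(False))
    print("worker daemonized: %.2f s" % cpu_charged_to_parent(True))
```

In the first case the caller is charged roughly the CPU time the worker
used. In the second it sees close to nothing, although the same work was
done. Any accounting that relies only on this kind of parent/child
bookkeeping would miss the detached daemon.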