[GE users] Long delay when submitting large jobs

Sean Dilda agrajag at dragaera.net
Mon Feb 7 15:21:31 GMT 2005

Rayson Ho wrote:
>>It isn't fixed though.  The qmaster is serializing tasks and
>>blocking on communication to execd's.  If you want a system
>>to scale, it shouldn't block.
> Agreed... that's why I put quotes around "fixes" in the other mail.
> IMO, there are 2 ways to fix it:
> 1) let one thread start SGE's rshd for tight PE jobs, and another
> handle qstat, qsub, etc. I know qmaster is threaded, but I don't know how
> we currently use the threads.
> 2) the other way is to add a new layer of software, so that tight PE and
> non-tight PE jobs look the same to the qmaster at start time. The new layer is
> like LSF's PAM or PBS/Torque's mpiexec, which starts the slave parallel
> tasks on the first execution host.
> I played with integrating SGE and mpiexec, and sent this mail to the mpiexec
> list in 2003:
> http://email.osc.edu/pipermail/mpiexec/2003/000521.html
> But it still relies on qmaster to start the rshds on the slave nodes. In
> order to fix the long delay problem, we need to:
> - skip the code in qmaster to start the rshds for tight PE jobs
> - let mpiexec get the list of hosts, and start the parallel tasks 
>   using the TM (Task Management) interface.

Perhaps I'm missing something, but isn't the whole point of 
tight-integration to allow SGE to control the slaves, and thus get 
accounting information and be able to properly shut them down, etc.?  It 
sounds to me like what you're suggesting would break tight-integration.
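
For what it's worth, the blocking concern in the quoted text can be 
shown with a toy sketch.  This is not qmaster code, and it says nothing 
about how qmaster actually uses its threads.  The names start_slave, 
start_serial and start_concurrent are made up for illustration, and the 
sleep stands in for waiting on a remote execd.  The controlling process 
still gets the full list of started tasks back in both cases, so moving 
the blocking calls off the main path does not by itself mean giving up 
control of the slaves.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def start_slave(host, delay=0.05):
    """Hypothetical stand-in for one blocking request to an execution host."""
    time.sleep(delay)  # simulates waiting for the remote daemon to answer
    return host


def start_serial(hosts, delay=0.05):
    """Start slaves one at a time; total wait grows with the host count."""
    return [start_slave(h, delay) for h in hosts]


def start_concurrent(hosts, delay=0.05, workers=8):
    """Hand the blocking calls to a thread pool so the waits overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the caller still knows
        # exactly which slaves were started.
        return list(pool.map(lambda h: start_slave(h, delay), hosts))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    hosts = ["node%d" % i for i in range(8)]
    _, t_serial = timed(start_serial, hosts)
    _, t_conc = timed(start_concurrent, hosts)
    print("serial: %.2fs  concurrent: %.2fs" % (t_serial, t_conc))
```

With eight hosts the serial version waits for each request in turn, 
while the pooled version waits roughly as long as one request.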
