[GE users] Long delay when submitting large jobs

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Feb 15 14:16:35 GMT 2005

On Tue, 15 Feb 2005, Andy Schwierskott wrote:

> Notifying 400 execd's now takes less than 2 seconds!

Yes, but notifying is different from actually starting the job. I
care less about the SGE internals than about the visible effects,
such as the time from the moment the job is dispatched until the
moment the parallel job is actually running. If, as in my example,
this takes 100 seconds for 100 nodes, that is annoying.

> Starting the "qrsh -inherit" in parallel is a task of the mprun
> command (or whatever the parallel starter is) and cannot be solved

Why? Can't the execds take a command to run, just as they take the
allowance for the job? Or can't 'qrsh -inherit' (or something like
it) receive a list of nodes and start remote processes on all of them
in the most efficient manner possible?
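
A minimal sketch of the "give it a list of nodes" idea, in Python. It
is only an illustration under assumptions: start_on_all() and
fake_launch() are hypothetical names, and fake_launch() stands in for
whatever really starts the remote process (for example a wrapper
around 'qrsh -inherit'). Nothing here is an existing SGE interface.

```python
from concurrent.futures import ThreadPoolExecutor


def start_on_all(hosts, launch, max_parallel=32):
    """Call launch(host) for every host, at most max_parallel at a time.

    Returns a dict mapping each host to the value launch() returned.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map yields results in the same order as hosts.
        results = pool.map(launch, hosts)
        return dict(zip(hosts, results))


def fake_launch(host):
    # Hypothetical stand-in: a real version would start the remote
    # task on `host` and report success or failure.
    return f"started on {host}"


hosts = [f"node{i:03d}" for i in range(100)]
status = start_on_all(hosts, fake_launch)
```

With a launcher that blocks for about one second per node, the wall
time would be roughly len(hosts) / max_parallel seconds rather than
len(hosts) seconds, and each slave node still sees only one
connection.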

> The overhead itself for a single task is not lightweight

Are you talking about the overhead on the slave node or on the master
node? The master will be the only one that has to deal with making
parallel connections; the slave nodes will receive exactly one
connection each, which would not lead to overload IMHO.

Nevertheless, I'm interested in the parallel job starting as soon as
possible, so I'd rather have the 100 nodes work hard for 5-10 seconds
than have them do nothing for 100 seconds.

> I don't yet understand why you think that a "multi-spawn utility/API" in
> SGE would improve anything?

Because an integrated (scheduling/execution/monitoring/etc.) system
like SGE is the only place where this can be done properly. I can
already do a linear startup with a simple shell script from inside or
outside SGE. A proper multi-spawn utility/API needs daemons that are
already running on the nodes - SGE already offers

It might not be the best example, but you can look at LAM's way of
starting jobs, which has already solved exactly the same problem. It
takes some time (or maybe not, if using TM) at the beginning to start
the daemons, but then the parallel jobs are started (via mpirun) very
fast: mpirun talks to the local LAM daemon, which then broadcasts the
information to all the other daemons that should start processes.
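
To put rough numbers on why a daemon-to-daemon broadcast can beat a
linear startup, here is a small Python sketch. It assumes a tree-style
broadcast in which every daemon that already has the start request
forwards it to `fanout` further daemons per round. That scheme is an
assumption chosen for illustration, not a description of how LAM
actually implements its broadcast; rounds_needed() is a hypothetical
helper.

```python
def rounds_needed(n_targets, fanout=1):
    """Rounds until n_targets daemons have received the request.

    Round 0: only the originating daemon has it. In each round, every
    daemon that has the request forwards it to `fanout` new daemons.
    """
    informed = 1  # the originating daemon
    rounds = 0
    while informed - 1 < n_targets:
        informed += informed * fanout
        rounds += 1
    return rounds


linear_rounds = 100                     # one node contacted per round
doubling_rounds = rounds_needed(100)    # fanout 1: informed set doubles
wider_rounds = rounds_needed(100, 2)    # fanout 2: informed set triples
```

Under these assumptions, 100 nodes are reached in 7 rounds with
fanout 1 and in 5 rounds with fanout 2, compared with 100 rounds when
the master contacts the nodes one after another.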

Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
