[GE users] Long delay when submitting large jobs

Andy Schwierskott andy.schwierskott at sun.com
Tue Feb 15 13:33:35 GMT 2005


Bogdan,

>> There might be a misunderstanding about qmaster involvement
>
> As I wrote, I didn't look at the source for quite some time, so I did
> not want to give precise examples ;-)
>
> But you did...
>
>>     The root of the problem in my opinion is that for every task the
>>     whole chain has to be executed: calling "qrsh -inherit", connecting
>>     to the execd, the execd starting the shepherd, which in turn starts
>>     the rshd (or sshd), which in turn starts the parallel task.
>
> and this is exactly what I would include in the description "a simple
> loop over some blocking calls". Apart from the involved procedure, the
> 'qrsh -inherit' calls should not happen linearly. With today's setup,
> both MPICH and LAM-MPI jobs are started in this linear fashion, simply
> because SGE does not offer a multi-spawn utility/API.
>
> This also leads to waste of CPU time, which is contrary to the purpose
> of a resource manager IMHO. If starting a process on a node takes 1
> second, for 100 nodes it will take 100 seconds - that's 100 seconds
> when 100 nodes are doing (almost) nothing; it's only after all
> processes have started that the parallel job passes through MPI_Init.
> Given that the slave nodes already know from the qmaster that the
> processes are allowed to run there, what's the point of contacting the
> execds one by one ? I certainly understand that there is no infinite
> scaling and therefore I don't expect that the job will start in 1
> second independent of the number of nodes, but I strongly believe that
> something under 10 seconds can be achieved for this particular
> example - and that's already one order of magnitude faster.

Notifying 400 execds now takes less than 2 seconds!

Starting the "qrsh -inherit" calls in parallel is the task of the mprun
command (or whatever the parallel starter is) and cannot be solved by SGE
itself.
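
To illustrate what "in parallel" would mean on the starter side, here is a
minimal Python sketch. It is not part of SGE or of any MPI starter: the
helper names (qrsh_command, start_tasks) are made up for this example, and
it assumes the call form "qrsh -inherit <host> <command>" inside a tightly
integrated parallel job.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def qrsh_command(host, task):
    # Assumed call form: qrsh -inherit <host> <command> [args...]
    return ["qrsh", "-inherit", host] + list(task)


def start_tasks(hosts, task, build_command=qrsh_command):
    """Start one task per host concurrently.

    Returns a list of (host, exit_code) pairs in the order of `hosts`.
    """
    hosts = list(hosts)
    if not hosts:
        return []

    def run(host):
        # Each call blocks until the remote task exits, so every host
        # gets its own thread; otherwise later tasks would wait for
        # earlier ones to finish.
        return host, subprocess.call(build_command(host, task))

    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(run, hosts))
```

A starter could call start_tasks(hosts, ["./a.out"]) with the host list
taken, for example, from $PE_HOSTFILE. The build_command parameter is only
there so that the launch line can be replaced, e.g. for testing without a
cluster.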

The overhead for a single task is itself not lightweight and becomes
unacceptable if tasks have to be started repeatedly or if the run time of a
single task is very short.
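
A back-of-the-envelope model of this, using the 1 second per task figure
from the example quoted above (an illustrative number, not a measurement).
The function names are made up for this sketch, and the parallel case is
idealized: it assumes a fixed number of tasks can be started at once with
no extra cost.

```python
import math


def sequential_startup(n_tasks, overhead_s):
    # Tasks are started one after another, so the overheads add up.
    return n_tasks * overhead_s


def parallel_startup(n_tasks, overhead_s, width):
    # Idealized: `width` tasks start at once, in ceil(n_tasks / width) rounds.
    return math.ceil(n_tasks / width) * overhead_s


def overhead_share(overhead_s, runtime_s):
    # Fraction of a task's wall-clock time spent on startup overhead.
    return overhead_s / (overhead_s + runtime_s)


print(sequential_startup(100, 1.0))    # 100.0
print(parallel_startup(100, 1.0, 10))  # 10.0
print(overhead_share(1.0, 1.0))        # 0.5
print(overhead_share(1.0, 3600.0))     # well below 0.1 %
```

With a 1 second overhead, a task that itself runs for only 1 second spends
half of its wall-clock time on startup, while for a one-hour task the same
overhead is negligible.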

I don't yet understand why you think that a "multi-spawn utility/API" in
SGE would improve anything?

Andy

> Bogdan Costescu
>
> IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
> Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
> Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
> E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel: +49 (0)941 3075-200 (x60200)
N1 Grid Engine Engineering  Fax: +49 (0)941 3075-222 (x60222)
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



