[GE users] Long delay when submitting large jobs

Andy Schwierskott andy.schwierskott at sun.com
Tue Feb 15 11:26:09 GMT 2005


>> And if we have the TM library available for SGE now, what kind of changes
>> would be needed to do what you sent to the dev last year??
> No change, the main point was the possibility to spawn tasks on slave
> nodes without the whole rsh/rshd overhead and usage of random ports
> and, as you already wrote, tm_spawn() can do this.
> I would like however to emphasize one thing that is very important for
> parallel jobs: tm_spawn() or some similar functionality should be
> implemented efficiently, to allow starting up of a parallel job on
> hundreds of nodes - a simple loop over some blocking calls (e.g.
> waiting for start-up confirmation) should be available for those that
> do not want a flurry of network activity on the master node, but a
> smart routine that sends lots of start-up requests and then waits for
> confirmations should be the default. This might mean some changes to
> the way the job information is kept in SGE, I haven't looked at the
> source since March-April 2004...
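
To make the two start-up strategies from the quoted paragraph concrete, here
is a small simulation. It is only a sketch: start_task() is a hypothetical
stand-in for one remote start-up (request plus confirmation), the latency
values are made up, and nothing here is SGE or TM code.

```python
import asyncio
import time


async def start_task(host, latency=0.01):
    """Hypothetical stand-in: send one start-up request, await its confirmation."""
    await asyncio.sleep(latency)  # simulated network round trip
    return host


async def start_sequential(hosts, latency=0.01):
    """Simple loop over blocking calls: only one request is outstanding at a time."""
    confirmed = []
    for host in hosts:
        confirmed.append(await start_task(host, latency))
    return confirmed


async def start_pipelined(hosts, latency=0.01, max_in_flight=64):
    """Send many requests first, then collect the confirmations.

    max_in_flight caps the number of outstanding requests so the master
    node is not flooded with network activity.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(host):
        async with gate:
            return await start_task(host, latency)

    # gather() returns the results in the order of the hosts passed in.
    return await asyncio.gather(*(guarded(h) for h in hosts))


def timed(coro_fn, *args, **kwargs):
    """Run one strategy to completion and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = asyncio.run(coro_fn(*args, **kwargs))
    return result, time.perf_counter() - t0
```

With a few hundred hosts, timed(start_sequential, hosts) grows with the number
of hosts times the latency, while timed(start_pipelined, hosts) grows roughly
with the number of hosts divided by max_in_flight.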

There might be a misunderstanding about qmaster involvement. Two different
things are being discussed here: the quick spawning of new tasks, and the
involvement of the master node in starting new tightly integrated tasks.

1. qmaster is involved only once during the startup of a PE job. At the
    beginning, when qmaster gets the order from the scheduler to start the PE
    job, it sends a one-time notification to all execds so that they accept
    the tasks. There was an error in the implementation that resulted in an
    N**2 cost in qmaster when sending out these notifications. This problem
    is now fixed and will be part of the next patch; it can already be tried
    out today as described in Stephan's mail.

2. Overhead for a single "qrsh -inherit" - this becomes expensive for very
    short parallel tasks and for repeatedly started parallel tasks which
    connect to the same hosts, resulting in the known problem that e.g. a
    qmake job can take significantly longer than a non-batch parallel job.

    The root of the problem, in my opinion, is that the whole chain has to be
    executed for every task: calling "qrsh -inherit", connecting to the
    execd, the execd starting the shepherd, which in turn starts the rshd (or
    sshd), which in turn starts the parallel task.
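
A rough cost model shows why this chain hurts short tasks in particular. The
per-stage costs below are invented placeholders, not measurements; only the
stage names come from the chain described above.

```python
# Hypothetical per-stage start-up costs in seconds (placeholders, not measured).
CHAIN_COST_SECONDS = {
    "qrsh -inherit": 0.25,
    "connect to execd": 0.25,
    "execd starts shepherd": 0.25,
    "shepherd starts rshd/sshd": 0.25,
}


def wall_time(n_tasks, task_seconds, startup_seconds):
    """Total time for n_tasks started one after another, each paying the full chain."""
    return n_tasks * (startup_seconds + task_seconds)


def overhead_fraction(task_seconds, startup_seconds):
    """Share of each task's wall time that is spent in the start-up chain."""
    return startup_seconds / (startup_seconds + task_seconds)
```

With startup = sum(CHAIN_COST_SECONDS.values()), overhead_fraction(0.1, startup)
is dominated by start-up, as for a short compile step in a qmake job, while
overhead_fraction(600.0, startup) is negligible for a long-running task.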

Technically there are certainly several approaches by which the shepherd can
be made the parent process of these child tasks without the need to execute
this chain for every single parallel start. The solution will no doubt
require a significant amount of effort due to the different requirements:
stdin/stdout redirection without the need for buffering, security aspects
(the ability to plug in e.g. ssh), transparent use for MPI, parallel make,
..., getting the accounting information, and being able to properly kill or
suspend the tasks.
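
As an illustration of what "being the parent process" buys, here is a minimal
POSIX-only sketch of a long-lived parent that starts child tasks directly,
suspends or kills them with signals, and reads accounting data when they
exit. TaskParent and its methods are hypothetical names; this is not how the
SGE shepherd is implemented, and it ignores remote start-up, security and MPI
integration entirely.

```python
import os
import signal
import sys


class TaskParent:
    """Hypothetical long-lived parent that starts child tasks directly."""

    def __init__(self):
        self.pids = set()

    def spawn(self, argv):
        # The child inherits stdin/stdout/stderr, so no buffering layer
        # sits between the task and the job's output.
        pid = os.spawnv(os.P_NOWAIT, argv[0], argv)
        self.pids.add(pid)
        return pid

    def suspend(self, pid):
        os.kill(pid, signal.SIGSTOP)

    def resume(self, pid):
        os.kill(pid, signal.SIGCONT)

    def kill(self, pid):
        os.kill(pid, signal.SIGKILL)

    def reap(self, pid):
        """Wait for one child; return (exit code, CPU seconds) for accounting.

        A child terminated by a signal is reported as minus the signal number.
        """
        _, status, usage = os.wait4(pid, 0)
        self.pids.discard(pid)
        if os.WIFSIGNALED(status):
            code = -os.WTERMSIG(status)
        else:
            code = os.WEXITSTATUS(status)
        return code, usage.ru_utime + usage.ru_stime
```

Because the parent is the direct parent of every task, os.wait4() hands back
the resource usage without any extra protocol, and suspend/kill reduce to
sending a signal to a known pid.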

