[GE users] Long delay when submitting large jobs

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Mon Feb 14 22:02:21 GMT 2005

[ Sorry for the long delay, I was away last week... ]

On Tue, 8 Feb 2005, Reuti wrote:

> Some programs like Turbomole come with scripts, which will call
> mpirun many times during the iterations. In this case you would have
> to adjust the scripts of the program, and it would also not work out
> of the box.

Well, if they reference mpirun without the full path, the same trick 
as used now with rsh can be applied.
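
As a rough illustration of that trick (a sketch only, nothing here is 
SGE code): a wrapper named mpirun is put into a directory that comes 
first in PATH, so any script calling plain "mpirun" picks up the 
wrapper instead of the real binary. The helper name install_wrapper, 
the temporary directory and the wrapper body are all hypothetical; a 
real wrapper would hand the arguments over to the real starter.

```python
import os
import shutil
import stat
import subprocess
import tempfile


def install_wrapper(name: str, body: str, directory: str) -> str:
    """Write an executable /bin/sh wrapper called `name` into `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\n" + body + "\n")
    # Add execute permission for user, group and others.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


# Hypothetical wrapper directory; it must come first in the search path.
wrapper_dir = tempfile.mkdtemp(prefix="wrappers-")
wrapper = install_wrapper("mpirun", 'echo "wrapper got: $*"', wrapper_dir)
search_path = wrapper_dir + os.pathsep + os.environ.get("PATH", os.defpath)

# A script that calls "mpirun" without a full path now finds the wrapper.
resolved = shutil.which("mpirun", path=search_path)

# Run it the way such a script would (the arguments are made up).
result = subprocess.run(
    [resolved, "-np", "4", "./a.out"], capture_output=True, text=True
)
output = result.stdout.strip()
```

A script that hardcodes the full path to mpirun is not affected by 
this, which is why such scripts would still have to be adjusted.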

> Well, let me explain this way: for now, SGE will catch the rsh,
> start the rshd, and use a 'real' rsh to start the communication.
> Instead of starting the rshd, why not directly start the program on
> the node as child of the shepherd? No rshd in the way. Would this
> work with all parallel programs out of the box?

I wrote about this to the dev list on Wed, 17 Mar 2004 in a message 
with subject "Wishlist", which was answered by Andreas Haas and then 
followed by another message from me, but they are absent from the 
archives (I noticed the problem several days later and sent a 
notice...). I paste below the relevant parts from both my messages; 
unfortunately I only have my own messages and not Andreas' reply, so 
maybe he can complete this.

2. I would like to be able to start a process on a node which was
already allocated to a job, but without any forwarding of stdio. Using
'qrsh -inherit' does forwarding of stdio which sometimes disturbs more
than it helps. The processes would be created just as now as children 
of the shepherd, so there is still tight control over them, but there 
is no more rsh/rlogin daemon between the shepherd and the process. I 
tried to implement something like this myself but I got lost in 
lists... :-) I have seen mentioned in some of the html files from the 
qsh and qexec directories that there was something called qrexec which 
seems to have vanished completely and which might have functioned 
the way I wanted.
This functionality would enable or make easier at least the following 
3 cases:
- to allow easier integration with LAM-MPI, maybe also MPICH when
using daemons. The LAM-MPI daemon does not need any stdio, all
communication is done through its own sockets. 'qrsh -inherit' is too
heavyweight, starting qrsh and rsh on master node and rshd on the
remote node which all live for the whole duration of the MPI job,
until the LAM daemon finishes.
- to allow reboots of the nodes as part of a job. I want to set up the 
epilog/stop_proc_args to maybe do some updates or reboot the nodes 
that were involved in this job. I don't expect any input or output 
from the update/reboot process (in the case of reboot it would only 
work up to some point anyway).
- to allow easy tunneling between nodes. If the jobs can run without
communication through sockets or with known ports, tunneling can
easily be done, without a random port being chosen for the rsh/rlogin
communication of qrsh. Plus there is an extra connection between the
shepherd and qrsh...

On Wed, 17 Mar 2004, Andreas Haas wrote:
> In our current client command landscape it would be kind of a
>    qsub -noshell -inherit

So, is there any chance of seeing something like this soon? :-)
Should I file an issue as well?

> note that with qrsh -inherit task finish synchronization is simply done
> based on rsh finish. The qsub -inherit would have to do synchronization
> in a fashion somewhat similar to how qsub -sync does it nowadays.

Well, I would like to have the option to specify if I want to wait for
it to finish and get a return code or if I just want to start it and
forget about it (from the point of view of the starter, as the process
will still be accounted for by the execd/shepherd on the execution
host), so I'd like both:

'qsub -inherit -noshell' and
'qsub -inherit -noshell -sync'

to be available. The version without '-sync' would allow starting the
process without having resources taken on the master node by something
that waits for the process to finish. This might become important
for a parallel job that starts on tens or hundreds of nodes...
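
The difference between the two modes can be sketched locally as 
follows. This is only an analogy under stated assumptions, not the 
proposed qsub behaviour: start_task is a hypothetical helper, the child 
runs on the local machine, and the starter itself stays the parent 
(in the proposal the shepherd on the execution host would own and 
account for the process). In both modes no stdio is forwarded.

```python
import subprocess
import sys
from typing import Optional, Sequence


def start_task(argv: Sequence[str], sync: bool) -> Optional[int]:
    """Start argv with no stdio forwarding; wait only if sync is True."""
    proc = subprocess.Popen(
        list(argv),
        stdin=subprocess.DEVNULL,   # nothing is read from the starter
        stdout=subprocess.DEVNULL,  # nothing is sent back to the starter
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # detach from the starter's session
    )
    if sync:
        # Analogue of the '-sync' variant: block and report the exit code.
        return proc.wait()
    # Analogue of the variant without '-sync': start it and forget it.
    return None


# Waits for the child and returns its exit status.
rc = start_task([sys.executable, "-c", "import sys; sys.exit(3)"], sync=True)

# Returns immediately; nothing on the starter side waits for the child.
forgotten = start_task([sys.executable, "-c", "pass"], sync=False)
```

In the non-waiting case the starter keeps nothing alive per task, which 
is the saving that matters when tasks are started on many nodes.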

Some of the things mentioned above will certainly need to be 
changed/dropped if a TM-like API is provided, but a year ago 
I did not even dare to mention TM on this list :-)

Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
