[GE users] Long delay when submitting large jobs
Bogdan.Costescu at iwr.uni-heidelberg.de
Mon Feb 14 22:02:21 GMT 2005
[ Sorry for the long delay, I was away last week... ]
On Tue, 8 Feb 2005, Reuti wrote:
> Some programs like Turbomole come with scripts, which will call
> mpirun many times during the iterations. In this case you would have
> to adjust the scripts of the program, and it would also not work out
> of the box.
Well, if they reference mpirun without the full path, the same trick
as used now with rsh can be applied.
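A minimal sketch of that PATH trick, written in Python purely for illustration. The wrapper body and the temporary directory are hypothetical; a real wrapper would hand the command over to whatever start mechanism SGE provides instead of just echoing its arguments.

```python
import os
import shutil
import stat
import tempfile


def install_wrapper(name, body):
    """Create an executable script called `name` in a fresh directory and
    put that directory at the front of PATH, so that a call to `name`
    without a full path finds the wrapper before the real program."""
    wrapper_dir = tempfile.mkdtemp(prefix="wrappers-")
    wrapper_path = os.path.join(wrapper_dir, name)
    with open(wrapper_path, "w") as f:
        f.write(body)
    # Make the wrapper executable for its owner.
    mode = os.stat(wrapper_path).st_mode
    os.chmod(wrapper_path, mode | stat.S_IXUSR)
    # Prepend the wrapper directory so it shadows any other `name` in PATH.
    os.environ["PATH"] = wrapper_dir + os.pathsep + os.environ.get("PATH", "")
    return wrapper_path


# Hypothetical wrapper body (assumption): only reports how it was called.
WRAPPER_BODY = '#!/bin/sh\necho "wrapped mpirun called with: $*"\n'

wrapper = install_wrapper("mpirun", WRAPPER_BODY)
resolved = shutil.which("mpirun")  # what an unqualified "mpirun" resolves to
```

A script that calls mpirun with its full path bypasses PATH lookup entirely, so it would still need to be edited by hand.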
> Well, let me explain this way: for now, SGE will catch the rsh,
> start the rshd, and use a 'real' rsh to start the communication.
> Instead of starting the rshd, why not directly start the program on
> the node as child of the shepherd? No rshd in the way. Would this
> work with all parallel programs out of the box?
I wrote about this to the dev list on Wed, 17 Mar 2004 in a message
with subject "Wishlist" which was answered by Andreas Haas and then
another message by me, but they are absent from the archives (I
noticed the problem several days later and sent a notice...). I paste
below the relevant parts from both my messages; unfortunately I only
have my messages but not Andreas' one, so maybe he can complete this.
2. I would like to be able to start a process on a node which was
already allocated to a job, but without any forwarding of stdio. Using
'qrsh -inherit' does forwarding of stdio which sometimes disturbs more
than it helps. The processes would be created just as now as children
of the shepherd, so there is still tight control over them, but there
is no more rsh/rlogin daemon between the shepherd and the process. I
tried to implement something like this myself but I got lost in
lists... :-) I have seen mentioned in some of the html files from the
qsh and qexec directories that there was something called qrexec which
seems to have vanished completely and which might have functioned
the way I wanted.
This functionality would enable or make easier at least the following:
- to allow easier integration with LAM-MPI, maybe also MPICH when
using daemons. The LAM-MPI daemon does not need any stdio, all
communication is done through its own sockets. 'qrsh -inherit' is too
heavyweight, starting qrsh and rsh on master node and rshd on the
remote node which all live for the whole duration of the MPI job,
until the LAM daemon finishes.
- to allow reboots of the nodes as part of a job. I want to set up the
epilog/stop_proc_args to maybe do some updates or reboot the nodes
that were involved in this job. I don't expect any input or output
from the update/reboot process (in the case of reboot it would only
work up to some point anyway).
- to allow easy tunneling between nodes. If the jobs can run without
communication through sockets or with known ports, tunneling can
easily be done, without a random port being chosen for the rsh/rlogin
communication of qrsh. Plus there is an extra connection between the
shepherd and qrsh...
On Wed, 17 Mar 2004, Andreas Haas wrote:
> In our current client command landscape it would be kind of a
> qsub -noshell -inherit
So, are there chances to see something like this soon ? :-)
Should I file an issue as well ?
> note that with qrsh -inherit task finish synchronization is simply done
> based on rsh finish. The qsub -inherit would have to do synchronization
> in a somewhat similar fashion like qsub -sync does it nowadays.
Well, I would like to have the option to specify if I want to wait for
it to finish and get a return code or if I just want to start it and
forget about it (from the point of view of the starter, as the process
will still be accounted for by the execd/shepherd on the execution
host), so I'd like both:
'qsub -inherit -noshell' and
'qsub -inherit -noshell -sync'
to be available. The version without '-sync' would allow starting the
process without having resources taken on the master node by something
that waits for the process to finish. This might become important
for a parallel job that starts on tens or hundreds of nodes...
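The two modes can be sketched with a local analogy in Python. This is not SGE code: the 'qsub -inherit -noshell' options above are only a proposal, and under SGE the process would be a child of the shepherd on the execution host, not of the caller. The function name start_task and its sync parameter are made up for this example.

```python
import subprocess
import sys


def start_task(argv, sync=False):
    """Start argv with no stdio forwarding at all.

    sync=False: return at once with the process handle (start and forget).
    sync=True:  block until the process finishes and return its exit code.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,   # nothing is fed to the task
        stdout=subprocess.DEVNULL,  # nothing is read back from it
        stderr=subprocess.DEVNULL,
    )
    if sync:
        return proc.wait()
    return proc


# Stand-in for a task on a node: a child process that exits with status 3.
task = [sys.executable, "-c", "import sys; sys.exit(3)"]

exit_code = start_task(task, sync=True)  # like the '-sync' variant
handle = start_task(task)                # like the variant without '-sync'
```

In the variant without waiting, nothing on the starting side has to stay alive for the lifetime of the task, which is the resource saving that matters on tens or hundreds of nodes.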
Some of the things mentioned above will certainly need to be
changed/dropped in case a TM-like API is provided, but a year ago
I did not even dare to mention TM on this list :-)
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De