[GE users] suspend/resume rsh/qrsh parallel task with SGE

fboucher Florent.Boucher at cnrs-imn.fr
Mon Mar 9 14:39:26 GMT 2009

reuti wrote:

Hi Florent,

On 09.03.2009 at 14:57, fboucher wrote:

reuti wrote:

Hi, on 09.03.2009 at 10:46, fboucher wrote:

I would like to be able to suspend parallel tasks that are not
based on MPI communication. The main script, which runs on the
master, starts child processes using rsh (or ssh) on different
nodes. All those tasks are independent and can run in parallel
(no communication between them). However, all of them need to
finish before the job as a whole can continue. I would like to be
able to suspend the whole job (as one can do with an MPI task).
At the moment, the SIGTSTP or SIGSTOP signal is sent using
qmod -sj. However, the child processes started by the master
script completely ignore this signal (it is not trapped by
rsh/qrsh or ssh). Is there a way to send this SIGTSTP signal
directly to all the child processes created by the master script
(or to trap it with the rsh/ssh command)?
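
A minimal local sketch of the trapping idea asked about above, not a confirmed fix for the rsh/qrsh case: the master script records the PIDs of its background children and relays SIGTSTP and SIGCONT to them. The helper names (start_child, forward_signal) are made up for illustration. It assumes a catchable signal actually reaches the master script (SIGSTOP cannot be trapped), and it only signals local processes; whether the local rsh/qrsh client passes the signal on to the remote task is exactly the open question here.

```shell
# Space-separated PIDs of the background children (hypothetical bookkeeping).
child_pids=""

# Start a command in the background and remember its PID.
start_child() {
    "$@" &
    child_pids="$child_pids $!"
}

# Send signal $1 to every recorded child; children that are gone are skipped.
forward_signal() {
    for pid in $child_pids; do
        kill -s "$1" "$pid" 2>/dev/null || true
    done
}

# Relay suspend/resume to the children. SIGSTOP cannot be trapped, so this
# only helps when a catchable signal such as SIGTSTP is delivered.
trap 'forward_signal TSTP' TSTP
trap 'forward_signal CONT' CONT
```

In a master script, each remote call would be started through start_child instead of a bare "&", followed by a wait. Note that wait returns early when a trapped signal arrives, so the script would have to call wait again until all children have finished.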

A patch was posted on the list some time ago (of course, you then
need a tight integration of the parallel application):

I will update and see if it helps (we have 6.1u3 at the moment).
However, do you think this patch will solve the case where running
qmod -sj $JOBID has no effect on the child processes?

I think so. Whether the suspension is triggered by a subordination or
by qmod, I would assume the same routine is used to deliver the signals.

Also, what do you call a tight integration? I am quite new to GE
and not so familiar with it.
I use a specific mpich parallel environment to submit the job,
capture the list of nodes and processors to generate my own machine
files, and then use them to start the remote tasks with commands like:
/opt/sge/bin/lx24-amd64/qrsh -inherit n001 lapw1c dnlapw1_1.def
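
The workflow described above can be sketched roughly as follows. This is only an illustration, assuming the host list comes from the $PE_HOSTFILE that SGE provides inside a parallel job (one line per host: host name, slot count, queue, processor range). The function name expand_hostfile and the task numbering in the commented loop are hypothetical, not taken from the actual scripts.

```shell
# Expand a PE host file (lines of "host slots queue processor-range")
# into one host name per granted slot, printed one per line.
expand_hostfile() {
    while read -r host slots _rest; do
        [ -n "$host" ] || continue      # skip empty lines
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Inside a job, one task per slot could then be started like this
# (the numbering of the .def files is an assumption for illustration):
#
#   n=0
#   for host in $(expand_hostfile "$PE_HOSTFILE"); do
#       n=$((n + 1))
#       qrsh -inherit "$host" lapw1c "dnlapw1_$n.def" &
#   done
#   wait
```

Starting exactly one task per listed slot is what keeps the number of remote processes within what the scheduler granted.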

Exactly, a "qrsh -inherit ..." is fine for granting SGE control of the
started slave tasks. This looks like Wien2k. I checked this some time
ago, and in the end we used it in parallel only on one and the same
node, as there are several calls in the scripts which are put in the
background with bash's & and might overload a node (i.e. use more
slots than granted).
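
One way to keep background calls within the granted slots is to start them in batches. The sketch below is a generic illustration and does not reflect how the Wien2k scripts are actually written; the function name run_throttled is hypothetical, and the limit is assumed to come from SGE's $NSLOTS variable.

```shell
# Run each remaining argument as a shell command line, with at most $1
# of them in the background at the same time (simple batch scheme).
run_throttled() {
    max=$1
    shift
    running=0
    for cmd in "$@"; do
        sh -c "$cmd" &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait                # block until the current batch has finished
            running=0
        fi
    done
    wait                        # collect the last, possibly partial, batch
}
```

Inside a job this could be called as run_throttled "${NSLOTS:-1}" "cmd1" "cmd2" "cmd3". Batching is cruder than refilling a slot as soon as one task ends, but it needs nothing beyond plain sh.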

This is indeed WIEN2k. I had no problem suspending such a job with
LSF, just by replacing the standard rsh command with "lsrun -m", but
with this qrsh -inherit it does not work as I expected.
Do you really think that putting the tasks in the background will be
a problem? We currently use it like this and have no overload
problem, as I generate the machine file from the GE host list.
However, I cannot imagine working with only one node, as many of our
calculations are very CPU/memory demanding. I often even mix rsh
parallelism and MPI parallelism in WIEN2k when the memory/CPU
requirement is too high.

Once you have a Tight Integration working, maybe you could put the
necessary steps online (independent of the suspend issue).

-- Reuti

---
Florent BOUCHER
Institut des Matériaux Jean Rouxel
2, rue de la Houssinière, BP 32229
44322 NANTES CEDEX 3 (FRANCE)
Mailto: Florent.Boucher at cnrs-imn.fr
Phone: (33) 2 40 37 39 24
Fax: (33) 2 40 37 39 95
http://www.cnrs-imn.fr
---


