[GE users] suspend/resume rsh/qrsh parallel task with SGE

reuti reuti at staff.uni-marburg.de
Mon Mar 9 14:07:31 GMT 2009


Hi Florent,

On 09.03.2009 at 14:57, fboucher wrote:

> reuti wrote:
>>
>> Hi, on 09.03.2009 at 10:46, fboucher wrote:
>>>
>>> I would like to be able to suspend parallel tasks that are not
>>> based on MPI communication. The main script, which runs on the
>>> master, starts child processes using rsh (or ssh) on different
>>> nodes. All those tasks are independent and can run in
>>> parallel (no communication between them). However, all of them
>>> must finish before the job as a whole can continue. I would like
>>> to be able to suspend the whole job (as one can do with an MPI task).
>>> At the moment, the SIGTSTP or SIGSTOP signal is sent using
>>> qmod -sj. However, the child processes started by the master
>>> script completely ignore this signal (it is not trapped by rsh/qrsh
>>> or ssh). Is there a way to send this SIGTSTP signal directly
>>> to all the child processes created by the master script (or to trap
>>> it with the rsh/ssh command)?
>> a patch was on the list some time ago (of course, you need a tight  
>> integration of the parallel application then):
> I will update and see if it helps (we have 6.1u3 at the moment).
> However, do you think this patch will solve the case where running
> qmod -sj $JOBID has no effect on the child processes?

I think so. Whether the suspension is triggered by a subordination or
by qmod, I would assume they use the same routine to deliver the signals.
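
As a side note, the master script could also pass a suspend signal on to the processes it started itself. What follows is only a minimal sketch under several assumptions: the script starts its launcher processes in the background, the job receives a catchable signal (SIGTSTP can be trapped, SIGSTOP cannot), and the names child_pids, forward_signal and wait_for_children are made up for illustration. It only signals local processes, so whether the remote tasks actually stop still depends on the launcher passing the signal on:

```shell
#!/bin/sh
# Sketch only: forward suspend/resume signals from the master script to
# the local background processes it started (hypothetical helper names).

child_pids=""   # space-separated PIDs, filled in by the master script

forward_signal() {
    # $1 = signal name (e.g. TSTP, CONT); ignore children that already exited
    for fpid in $child_pids; do
        kill -s "$1" "$fpid" 2>/dev/null || true
    done
}

wait_for_children() {
    # wait returns early when a trapped signal arrives, so keep waiting
    # for each child until it has really gone away
    for wpid in $child_pids; do
        while kill -0 "$wpid" 2>/dev/null; do
            wait "$wpid" || true
        done
    done
}

# SIGSTOP cannot be trapped; this helps only if SIGTSTP is delivered.
trap 'forward_signal TSTP' TSTP
trap 'forward_signal CONT' CONT
```

In the master script you would append "$!" to child_pids after each background start and call wait_for_children instead of a plain wait.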


> Also, what do you call a tight integration? I am quite new to GE
> and not very familiar with it.
> I use a specific parallel mpich environment to submit the job,
> capture the list of nodes and processors to generate my own machine
> files, and then use them to start the remote tasks with commands like:
> /opt/sge/bin/lx24-amd64/qrsh -inherit n001 lapw1c dnlapw1_1.def

Exactly, a "qrsh -inherit .." is fine to grant SGE control of the
started slave tasks. Looks like Wien2k. I checked this some time ago,
and in the end we ran it in parallel only within a single node,
as there are several calls in the scripts which are put in the
background with bash's & and might overload a node (i.e. use more
slots than granted).
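
To illustrate the slot issue, here is a rough sketch (not taken from the Wien2k scripts) of starting exactly one task per granted slot and waiting for all of them. It assumes the usual $PE_HOSTFILE layout of "host slots queue processors" per line. The function names expand_hostfile and run_tasks and the LAUNCHER override are invented for this example, and every task runs the same command here, which a real job would probably vary per task:

```shell
#!/bin/sh
# Sketch only: one background task per granted slot, never more.

# Print one host name per granted slot from an SGE PE host file.
expand_hostfile() {
    while read -r host slots _rest; do
        [ -n "$host" ] || continue
        i=0
        while [ "$i" -lt "$slots" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Start the given command once per slot and wait for all tasks.
# LAUNCHER is a hypothetical override for dry runs; default is qrsh -inherit.
run_tasks() {
    hostfile=$1
    shift
    pids=""
    for host in $(expand_hostfile "$hostfile"); do
        ${LAUNCHER:-qrsh -inherit} "$host" "$@" </dev/null &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1   # remember if any task failed
    done
    return "$rc"
}
```

Inside a job this could be called as: run_tasks "$PE_HOSTFILE" lapw1c dnlapw1_1.def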

When you have a Tight Integration working, maybe you could put the  
necessary steps online (independent of the suspend issue)?

-- Reuti


>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=74965
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2740
>> -- Reuti
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125423
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
> Florent
>
> --
> Florent BOUCHER
> Institut des Matériaux Jean Rouxel
> 2, rue de la Houssinière
> BP 32229
> 44322 NANTES CEDEX 3 (FRANCE)
> Mailto:Florent.Boucher at cnrs-imn.fr
> Phone: (33) 2 40 37 39 24
> Fax: (33) 2 40 37 39 95
> http://www.cnrs-imn.fr

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125477



