[GE users] suspend/resume rsh/qrsh parallel task with SGE

reuti reuti at staff.uni-marburg.de
Mon Mar 9 15:00:11 GMT 2009


On 09.03.2009 at 15:39, fboucher wrote:

> <snip>
> This is WIEN2k, indeed. I had no problem suspending such a
> job with LSF just by replacing the standard rsh command with
> "lsrun -m", but with this qrsh -inherit it does not work as I was
> expecting.

For now it's by design that suspending a parallel task is not
foreseen in SGE; it must be implemented by the user. The idea was
simply that some parallel libraries might hit a timeout while they
are put to sleep, and would then fail anyway when they are told to
continue. If that's not the case with WIEN2k, then you can change
SGE's source and should get it working.


> Do you really think that putting the task in the background will be
> a problem? We actually use it like this and we have no overload
> problem, as I generate the machine file from the host list of GE.

As long as you also honor the number of granted slots per machine,
it's OK. By putting many processes of jobs in the background, you can
easily overload a machine. If that's not happening in your case, then
there is nothing to do.
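
To illustrate what "honoring the granted slots" means, here is a
minimal Python sketch, not taken from the original setup. It assumes
the usual $PE_HOSTFILE layout (host name, slot count, queue,
processor range per line); the function names are hypothetical. It
emits each host once per granted slot, so a machine file built from
it never starts more tasks on a host than SGE granted:

```python
import os


def slots_per_host(pe_hostfile_text):
    """Return [(host, slots), ...] parsed from the text of a $PE_HOSTFILE."""
    result = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        result.append((fields[0], int(fields[1])))
    return result


def machine_lines(pe_hostfile_text):
    """List each host once per granted slot (hypothetical helper)."""
    return [
        host
        for host, slots in slots_per_host(pe_hostfile_text)
        for _ in range(slots)
    ]


if __name__ == "__main__":
    # Inside a parallel job, SGE sets PE_HOSTFILE to the file's path.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            print("\n".join(machine_lines(fh.read())))
```

The output still has to be converted to whatever machine file format
the application expects; that step is not shown here.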

===

An easier approach than suspending a parallel task might be to
submit these jobs to a queue with a priority (the one in the queue
definition) of 19, as this sets the nice value for the processes.
The resources are occupied by a suspended task anyway, and running it
only at reduced speed will most likely avoid timing problems. Normal
programs could then be submitted to another queue with a nice value
of zero. It may be worth checking this before you recompile SGE.
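
To get a feeling for what a nice value of 19 does outside of SGE,
the following Python sketch starts a child process with its nice
value raised and reports the value the child sees. It is only an
illustration under POSIX assumptions; with the queue priority set,
SGE applies the nice value itself, and the helper names here are
made up:

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=19):
    """Run cmd with its nice value raised by `increment`; return its stdout."""
    def lower_priority():
        # Executed in the child process just before the command starts.
        os.nice(increment)

    done = subprocess.run(
        cmd,
        preexec_fn=lower_priority,  # POSIX only
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout


def child_nice_value(increment=19):
    """Report the nice value a child sees after the increment is applied."""
    probe = [sys.executable, "-c", "import os; print(os.nice(0))"]
    return int(run_niced(probe, increment))


if __name__ == "__main__":
    print("parent nice:", os.nice(0))
    print("child nice: ", child_nice_value(19))
```

A process started this way keeps running, only with the lowest
scheduling priority, which is the behaviour the low-priority queue
relies on.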

-- Reuti


> However, I cannot imagine working with only one node, as many of
> our calculations are very CPU/memory demanding. I often even mix rsh
> parallel and MPI parallel in WIEN2k when the memory/CPU requirement
> is too high.
>
>> When you have a Tight Integration working, maybe you could put the  
>> necessary steps online (independent of the suspend issue)
>> -- Reuti
>>>
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=74965
>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2740
>>>> -- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=125527

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


