[GE users] job suspension

Reuti reuti at staff.uni-marburg.de
Thu May 22 16:26:00 BST 2008


Hi,

On 22.05.2008 at 17:04, Viktor Oudovenko wrote:

> It seems it does send something, as all processes on the head machine of
> the suspended job get status "T" instead of the original "S"; please see
> the attached file.
> I simply used the "ps -axuf" command to get it last night.

State "T" is exactly right, as it means stopped. But for parallel jobs this
is only done for the master process, not the slaves. That is intentional, as
stopping the slaves might often lead to a timeout situation, and the job
would crash later on because it thinks the communication to the node(s)
broke.
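
To see which processes of a job really ended up stopped, something along these lines may help. It is only a minimal sketch, assuming Linux with /proc mounted; it reads the same one-letter state that ps shows in its STAT column. The helper names proc_state and group_states are made up for this example and are not part of SGE.

```shell
#!/bin/sh
# Hypothetical helpers (names are not part of SGE); assumes Linux with /proc.

# proc_state PID: print the one-letter process state,
# e.g. T = stopped, S = sleeping, R = running.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# group_states PGRP: print "pid pgrp state" for every process in that
# process group, to see which members of a job were actually stopped.
group_states() {
    want=$1
    for stat_file in /proc/[0-9]*/stat; do
        # Reduce "pid (command) state ppid pgrp ..." to "pid state ppid pgrp ...";
        # the command name may itself contain spaces or parentheses.
        sed 's/^\([0-9]*\) .*) /\1 /' "$stat_file" 2>/dev/null
    done | awk -v g="$want" '$4 == g { print $1, $4, $2 }'
}
```

Calling group_states with the pgrp value that ps reports for the job's master process lists every member of that group together with its state, so a suspended group should show "T" throughout.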

If you know that your parallel job can in principle survive a suspension,
then you will need to implement your own suspend_method and resume_method
that stop and continue the processes on all nodes of this parallel job.
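
A minimal sketch of what such a method could look like, under several assumptions that are not from this thread: the nodes are reachable with ssh, the job's processes on each node can be identified by owner and exact program name, and a file listing the job's hosts (host name in the first column, as in a PE host file) is available to the method. The script name pe_signal.sh, the function signal_pe_job and the program name my_mpi_app are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper, not shipped with SGE. It could be referenced from the
# queue configuration (paths and the program name below are assumptions):
#   suspend_method  /path/to/pe_signal.sh STOP $job_owner my_mpi_app <hostfile>
#   resume_method   /path/to/pe_signal.sh CONT $job_owner my_mpi_app <hostfile>

# Command used to reach a node; ssh is an assumption, rsh or similar may fit.
: "${REMOTE_SH:=ssh}"

# signal_pe_job SIGNAL OWNER PROGRAM HOSTFILE
# Send SIGNAL to OWNER's processes named exactly PROGRAM on every host listed
# in HOSTFILE (one host per line, host name in the first column).
signal_pe_job() {
    sig=$1 owner=$2 program=$3 hostfile=$4
    awk '{ print $1 }' "$hostfile" | sort -u | while read -r host; do
        # stdin is redirected so the remote shell does not swallow the host list.
        # A host where nothing matches is not treated as an error here.
        $REMOTE_SH "$host" pkill "-$sig" -u "$owner" -x "$program" </dev/null || true
    done
}
```

Saved as a script whose last line is signal_pe_job "$@", the same file could serve as both methods, called with STOP for suspend and CONT for resume. How the method gets hold of the host list is left open here, and matching by owner and program name would also hit other jobs of the same user running the same binary on that node.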

-- Reuti


> Was it helpful?
>
> Regards,
> v
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Thursday, May 22, 2008 4:15
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] job suspension
>>
>> Hi,
>>
>> On 22.05.2008 at 08:24, Viktor Oudovenko wrote:
>>
>>> Hello to everybody,
>>>
>>> Any ideas why job suspension does not work?
>>>
>>> I have SGE 6.0u4 running on a dual Athlon server.
>>> The job is parallel (tight integration).
>>> The queue status correctly changes to "S" but the job continues to
>>> run (so both jobs continue to run).
>>> Please see below:
>>>
>>>  185157 5.00400 mpi_p1       user1        r     05/22/2008 02:06:11 wparallel1@sub04n103              64
>>>  185081 2.01388 mpi_p2       user2        S     05/21/2008 22:13:02 wparallel1_lp@sub04n103           64
>>>
>>> So the queue "wparallel3_lp" (low priority) is defined as a
>>> subordinate queue of wparallel3.
>>>
>>> As far as I remember, when I created the "_lp" queue and tested job
>>> suspension under my account, it worked on the x86 architecture and did
>>> not work on Opterons, but now it does not work even on x86 machines.
>>>
>>> I found information on the net that 6.0u4 has a bug where jobs are not
>>> suspended after an sgemaster restart, but I have not restarted the
>>> master, only the compute nodes.
>>
>> SGE will send a SIGSTOP to the complete process group of the
>> job. So please check whether it's in the correct group.
>>
>> ps -e f -o pid,ppid,pgrp,command
>>
>> (f without a leading -). - Reuti
>>
>>> If you have any questions, please let me know.
>>> Any ideas are welcome.
>>>
>>> best,
>>> vic
>>> p.s.
>>> CLUSTER QUEUE    CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>> ------------------------------------------------------------
>>> wparallel1         1.82     64      0     64      0      0
>>> wparallel1_lp      1.82     64      0     64     64      0
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>> <PS.txt>

