[GE users] job suspension

Viktor Oudovenko udo at physics.rutgers.edu
Thu May 22 22:12:52 BST 2008


Thank you very much, Reuti,

I have written my own scripts, which do the job nicely!
Just to share my experience (I recently learned that this approach is not so common):
I generate the epilog, resume and suspend scripts on the fly in the prolog for each job
and put them into $TMPDIR.
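
A minimal sketch of that idea (not the actual scripts): a prolog helper writes suspend.sh and resume.sh into the job's directory, and each generated script loops over the job's hosts and sends SIGSTOP or SIGCONT there. The function name generate_control_scripts, the use of ssh as the remote runner, and selecting processes with pkill by owner and command name are all assumptions made for illustration. The host list is taken to have the $PE_HOSTFILE layout (host name in the first column).

```shell
#!/bin/sh
# Hypothetical prolog helper (illustration only, not the poster's script).
# $1 = directory for the generated files (the job's $TMPDIR in a real prolog)
# $2 = host list with the host name in the first column ($PE_HOSTFILE layout)
# $3 = job owner, $4 = command name to match (crude selection, an assumption)
# $5 = remote runner, defaults to ssh (assumption; any rsh-like command works)
generate_control_scripts() {
    dir=$1; hostfile=$2; owner=$3; name=$4; runner=${5:-ssh}

    # Keep a private copy of the host names so the generated scripts
    # do not depend on the scheduler's own file.
    awk '{ print $1 }' "$hostfile" > "$dir/hosts"

    for action in suspend resume; do
        case $action in
            suspend) sig=STOP ;;
            resume)  sig=CONT ;;
        esac
        # Unescaped variables are filled in now; \$host is left for run time.
        cat > "$dir/$action.sh" <<EOF
#!/bin/sh
# Generated per job: send SIG$sig to matching processes on every node.
while read -r host; do
    $runner "\$host" pkill -$sig -u "$owner" "$name" < /dev/null
done < "$dir/hosts"
EOF
        chmod +x "$dir/$action.sh"
    done
}
```

In a real prolog this might be called with "$TMPDIR" and "$PE_HOSTFILE" as the first two arguments. Note that pkill by owner and name would also hit other jobs of the same user on a node, so a production version needs a stricter selection. How the queue's suspend_method and resume_method are pointed at the generated files is site-specific and left out here.
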
Regards,
v 

> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: Thursday, May 22, 2008 11:26
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] job suspension
> 
> Hi,
> 
> Am 22.05.2008 um 17:04 schrieb Viktor Oudovenko:
> 
> > It seems it does send something, as all processes of the suspended
> > job on the head machine get status "T" instead of the original "S";
> > please see the attached file.
> > I simply used the "ps -axuf" command to get it last night.
> 
> State "T" is perfect, as it means stopped. But for parallel
> jobs this is only done for the master process, not the
> slaves. This is intentional: stopping the slaves might often lead
> to a timeout, and the job would crash later on because it thinks
> the communication to the node(s) broke.
> 
> If you know that your parallel job could in principle
> survive a suspension, then you will need to implement your
> own suspend_method and resume_method that act on
> all nodes of this parallel job.
> 
> -- Reuti
> 
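
The "T" state can be reproduced on a single machine. A small sketch, assuming a Linux-style ps that accepts -o and -p; the helper name proc_state is made up for the example, and sleep stands in for a job process:

```shell
#!/bin/sh
# Print the state column that ps reports for one PID (e.g. "S" or "T").
proc_state() {
    ps -o stat= -p "$1" | tr -d ' '
}

sleep 60 &                 # stand-in for a job process
pid=$!

kill -STOP "$pid"          # what a suspend delivers
sleep 1                    # give the kernel a moment to stop the process
ps -o pid,ppid,pgrp,stat,command -p "$pid"
state_stopped=$(proc_state "$pid")

kill -CONT "$pid"          # what a resume delivers
sleep 1
state_resumed=$(proc_state "$pid")

kill -TERM "$pid"          # clean up the stand-in process
echo "stopped: $state_stopped, resumed: $state_resumed"
```

To signal a whole process group instead of one process, kill takes the negated group ID, for example kill -STOP -- -PGID with the PGID taken from the pgrp column.
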
> 
> > Was it helpful?
> >
> > Regards,
> > v
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: Thursday, May 22, 2008 4:15
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] job suspension
> >>
> >> Hi,
> >>
> >> Am 22.05.2008 um 08:24 schrieb Viktor Oudovenko:
> >>
> >>> Hello to everybody,
> >>>
> >>> any ideas why job suspension does not work?
> >>>
> >>> I have SGE 6.0u4 running on a dual Athlon server.
> >>> The job is parallel (tight integration).
> >>> The queue status correctly changes to "S", but the job continues to
> >>> run (so both jobs keep running).
> >>> Please see below:
> >>>
> >>>  185157 5.00400 mpi_p1  user1  r  05/22/2008 02:06:11  wparallel1@sub04n103     64
> >>>  185081 2.01388 mpi_p2  user2  S  05/21/2008 22:13:02  wparallel1_lp@sub04n103  64
> >>>
> >>> So the queue "wparallel3_lp" (low priority) is defined as a
> >>> subordinate queue of wparallel3.
> >>>
> >>> As far as I remember, when I created the "_lp" queue and tested job
> >>> suspension under my account, it worked on the x86 architecture and
> >>> did not work on Opterons, but now it does not work even on x86 machines.
> >>>
> >>> I found information on the net that 6.0u4 has a bug where jobs are
> >>> not suspended after an sgemaster restart, but I have not restarted
> >>> the master, only the compute nodes.
> >>
> >> SGE will send a SIGSTOP to the complete process group of the job.
> >> So please check whether the processes are in the correct group.
> >>
> >> ps -e f -o pid,ppid,pgrp,command
> >>
> >> (f w/o -). - Reuti
> >>
> >>> If you have any questions, please let me know.
> >>> Any ideas are welcome.
> >>>
> >>> best,
> >>> vic
> >>> p.s.
> >>> CLUSTER QUEUE      CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
> >>> ---------------------------------------------------------------
> >>> wparallel1           1.82     64      0     64      0      0
> >>> wparallel1_lp        1.82     64      0     64     64      0
> >>>
> >>>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> >> <PS.txt>

