[GE users] Application-Level Checkpointing

Reuti reuti at staff.uni-marburg.de
Tue Dec 18 09:30:45 GMT 2007


Hi,

On 18.12.2007 at 07:37, Ravi Chandra Nallan wrote:

> I thought the job could be migrated (and restarted) on suspend by
> using the "when" setting of the checkpointing environment. From
> man checkpoint:
>
>  when
>       The points of time when checkpoints are expected to be generated.
>       Valid values for this parameter are composed by the letters s, m,
>       x and r and any combinations thereof without any separating
>       character in between. The same letters are allowed for the -c
>       option of the qsub(1) command which will overwrite the
>       definitions in the used checkpointing environment. The meaning
>       of the letters is defined as follows:
>
>       s      A job is checkpointed, aborted and if possible migrated if
>              the corresponding sge_execd(8) is shut down on the job's
>              machine.
>
>       m      Checkpoints are generated periodically at the
>              min_cpu_interval interval defined by the queue (see
>              queue_conf(5)) in which a job executes.
>
> *     x      A job is checkpointed, aborted and if possible migrated as
>              soon as the job gets suspended (manually as well as
>              automatically).
> *
>       r      A job will be rescheduled (not checkpointed) when the host
>              on which the job currently runs went into unknown state
>              and the time interval reschedule_unknown (see sge_conf(5))
>              defined in the global/local cluster configuration will be
>              exceeded.

This is not exactly what is actually triggered. The best reference is
the state diagram in
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
There are already some issues in Issuezilla about the actual behavior
differing from the documented one (all of 2037 to 2045).

- In the case of "s", no checkpoint can be created, since contact
with the execd is already lost. Instead, the job will be rescheduled
when the execd reappears (issue 2045).

- In the case of "x", the migration script has to do the checkpointing
(maybe by calling the script already defined for checkpointing on the
"m" event) and then kill all processes. SGE will not do anything to
checkpoint or remove the processes on its own (issue 2037). Only the
"migr_command" is called, nothing more.

- IIRC there is no need to unsuspend a job when a checkpointing
environment is used for it. It will be rescheduled automatically.
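
To make the "x" case concrete, here is a minimal sketch in Python of
what a migration script has to do: write the checkpoint first, then
remove every process of the job. It is only an illustration under
assumptions, not something SGE ships. The names migrate,
write_checkpoint and grace_seconds are made up for this example, and
it assumes that all processes of the job live in the process group of
the PID that is passed in.

```python
import os
import signal
import time


def migrate(job_pid, write_checkpoint, grace_seconds=5.0):
    """Checkpoint a job, then make sure none of its processes survive.

    job_pid          -- PID of the job (assumed to identify its process group)
    write_checkpoint -- hypothetical callable doing the application-level
                        checkpoint, e.g. the same routine used for "m"
    grace_seconds    -- hypothetical delay between SIGTERM and SIGKILL
    """
    # Step 1: SGE will not checkpoint on its own, so do it here.
    write_checkpoint()

    # Step 2: SGE will not remove the processes either. Kill the whole
    # process group, otherwise the job may end up running twice after
    # it has been rescheduled.
    try:
        pgid = os.getpgid(job_pid)
        os.killpg(pgid, signal.SIGTERM)   # polite request first
        time.sleep(grace_seconds)
        os.killpg(pgid, signal.SIGKILL)   # then force whatever is left
    except ProcessLookupError:
        pass  # the group is already gone, nothing left to kill
```

A script built around such a function would be what "migr_command"
points to. How the job's PID reaches the script depends on your
checkpointing environment, see checkpoint(5).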

Also interesting:
http://gridengine.sunsource.net/howto/checkpointing.html

-- Reuti


> regards,
> ~Ravi
>
> Reuti wrote:
>> Hi,
>>
>> On 17.12.2007 at 19:21, Dev wrote:
>>
>>> Does the running application have to be killed completely
>>> before SGE decides to restart the job?
>>
>> Yes, but you have to kill it on your own in the migrate script.
>> Otherwise you might end up with the same job running twice.
>>
>> -- Reuti
>>
>>>
>>>
>>> Dev <dev_hyd2001 at yahoo.com> wrote:
>>>
>>>     Hi,
>>>
>>>     Using Application Level Checkpointing and providing a migrate
>>>     script to it, what are the criteria for the job to be restarted
>>>     by SGE, for example once it has been unsuspended? My test job
>>>     doing a sleep gets restarted by SGE, but some other jobs don't
>>>     seem to get restarted. (This is with SGE 6.0u6, though.)
>>>
>>>     cheers
>>>
>>>     /Dev
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




