[GE users] Application-Level Checkpointing

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Tue Dec 18 06:37:25 GMT 2007


I thought the job could be migrated (and restarted) on suspend by setting
the "when" parameter of the checkpointing environment. From man checkpoint:

  when
       The points of time when checkpoints are expected to be generated.
       Valid values for this parameter are composed by the letters s, m,
       x and r and any combinations thereof without any separating
       character in between. The same letters are allowed for the -c
       option of the qsub(1) command which will overwrite the definitions
       in the used checkpointing environment. The meaning of the letters
       is defined as follows:

       s      A job is checkpointed, aborted and if possible migrated if
              the corresponding sge_execd(8) is shut down on the job's
              machine.

       m      Checkpoints are generated periodically at the
              min_cpu_interval interval defined by the queue (see
              queue_conf(5)) in which a job executes.

*      x      A job is checkpointed, aborted and if possible migrated as
              soon as the job gets suspended (manually as well as
              automatically).
*
       r      A job will be rescheduled (not checkpointed) when the host
              on which the job currently runs went into unknown state
              and the time interval reschedule_unknown (see sge_conf(5))
              defined in the global/local cluster configuration will be
              exceeded.
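
To make the rule in the excerpt concrete, here is a minimal Python sketch that checks a "when" string against the four letters listed above and reports whether it asks for migration on suspend. The names `WHEN_FLAGS`, `parse_when` and `migrates_on_suspend` are made up for illustration and are not part of SGE. Rejecting an empty string is my assumption, and the sketch only covers what the excerpt states, not anything else `qsub -c` may accept.

```python
# Meanings paraphrased from the checkpoint(5) excerpt quoted above.
WHEN_FLAGS = {
    "s": "checkpoint, abort and migrate when sge_execd is shut down",
    "m": "checkpoint periodically at the queue's min_cpu_interval",
    "x": "checkpoint, abort and migrate when the job is suspended",
    "r": "reschedule (no checkpoint) when the host goes into unknown state",
}


def parse_when(value):
    """Return the set of occasion letters found in a 'when' string.

    Raises ValueError for an empty string (an assumption of this sketch)
    or for any character other than s, m, x and r, including separators.
    """
    if not value:
        raise ValueError("empty 'when' value")
    unknown = sorted(set(value) - set(WHEN_FLAGS))
    if unknown:
        raise ValueError("invalid 'when' letters: " + "".join(unknown))
    return set(value)


def migrates_on_suspend(value):
    """True if the 'when' string contains x (checkpoint/migrate on suspend)."""
    return "x" in parse_when(value)


if __name__ == "__main__":
    # Print the meaning of each letter in an example value.
    for letter in sorted(parse_when("sx")):
        print(letter, "-", WHEN_FLAGS[letter])
```

With this, `migrates_on_suspend("sx")` is true and `migrates_on_suspend("sm")` is false, which is the distinction that matters for the suspend case discussed here.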

regards,
~Ravi

Reuti wrote:
> Hi,
>
> On 17.12.2007 at 19:21, Dev wrote:
>
>> Does the running application have to be completely killed before SGE
>> decides to restart the job?
>
> Yes, but you have to kill it yourself in the migrate script.
> Otherwise you might end up with the same job running twice.
>
> -- Reuti
>
>>
>>
>> Dev <dev_hyd2001 at yahoo.com> wrote:
>>
>>     Hi,
>>
>>     When using Application-Level Checkpointing with a migrate
>>     script, what are the criteria for the job to be restarted by
>>     SGE, for example once it has been unsuspended? My test job
>>     doing a sleep gets restarted by SGE, but some other jobs don't
>>     seem to get restarted. (This is with SGE 6.0u6, though.)
>>
>>     cheers
>>
>>     /Dev
>
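
Reuti's point in the quoted reply, that the migrate script has to kill the application itself, could be sketched in Python roughly as below. This is only an illustration under assumptions: the helper names `pid_alive` and `stop_job` are made up, the idea that the job's PID arrives as the first command-line argument is hypothetical, and so is the choice of exit codes. Check checkpoint(5) for what your SGE version actually passes to the migrate command.

```python
import os
import signal
import sys
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_job(pid, grace_seconds=10.0, poll_interval=0.2):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True once the process is gone, False if it is still there
    shortly after SIGKILL was sent.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll_interval)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return True
    time.sleep(poll_interval)
    return not pid_alive(pid)


if __name__ == "__main__":
    # Hypothetical invocation: the job's PID is the first argument.
    sys.exit(0 if stop_job(int(sys.argv[1])) else 1)
```

The point of waiting until the process has really disappeared is the one made in the thread: if the application is still running when SGE restarts the job, the same job may end up running twice.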
