[GE users] Application-Level Checkpointing
Ravi Chandra Nallan
Ravichandra.Nallan at Sun.COM
Tue Dec 18 06:37:25 GMT 2007
I thought the job could be migrated (and restarted) on suspend by setting
the "when" parameter of the checkpointing environment accordingly.
From man checkpoint:
  when
      The points of time when checkpoints are expected to be
      generated. Valid values for this parameter are composed by the
      letters s, m, x and r and any combinations thereof without
      any separating character in between. The same letters are
      allowed for the -c option of the qsub(1) command which will
      overwrite the definitions in the used checkpointing
      environment. The meaning of the letters is defined as follows:

      s   A job is checkpointed, aborted and if possible
          migrated if the corresponding sge_execd(8) is shut down
          on the job's machine.

      m   Checkpoints are generated periodically at the
          min_cpu_interval interval defined by the queue (see
          queue_conf(5)) in which a job executes.

      x   A job is checkpointed, aborted and if possible
          migrated as soon as the job gets suspended (manually as
          well as automatically).

      r   A job will be rescheduled (not checkpointed) when the host
          on which the job currently runs went into unknown
          state and the time interval reschedule_unknown (see
          sge_conf(5)) defined in the global/local cluster
          configuration will be exceeded.
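
As an illustration only, a checkpointing environment that migrates on suspend and on execd shutdown might look roughly like this in the layout printed by qconf -sckpt. This is a sketch under assumptions: the name app_ckpt and the migrate script path are made up, and the remaining fields are placeholder values, not a tested setup.

```
ckpt_name          app_ckpt
interface          application-level
ckpt_command       none
migr_command       /path/to/migrate.sh
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               xs
```

A job would then request it with something like "qsub -ckpt app_ckpt job.sh", and a -c option on the qsub command line would override the "when" value above, as the man page excerpt says.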
regards,
~Ravi
Reuti wrote:
> Hi,
>
> On 17.12.2007 at 19:21, Dev wrote:
>
>> Does the running application have to be killed completely
>> before SGE decides to restart the job?
>
> Yes, but you have to kill it on your own in the migrate script.
> Otherwise you might end up with the same job running twice.
>
> -- Reuti
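
To make the point above concrete, here is a minimal sketch of what such a migrate script could do. It assumes the job's PID is handed to the script as its first argument (for example by putting $job_pid on the migr_command line, if your checkpoint(5) lists that variable); the function name terminate_job and the grace period are made up for illustration.

```shell
#!/bin/sh
# Hypothetical migrate script sketch: make sure the old job instance is
# gone before SGE restarts it, so the job cannot end up running twice.

# terminate_job PID [GRACE_SECONDS]
# Sends TERM, waits up to GRACE_SECONDS for the process to exit,
# then sends KILL if it is still there.
terminate_job() {
    job_pid=$1
    grace=${2:-10}
    kill -TERM "$job_pid" 2>/dev/null || return 0   # process already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$job_pid" 2>/dev/null || return 0  # exited after TERM
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$job_pid" 2>/dev/null || true       # last resort
    return 0
}

# Assumption: the job's PID arrives as the first argument.
if [ -n "${1:-}" ]; then
    terminate_job "$1"
fi
```

This only signals the single PID it is given. An application that forks helpers would need its whole process group or tree handled, and a real script would write or flush the application's checkpoint before killing anything.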
>
>>
>>
>> Dev <dev_hyd2001 at yahoo.com> wrote:
>>
>> Hi,
>>
>> Using application-level checkpointing with a migrate script,
>> what are the criteria for SGE to restart the job, for example
>> once it has been unsuspended? My test job, which just does a
>> sleep, gets restarted by SGE, but some other jobs don't seem to
>> get restarted. (This is with SGE 6.0u6, though.)
>>
>> cheers
>>
>> /Dev
>>