[GE users] Killed by Limit and Transfer to another Queue
reuti at staff.uni-marburg.de
Thu Dec 17 13:38:34 GMT 2009
On 14.12.2009 at 04:24, kain2log wrote:
> Dear Reuti,
> Thank you again for your reply.
>> Why on the same machine? One duty of the checkpointing interface is
>> to copy the local data from $TMPDIR to any intermediate storage and
>> then to the new node when the job continues on another machine.
> Same machine...
> well, I'm not sure if our software's execution can be suspended,
> transferred to and restarted on another machine, but I know it can be
> suspended and restarted later.
> (by the way, we are using Cadence-spectre)
Then it should be possible. You will need an application-level
checkpointing interface, and the migrate script should send your job
the necessary signal to make a checkpoint and exit (or you have to
kill the process on your own). The same script must then copy the
files to a location like /home/checkpoint/$JOB_ID. The migration
procedure should be triggered by a suspend of the job. Hence you will
need something that watches the runtime of the jobs in the "short"
queue and suspends them when 5 minutes are reached: SGE will only kill
the job, a checkpoint and migrate is not foreseen by default *. Maybe
the job's queue request should also be changed to be the one for long
jobs
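
A rough sketch of such a migrate script, only to illustrate the steps
above. The argument order, the 60 second grace period and the use of
SIGUSR1 as the "write a checkpoint and exit" signal are assumptions of
mine; check which signal your application really honours.

```shell
#!/bin/sh
# Hypothetical migrate script: signal the job, wait for it to exit,
# then copy its local files to /home/checkpoint/<job_id>.

# migrate_job JOB_ID JOB_PID SRC_DIR [CKPT_ROOT] [SIGNAL]
migrate_job() {
    job_id=$1
    job_pid=$2
    src_dir=$3
    ckpt_root=${4:-/home/checkpoint}
    ckpt_signal=${5:-USR1}   # assumption: depends on the application

    dest="$ckpt_root/$job_id"
    mkdir -p "$dest" || return 1

    # Ask the application to write its checkpoint and exit.
    if kill -0 "$job_pid" 2>/dev/null; then
        kill -s "$ckpt_signal" "$job_pid"
        # Give it up to 60 seconds, then kill the process on our own.
        waited=0
        while kill -0 "$job_pid" 2>/dev/null && [ "$waited" -lt 60 ]; do
            sleep 1
            waited=$((waited + 1))
        done
        if kill -0 "$job_pid" 2>/dev/null; then
            kill -s KILL "$job_pid"
        fi
    fi

    # Copy the local data to the intermediate storage.
    cp -Rp "$src_dir"/. "$dest"/
}

# Run only when called with arguments, so the file can also be sourced.
if [ "$#" -ge 3 ]; then
    migrate_job "$@"
fi
```

The script would be entered as the migr_command of the checkpointing
environment, with the job id and pid passed in by SGE (see man
checkpoint for the pseudo variables available there). Where the source
directory comes from depends on your setup.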
There is a general Howto:
http://gridengine.sunsource.net/howto/checkpointing.html
and a nice state diagram in http://
(Besides man sge_ckpt and man checkpoint)
When the job restarts, it's necessary to copy the relevant data back
to a node, either in a queue prolog or in the job script itself.
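
The copy-back step could look roughly like this, either at the top of
the job script or in a prolog. The function name is made up, and the
location /home/checkpoint/$JOB_ID is just the example from above. It
relies on SGE setting RESTARTED=1 in the environment of a restarted
job.

```shell
#!/bin/sh
# Hypothetical restore step: copy saved checkpoint files back to the
# node-local working directory before the application continues.

# restore_checkpoint JOB_ID WORK_DIR [CKPT_ROOT]
restore_checkpoint() {
    job_id=$1
    work_dir=$2
    ckpt_root=${3:-/home/checkpoint}
    src="$ckpt_root/$job_id"

    # Nothing saved yet: this is a fresh start, nothing to do.
    [ -d "$src" ] || return 0

    mkdir -p "$work_dir" && cp -Rp "$src"/. "$work_dir"/
}

# Only restore when SGE marks the job as restarted.
if [ "${RESTARTED:-0}" = "1" ]; then
    restore_checkpoint "$JOB_ID" "$TMPDIR"
fi
```

After this the application has to be started in its "continue from
checkpoint" mode, which is specific to the software.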
*) Maybe I should add this to my already existing checkpointing
issues: a setting h_migr/s_migr in the queues which will trigger the
migration. Although it could be h_susp/s_susp and use the usual
> I guess I need to read more about our software and SGE
> checkpointing too, this is a great idea.
>>> The wallclock limit will be so short (about 5 minutes) that it
>>> would be OK to restart the job. Would checkpoint's migration be
>>> applicable, or are there other workarounds?
>> Then you don't need any checkpointing facility. Just reschedule the
>> job with: qmod -rj <job_id>
>> But why do you want to do this? When you know beforehand that the job
>> will run longer than 5 minutes, then you could request this run time
>> (-l h_rt) and SGE would automatically send the job to the correct
> Sometimes it's hard to judge how long a simulation will take. Also,
> even if it is very obvious that a job will run for more than 5
> minutes, users would still queue on the one with the limit. So I
> want to do the queue transfer automatically.
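
For the automatic part, the watcher mentioned above could be sketched
like this. It is only an outline: the input format (one "JOB_ID
START_EPOCH" pair per line) is an assumption of mine, and producing
that list from qstat is left out. It also assumes the checkpointing
environment is set up to migrate when a job gets suspended.

```shell
#!/bin/sh
# Hypothetical runtime watcher for the "short" queue.
# Reads "JOB_ID START_EPOCH" lines on stdin and prints the ids of all
# jobs that have been running for at least LIMIT seconds.

LIMIT=${LIMIT:-300}   # 5 minutes, in seconds

# overdue_jobs NOW_EPOCH [LIMIT]
overdue_jobs() {
    now=$1
    limit=${2:-$LIMIT}
    while read -r job_id start; do
        # Skip empty or incomplete lines.
        if [ -z "$job_id" ] || [ -z "$start" ]; then
            continue
        fi
        if [ $((now - start)) -ge "$limit" ]; then
            echo "$job_id"
        fi
    done
}

# Intended use (not run here; some_job_lister is a placeholder):
#   some_job_lister | overdue_jobs "$(date +%s)" | while read -r id; do
#       qmod -sj "$id"    # suspend, which should trigger the migration
#   done
```

Run from cron every minute or so, this would suspend the overdue jobs
and leave the checkpoint and copy to the migrate script.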
> To unsubscribe from this discussion, e-mail: [users-
> unsubscribe at gridengine.sunsource.net].