[GE users] Killed by Limit and Transfer to another Queue

reuti reuti at staff.uni-marburg.de
Thu Dec 17 13:38:34 GMT 2009


Hi,

On 14.12.2009 at 04:24, kain2log wrote:

> Dear Reuti,
> Thank you again for your reply.
>
>
>>
>> Why on the same machine? One duty of the checkpointing interface is
>> to copy the local data from $TMPDIR to any intermediate storage and
>> then to the new node when the job continues on another machine.
>
> Same machine...
> well, I'm not sure if our software's execution can be suspended,
> transferred and restarted on another machine, but I know it can be
> suspended and restarted later.
> (by the way, we are using Cadence Spectre)

Then it should be possible. You will need an application-level
checkpointing interface, and the migrate script should send your job
the necessary signal to make a checkpoint and exit (or you have to
kill the process on your own). The same script must then copy the
files to a location like /home/checkpoint/$JOB_ID. The migration
procedure should be triggered by a suspend of the job. Hence you will
need something that looks at the runtime of the jobs in the "short"
queue and suspends them when 5 minutes are reached: SGE itself will
only kill the job, a checkpoint and migrate is not foreseen by
default *. Maybe the job's queue request should also be changed to
the one for long jobs (by qalter).
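
A minimal sketch of such a migrate script in Python, to make the two
steps (signal, then copy) concrete. The signal is an assumption: I
don't know which one your application expects, so SIGUSR2 is only a
placeholder. The function names are made up for illustration, and how
the pid, working directory and job id reach the script depends on how
you set up migr_command in the checkpointing environment.

```python
import os
import shutil
import signal
import time

CHECKPOINT_ROOT = "/home/checkpoint"   # location suggested above
CHECKPOINT_SIGNAL = signal.SIGUSR2     # assumption: check the application's docs


def wait_for_exit(pid, timeout_s=60.0, poll_s=0.5):
    """Return True once process `pid` is gone, False if it outlives the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)            # signal 0 only tests for existence
        except ProcessLookupError:
            return True
        time.sleep(poll_s)
    return False


def save_checkpoint(work_dir, job_id, root=CHECKPOINT_ROOT):
    """Copy the job's working files to <root>/<job_id> and return that path."""
    dest = os.path.join(root, str(job_id))
    shutil.copytree(work_dir, dest, dirs_exist_ok=True)   # needs Python 3.8+
    return dest


def migrate(pid, work_dir, job_id, root=CHECKPOINT_ROOT):
    """Ask the job to checkpoint and exit, kill it if it doesn't, then copy."""
    try:
        os.kill(pid, CHECKPOINT_SIGNAL)
        if not wait_for_exit(pid):
            os.kill(pid, signal.SIGKILL)   # "kill the process on your own"
    except ProcessLookupError:
        pass                               # process was already gone
    return save_checkpoint(work_dir, job_id, root)
```

The migrate command would then call something like
migrate(job_pid, tmpdir, job_id) with the values it got from SGE.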

There is a general Howto:
http://gridengine.sunsource.net/howto/checkpointing.html
and a nice state diagram in
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf

(Besides man sge_ckpt and man checkpoint)

When the job restarts, it's necessary to copy the relevant data back
to the node, either in a queue prolog or in the job script itself.
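
The copy-back step could look roughly like this. It is only a sketch:
restore_checkpoint is a made-up name, and it assumes the checkpoint
was stored under /home/checkpoint/$JOB_ID as in the example location
above.

```python
import os
import shutil

CHECKPOINT_ROOT = "/home/checkpoint"   # same location the migrate script used


def restore_checkpoint(job_id, work_dir, root=CHECKPOINT_ROOT):
    """Copy saved files back into work_dir.

    Returns True if a checkpoint was restored, False if none exists
    (i.e. the job is starting for the first time).
    """
    src = os.path.join(root, str(job_id))
    if not os.path.isdir(src):
        return False
    shutil.copytree(src, work_dir, dirs_exist_ok=True)   # needs Python 3.8+
    return True
```

In a prolog or at the top of the job script this would be called as
restore_checkpoint(os.environ["JOB_ID"], os.environ["TMPDIR"]), since
SGE sets both variables in the job's environment.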

-- Reuti

*) Maybe I should add this to my already existing checkpointing
issues: a setting h_migr/s_migr in the queues which will trigger the
migration, although it could also be h_susp/s_susp and use the usual
mechanism.
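
Until something like that exists, the runtime watcher has to live
outside of SGE. A rough sketch of its core logic follows. The queue
name long.q and all function names are hypothetical, and getting the
job start times (e.g. by parsing qstat output) is left out. Whether
qalter accepts the change for an already running job is something you
would have to test.

```python
import subprocess
import time

SHORT_LIMIT_S = 5 * 60    # runtime limit of the "short" queue
LONG_QUEUE = "long.q"     # hypothetical name of the queue for long jobs


def overdue(jobs, now, limit_s=SHORT_LIMIT_S):
    """jobs maps job id -> start time (epoch seconds); return ids past the limit."""
    return sorted(job for job, start in jobs.items() if now - start >= limit_s)


def commands_for(job_id, long_queue=LONG_QUEUE):
    """Commands to re-target a job at the long queue and then suspend it."""
    return [
        ["qalter", "-q", long_queue, str(job_id)],
        ["qmod", "-sj", str(job_id)],
    ]


def enforce(jobs, now=None, run=subprocess.check_call):
    """Run the commands for every overdue job; return the ids that were hit."""
    now = time.time() if now is None else now
    hit = overdue(jobs, now)
    for job_id in hit:
        for cmd in commands_for(job_id):
            run(cmd)
    return hit
```

Passing a different `run` (for example a list's append method) gives a
dry run that only collects the commands instead of executing them.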


> I guess I need to read more about our software and SGE  
> checkpointing too, this is a great idea.
>
>>
>>
>>> The wallclock limit will be so short (about 5mins) that it would be
>>> OK to restart the job. Would checkpoint's migration be applicable,
>>> or are there other work around?
>>
>> Then you don't need any checkpointing facility. Just reschedule the
>> job with: qmod -rj <job_id>
>>
>> But why do you want to do this? When you know beforehand that the job
>> will run longer than 5 minutes, then you could request this run time
>> (-l h_rt) and SGE would automatically send the job to the correct  
>> queue.
>
> Sometimes it's hard to judge how long a simulation will take. Also,
> even if it is very obvious that a job will run for more than 5
> minutes, users still submit to the queue with the limit. So I
> want to do the queue transfer automatically.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233175
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=233914

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


