[GE users] Suspending a job

reuti reuti at staff.uni-marburg.de
Tue Jul 6 16:17:55 BST 2010


On 06.07.2010 at 14:06, spow wrote:

>> No. Your applications must already support checkpointing on their own, outside of SGE. SGE can then be set up to trigger these existing checkpointing mechanisms.
>> The checkpointing interface in combination with subordination can, however, be used to requeue a preempted job when a superordinated job starts. But as resources are only released after the subordinated job is requeued, the superordinated job must be able to start even though some resources are still blocked by the subordinated job.
> I looked at the sample code given in the checkpointing howto: it is too
> complicated for end users to implement, as a crashing job would cost
> less time than rewriting all the code they are currently
> executing. I thought the checkpointing environment was much
> easier to use!

This should usually be done by an admin and not by the end user. It should be implemented once for all users, who can then access this checkpointing setup by requesting the proper checkpointing environment in the `qsub` command for the intended application.

>> If you are not satisfied with the above options, you will have to use a co-scheduler, which requeues the job in question to free up resources. It also needs to take measures to prevent the requeued job from restarting immediately.
>> Should a parallel job always preempt a serial one in your setup?
>> -- Reuti
> Could you explain further what a co-scheduler is, or give me a URL? I
> have been unable to find any decent answers on Google.
> As for the parallel jobs, they should indeed always preempt serial ones.
> Only in a few cases will the administrator decide to stop some of them,
> and those cases should be resolved manually.

I'm also not aware of any working framework for it. But the approach could be a cron job which runs e.g. every 5 minutes and checks:

- 1. Are there any waiting parallel jobs?
- 2. If yes: set all serial jobs to hold (`qhold ...`) and reschedule one of the running serial jobs (`qmod -rj ...`).
- 3. Then wait some time for the next scheduling cycle and go back to 1.
- 4. Else: are there any waiting serial jobs on hold?
- 5. If yes: release them again (`qrls ...`).
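The decision part of the polling loop above could be sketched roughly as follows. This is only an illustration, not a tested co-scheduler: the `Job` record and the `plan_cycle` function are hypothetical names, and it assumes some other code has already gathered the job list (for example by parsing `qstat` output, which is not shown here). The function only returns the `qhold`, `qmod -rj` and `qrls` command lines for one cycle, it does not execute them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical, simplified view of one job; not an SGE data structure.
    job_id: str
    parallel: bool       # True if the job requested a parallel environment
    running: bool        # False means the job is waiting in the queue
    held: bool = False   # True if a hold is currently set on the job


def plan_cycle(jobs: List[Job]) -> List[List[str]]:
    """Return the commands one polling cycle would issue, without running them."""
    serial = [j for j in jobs if not j.parallel]
    # Assumption: a parallel job that is itself on hold does not count as
    # waiting, since it could not start anyway.
    parallel_waiting = any(
        j.parallel and not j.running and not j.held for j in jobs
    )
    commands: List[List[str]] = []

    if parallel_waiting:
        # Hold all serial jobs first, so that a requeued job cannot be
        # dispatched again in the next scheduling cycle.
        for j in serial:
            if not j.held:
                commands.append(["qhold", j.job_id])
        # Requeue only one running serial job per cycle.
        running = [j for j in serial if j.running]
        if running:
            commands.append(["qmod", "-rj", running[0].job_id])
    else:
        # No parallel job is waiting: release waiting serial jobs on hold.
        for j in serial:
            if j.held and not j.running:
                commands.append(["qrls", j.job_id])
    return commands
```

A cron wrapper would build the job list, call `plan_cycle(jobs)` and pass each returned list to something like `subprocess.run`. Note that this sketch cannot tell its own holds apart from holds set by users, so a real setup would need to track which jobs it put on hold.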

-- Reuti

