[GE users] Reserving resources for the duration of checkpointing so that no other job can use them

sgerns rajansrivastava83 at gmail.com
Wed Aug 18 07:55:31 BST 2010

Hi Reuti,

Thanks for your prompt reply and your concern.
I am looking for your help with this.

Please see my inline replies, which explain the scenario.

On Tue, Aug 17, 2010 at 5:47 PM, reuti <reuti at staff.uni-marburg.de> wrote:

On 17.08.2010 at 16:57, sgerns wrote:

> I have a scenario here, which I will try to explain in steps.
> 1. I have some jobs running in the cluster, scheduled according to the priority of the queues (some queues have higher priority than others, and so on).
> 2. I also have a VIP.q queue, which has the highest priority of all the queues. So of course jobs submitted through the VIP queue move up in the pending list, as shown below.
> job-ID  prior    name       user  state  queue             slots
> 100     0.50500  job_name1  usr1  r      normal.q@exehost  32
> 101     0.50500  job_name2  usr1  r      normal.q@exehost  16
> 102     0.50500  job_name2  usr1  r      normal.q@exehost  32
> 103     0.50500  job_name3  usr2  r      normal.q@exehost  128
> 104     0.50973  job_name4  usr3  qw     VIP.q@exehost     64
> 105     0.50973  job_name5  usr3  qw     normal.q@exehost  32
> 106     0.51514  job_name6  usr4  qw     normal.q@exehost  16
> 3. These VIP jobs are very important and I want them to run ASAP, so I will checkpoint the lower-priority jobs that are currently running.

How are you checkpointing these jobs, i.e. which checkpointing environment did you set up in SGE?

--->> I am checkpointing the jobs automatically. I am writing wrapper scripts around the job scripts that decide which jobs to checkpoint based on priority and then send the checkpointing signal to those jobs. Nothing is manual; we are not doing it by hand.
Say these wrapper scripts select jobs 102 and 100 for checkpointing, as in step 4 below.
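For illustration only, the selection step could look roughly like the sketch below. It is not the actual wrapper from this thread. The input format (one running job per line: job ID, priority, slots) and the policy (lowest priority first, then the largest job that still fits the remaining need) are assumptions made for the example, and `select_jobs_to_checkpoint` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read "job_id priority slots" lines for running
# lower-priority jobs on stdin and print the IDs of jobs to checkpoint
# until the requested number of slots is covered.
select_jobs_to_checkpoint() {
    needed_slots=$1
    # Lowest priority first, then most slots first, then lowest job ID.
    LC_ALL=C sort -k2,2n -k3,3nr -k1,1n |
    LC_ALL=C awk -v need="$needed_slots" '
        # Take a job only if it does not free more than is still needed.
        need > 0 && $3 <= need { print $1; need -= $3 }
    '
}

# Running normal.q jobs from the listing; VIP job 104 needs 64 slots.
printf '%s\n' \
    '100 0.50500 32' \
    '101 0.50500 16' \
    '102 0.50500 32' \
    '103 0.50500 128' |
select_jobs_to_checkpoint 64
# prints 100 and 102 (32 + 32 = 64 slots)
```

Because jobs larger than the remaining need are skipped, this policy can end up freeing fewer slots than requested (for example if only job 103 were running); a real wrapper would have to decide how to handle that case.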

> 4. Suppose I have selected job IDs 100 and 102 for checkpointing and sent the checkpointing signal to them, and it takes 2 hours to checkpoint these jobs.
>    I do not want any other job to run on these resources during these 2 hours,
>    because there is a possibility that small jobs could get the resources and start running.

You mean, you e.g. suspend these jobs, which will be checkpointed as a result? As SGE thinks the queue is free again, it might be used by other jobs. Why not suspend the normal.q@exehostXY?

--->> We will now send the checkpointing signals to these jobs (i.e. job IDs 100 and 102). Suppose they are very long jobs that have been running for 2 days, so it may take, say, 2 hours for them to be checkpointed and release their resources.
My problem is that during this time I don't want normal jobs to eat up these resources. (Some resources may be freed earlier than others, and a small job could then take the resources that we freed for the VIP job.)

--->> Yes, suspending normal.q is the approach we had thought about, but it has many side effects, as you know.
If the jobs take longer to checkpoint (say 5 hours), then all my queues are blocked and cannot do any processing during that time, even if a few other resources become available (e.g. if another job finishes and frees 8 cores, these queues cannot run jobs on them).

---->>> Is there any other way to reserve the resources for the duration of checkpointing, so that no other job can take and use them?

Kindly give this some thought.


PS: If you are doing it all by hand without a checkpointing environment, the queue instances normal.q@exehostXY just need to be disabled with `qmod -d normal.q@exehostXY`, AFAICS.
-- Reuti

> Kindly help me: how can I reserve the resources for the duration of checkpointing, so that no other, unimportant job can start?

