[GE users] Suspending a job

reuti reuti at staff.uni-marburg.de
Mon Jul 5 16:26:27 BST 2010

On 05.07.2010 at 16:30, spow wrote:

> I am new to both SGE and system administration.
> As a part of my internship, I am to configure the Grid Engine for a 
> small cluster.
> I'm having a hard time with this task because I am more used to having 
> loads of documentation and tutorials, which are scarce for SGE, and 

I would say there is plenty of end-user documentation. You can use the older PDF docs (several hundred pages: http://docs.sun.com/app/docs/coll/1017.4 ), which are still valid but don't cover all of the new features. Other versions are covered in the wikis and also in the Howtos.


> that's why I'm using this mailing list for the first time.
> Most of the cluster is running parallel jobs, as well as a few 
> interactive ones.
> My idea is to separate the cluster into 3 overlapping areas :
> 1/ Parallel : One big queue on all the nodes, a few others for smaller 
> parallel jobs
> 2/ 'Simple area' : standard jobs are executed in here
> 3/ Interactive jobs : a few queues with only one host each

We have a similar setup, mostly to adjust the number of slots per type of job.

> But I also think it'd be wise to implement standard queues overlapping 
> the parallel ones, in case some nodes are not used or there are no 
> parallel jobs running at all. Thus they should each be subordinated to the 
> parallel queue they share nodes with.
> My problem is that this scheme is causing some trouble related to 
> context saving :
> I saw there is a checkpointing option that allows saving the context in 
> case a heavy job crashes.

No. Your applications must already support checkpointing on their own, outside of SGE. SGE can then be set up to trigger this existing checkpointing mechanism.

That said, the checkpointing interface in combination with subordination can be used to requeue a preempted job when a superordinated job starts. But as resources are only released after the subordinated job has been requeued, the superordinated job must be able to start even though some resources are still blocked by the subordinated job.
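
As a rough sketch of this checkpointing-plus-subordination idea, the following Python snippet renders a checkpointing environment definition in the attribute layout described in checkpoint(5). The name `requeue_on_suspend`, the directory `/tmp` and the helper `render_ckpt_env` are illustrative assumptions and not something from your setup. The relevant part is the `x` in `when`, which, as far as I understand the man page, asks SGE to checkpoint, abort and reschedule a job as soon as it gets suspended.

```python
# Hypothetical helper: renders a checkpointing environment definition
# using the attribute names from checkpoint(5). All values are assumptions.
CKPT_ATTRS = (
    "ckpt_name", "interface", "ckpt_command", "migr_command",
    "restart_command", "clean_command", "ckpt_dir", "signal", "when",
)


def render_ckpt_env(name, when="x", ckpt_dir="/tmp"):
    """Return the definition as text, one 'attribute value' pair per line."""
    values = {
        "ckpt_name": name,
        "interface": "APPLICATION-LEVEL",  # the application checkpoints itself
        "ckpt_command": "none",
        "migr_command": "none",
        "restart_command": "none",
        "clean_command": "none",
        "ckpt_dir": ckpt_dir,
        "signal": "none",
        "when": when,  # 'x': act as soon as the job gets suspended
    }
    return "\n".join(f"{key:<18} {values[key]}" for key in CKPT_ATTRS) + "\n"


if __name__ == "__main__":
    print(render_ckpt_env("requeue_on_suspend"), end="")
```

The output could be saved to a file and loaded with `qconf -Ackpt <file>`. The environment would then have to be listed in the `ckpt_list` of the subordinated queue and requested at submission time with `qsub -ckpt requeue_on_suspend`.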

> Hence my question : Is there a way to use this checkpointing in 
> case a job is preempted ?
> For instance, I have standard jobs running in a queue sharing nodes with 
> the empty parallel one. As a parallel job is submitted, the subordinate 
> standard queue is suspended, and this has 2 bad implications :
> _ a part of the RAM is used by the simple job. I do not think it is 
> freed up, and this might cause problems for the upcoming parallel job, 
> for it does consume loads of RAM

It depends. Once the suspended job's memory has been swapped out, you shouldn't see any further performance impact.

> _ If the parallel job is taking too long, I cannot suspend it in order to 
> start the suspended simple job again
> To sum up :
> 1/ Is there a way to have the context saved, in order to free up the RAM ?
> 2/ Is it possible to reschedule a suspended job somewhere else while 
> keeping its context ?
> 3/ If none of the above is possible, what do you recommend I do ?

If you are not satisfied with the above options, you will have to use a co-scheduler, which will requeue the job in question to free up resources. It also needs to take measures to prevent the requeued job from restarting immediately.
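
A minimal sketch of what such a co-scheduler might do per job, under some assumptions: the job IDs are made up, the helper names are hypothetical, and it assumes that a hold placed with `qhold` on a running job is honored once the job has been requeued with `qmod -rj` (please check this on your version). The job also has to be rerunnable (`qsub -r y`) or use a checkpointing environment, and `qmod -rj` may need operator or manager privileges.

```python
import subprocess


def plan_requeue(job_ids):
    """Return the commands (as argv lists) to requeue each job and keep it held."""
    commands = []
    for job_id in job_ids:
        jid = str(job_id)
        commands.append(["qhold", jid])        # hold first, so it cannot restart at once
        commands.append(["qmod", "-rj", jid])  # then reschedule (requeue) the running job
    return commands


def execute(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical job IDs, dry run only.
    execute(plan_requeue([4711, 4712]))
```

Once the parallel job has finished, the co-scheduler would release the hold again with `qrls <job_id>`. Which jobs to pick and when to release them is the actual policy and is left out here.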

Should a parallel job always preempt a serial one in your setup?

-- Reuti

> Thanks for reading.
> Guillaume Quéré
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266171
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

