[GE users] Suspending a job

spow guillaume.quere at fr.thalesgroup.com
Mon Jul 5 15:30:55 BST 2010


Hi,

I am new to both SGE and system administration.
As a part of my internship, I am to configure the Grid Engine for a 
small cluster.

I'm having a hard time with this task because I am used to having
plenty of documentation and tutorials, which are scarce for SGE. That
is why I'm using this mailing list for the first time.

Most of the cluster is running parallel jobs, as well as a few 
interactive ones.

My idea is to split the cluster into 3 overlapping areas:
1/ Parallel: one big queue on all the nodes, plus a few others for
smaller parallel jobs
2/ 'Simple area': standard jobs are executed here
3/ Interactive jobs: a few queues with only one host each

But I also think it would be wise to set up standard queues overlapping
the parallel ones, in case some nodes are not used or there are no
parallel jobs running at all. Each of them should therefore be
subordinated to the parallel queue it shares nodes with.
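
As a rough sketch of the subordination described above, the relevant
part of a parallel queue's configuration (as printed by qconf -sq) might
look like the fragment below. The queue names parallel.q and standard.q
are hypothetical, and the threshold of 1 is an assumption: it would
suspend standard.q on a host as soon as one slot of parallel.q is in
use on that host.

```
# Excerpt of: qconf -sq parallel.q
# (parallel.q and standard.q are hypothetical queue names)
qname                 parallel.q
# Suspend standard.q on a host once 1 slot of parallel.q is busy there
# (the threshold value is an assumption)
subordinate_list      standard.q=1
```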

My problem is that this scheme causes some trouble related to saving
job context:
I saw there is a checkpointing option that saves a job's context in
case a heavy job crashes.
Hence my question: is there a way to use this checkpointing when a job
is preempted?
For instance, I have standard jobs running in a queue sharing nodes with
the empty parallel one. When a parallel job is submitted, the
subordinate standard queue is suspended, and this has 2 bad implications:
_ Part of the RAM is still used by the suspended simple job. I do not
think it is freed, and this might cause problems for the incoming
parallel job, which consumes a lot of RAM.
_ If the parallel job is taking too long, I cannot suspend it in order
to resume the suspended simple job.

To sum up:
1/ Is there a way to have the context saved, in order to free up the RAM?
2/ Is it possible to reschedule a suspended job somewhere else while
keeping its context?
3/ If none of the above is possible, what do you recommend I do?


Thanks for reading.
Guillaume Quéré

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266171