[GE users] Preemption vs dedicating nodes by group?

Justus Loerke loerke at molgen.mpg.de
Fri Apr 29 11:42:29 BST 2005

Hi Jim,

Yes, I have configured all secondary queues to be open to all users. My first 
idea was to reconfigure queue access depending on a load sensor (as 
suggested by Reuti, I believe, in a previous thread) monitoring the jobs 
submitted by the different groups. This setup, however, saves me the 
work of having to reconfigure queue access all the time.

I have done only some preliminary testing so far, since I have jobs 
running on my machines and didn't want to disrupt any of them. 
Suspending and restarting jobs seems to work fine, but I will have to 
do some work on the file corruption problem at the scripting level 
first. I am not just worried about this; in our case it's a serious 
problem. We are running an iterative molecular volume refinement, split 
into some hundreds of single jobs that read and write to the same files 
more or less the whole time they are running. If a job of this kind is 
killed and restarted, it will find a file (one it produced itself 
before being killed) that holds a mix of old and new data (the new data 
coming from the previously started and killed job). This will in effect 
kill not only this one job but the whole iterative process that depends 
on the data (or even worse: corrupt the data and introduce errors into 
our calculations). It's not that hard to work around (using temporary 
data files), but I haven't had time to do it yet. So far I have only 
done some testing on the basic queue setup.
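
For illustration, the temporary-file workaround could look roughly like 
the sketch below. It is written in Python only as an example; the name 
write_atomically and the file name in the usage line are hypothetical 
and not taken from our actual scripts. The idea is to write the new 
data to a temporary file in the same directory and rename it over the 
real file only once the write has completed, so a job killed halfway 
through leaves the old file intact instead of a mix of old and new 
data.

```python
import os
import tempfile


def write_atomically(path, data):
    """Replace the file at `path` with `data` (bytes) in a single step.

    Hypothetical helper: a killed job leaves either the complete old
    file or the complete new file, never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the
    # final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # Swap the finished file into place over the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, drop the temporary file and keep the old data.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Usage (file name is made up for the example):
# write_atomically("volume_iter.dat", new_volume_bytes)
```

A job that is killed outright (SIGKILL) can still leave a stray 
.tmp-* file behind, which a restarted job would have to clean up, and 
this only covers half-written files; several jobs writing to the same 
file at the same time would still need some form of locking.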

I took the idea for this setup from the 'HOWTO configure idle 
workstations for queue work and howto to migrate jobs from workstations' 
on the GE page. The only difference is that I'm really using 
checkpointing only to migrate jobs; the CPU time already spent on the 
job is wasted (although I'm working on user-level checkpointing as 
well). In my understanding, checkpointing is activated by the -ckpt 
flag in qsub and triggered by the suspend command, and the 
checkpointing option 'restart on suspend' then leads to migration. Jobs 
submitted without the -ckpt flag will run on a primary queue, or on 
secondary queues if any primary queues are empty, but will not be 
kicked off the secondary queue if a higher priority job comes along; 
they will only be suspended, as with normal subordinated queues. In 
some cases this may not be a problem, but in my case I used migration 
to work around the worst case, in which some group1 calculations sit 
idle, waiting for one job that has been suspended for weeks on a 
secondary queue because group2 is using all primary queues.

Hope that helps,


Dipl. Phys. Justus Loerke
- UltraStrukturNetzwerk -
Max Planck Institute for Molecular Genetics
Ihnestr. 63-73
D-14195 Berlin

Tel.:   +49-30-8413-1644
Fax:    +49-30-8413-1385
E-mail: loerke at molgen.mpg.de  
