[GE users] actively draining a queue to allow big parallel job ....

Reuti reuti at staff.uni-marburg.de
Sun Jun 24 16:20:19 BST 2007


Checkpointing a parallel application is always tricky, but if your
application supports it, that's great.

But instead of writing a cron job, I would suggest having two
parallel queues: normal.q for the normal jobs and big.q for the big
ones (which should start immediately). normal.q is subordinated to
big.q, so it will be suspended when a big job starts to run.

As you would like the normal jobs to be checkpointed instead of
suspended, you could set up a checkpointing environment in SGE with
the "application-level" interface. The "migr_command" script that you
define for this environment knows which job it belongs to, so it can
easily write the necessary stop file. All small jobs then have to
request this checkpointing environment when they are submitted.

http://gridengine.sunsource.net/howto/checkpointing.html section "The  
application-level interface".
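
As a minimal sketch of what such a "migr_command" script could do
(written in Python, not taken from the Howto): it creates an empty
stop file whose name contains the job ID, so two jobs started from the
same directory cannot trip over each other's stop file. The file name
pattern "stop.<job_id>" and all three function names are assumptions
for illustration. The application has to poll for exactly this name,
and how the job ID and the directory reach the script depends on how
you define migr_command.

```python
from pathlib import Path


def stop_file_path(work_dir, job_id):
    """Return the per-job stop-file path (the naming scheme is an assumption)."""
    return Path(work_dir) / f"stop.{job_id}"


def request_stop(work_dir, job_id):
    """Create an empty stop file for one job and return its path."""
    path = stop_file_path(work_dir, job_id)
    path.touch(exist_ok=True)  # empty file; harmless if it already exists
    return path


def stop_requested(work_dir, job_id):
    """The check the application side would make between time steps."""
    return stop_file_path(work_dir, job_id).exists()
```

Because the job ID is part of the name, a stop request for job 4711
is invisible to job 4712 even if both run in the same CWD.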

Be aware that in this case SGE will neither kill nor suspend the
normal job. This is up to your script now! The Howto assumes that
jobs run locally on a node, so all interim files must be copied to a
common checkpoint directory in order to migrate, and copied back from
this shared location to the $TMPDIR of a (possibly different) node
when the job starts again. If your application uses a shared CWD
anyway, this might not be necessary.
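
The copy-out and copy-back steps could look like the following sketch.
It assumes that the job's scratch files live in one directory (for
example the job's $TMPDIR) and that ckpt_root is a directory shared
between all nodes. The function names and the per-job subdirectory
layout are made up for illustration and are not part of SGE.

```python
import shutil
from pathlib import Path


def save_interim_files(scratch_dir, ckpt_root, job_id):
    """Copy everything from node-local scratch to the shared checkpoint dir."""
    target = Path(ckpt_root) / str(job_id)
    # dirs_exist_ok (Python 3.8+) lets a later checkpoint copy over an
    # earlier one; stale files from the earlier copy are not removed.
    shutil.copytree(scratch_dir, target, dirs_exist_ok=True)
    return target


def restore_interim_files(ckpt_root, job_id, scratch_dir):
    """On restart, copy saved files into the new node's scratch directory.

    Returns False if no checkpoint exists yet (first start of the job).
    """
    source = Path(ckpt_root) / str(job_id)
    if not source.is_dir():
        return False
    shutil.copytree(source, scratch_dir, dirs_exist_ok=True)
    return True
```

The job script would call the restore step before launching the
application and the save step after the application has written its
restart files and exited.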

If you have more than one core per node, this might lead to a
situation where too many jobs are stopped at first. Once the big jobs
have started to run, some of the stopped smaller jobs might start
again in the cluster with a different distribution across the nodes.
This depends on the allocation rule of the PEs for the normal and big
jobs. It might be good to always use a fixed allocation rule like 2
or 4 (or at least $fill_up).
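
To illustrate why a fixed allocation rule makes the footprint of a job
predictable: with an integer rule, every host used by the job
contributes exactly that many slots, so the number of hosts follows
directly from the slot request. A small sketch of the arithmetic (the
function name is made up; treating a request that is not a multiple of
the rule as an error is my assumption about how such a request should
be handled):

```python
def hosts_for_fixed_rule(slots, allocation_rule):
    """Number of hosts a job occupies under a fixed integer allocation rule."""
    if allocation_rule <= 0:
        raise ValueError("allocation rule must be a positive integer")
    if slots % allocation_rule != 0:
        # Assumption: such a request cannot be placed with a fixed rule.
        raise ValueError("slot request is not a multiple of the allocation rule")
    return slots // allocation_rule
```

For example, hosts_for_fixed_rule(128, 4) gives 32, so a 128-slot big
job needs 32 complete 4-core nodes and no partially filled ones.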

-- Reuti

On 24.06.2007 at 16:27, Lydia Heck wrote:

> I am investigating a way to give big parallel jobs (say 128-CPU jobs)
> a sensible chance to get time on a cluster which only runs parallel
> jobs. I would like to avoid half the cluster sitting empty while
> queues are drained of existing smaller runs. I would also like to
> avoid killing jobs in mid-flow, before they can sensibly stop, and
> losing hundreds of hours of CPU time.
>
> What I would like to do is the following, and maybe somebody out
> there has already done something similar. To make the story easier
> to understand, here is a short introduction to the type of codes
> which are run:
>
> The codes "checkpoint" frequently by writing restart files, where
> the frequency is determined by the user in an input file. The codes
> can be stopped by creating an empty file in a specific directory;
> the name of that file is predefined either in the code or in an
> input file.
>
> I would like to check periodically, using a cron job, whether a
> "big" job has been submitted. If a big job has been submitted, I
> assume that one parameter, -pe big 128, has been set. I then prepare
> the parallel environment to have 128 CPUs, subtracting that number of
> CPUs from the "normal" parallel environment. So far so good!
>
> Then I would like to "send" a signal to a set of running jobs, with
> a total of >= 128 CPUs, to write their restart files and terminate
> sensibly.
>
> One way, of course, is to find out where the job was started from
> and to tell the users to make sure that their program tests for the
> "stop" file in the CWD. But that is somewhat fraught with problems.
> One can easily envisage two jobs being started from that directory,
> and both jobs would see the stop file. So the user would have to
> make sure that a unique stop file is looked for, whose name could of
> course depend on the PID of the master process in an MPI job.
>
> Again there is a problem: a PID is unique on one system, but with
> hundreds of systems the same PID could occur for two different jobs.
> So the user would have to test against the JOB_ID from Grid Engine,
> if that is possible.
>
> It would be neater if a signal-handling call could be introduced into
> the codes as a matter of course. However, the signal would have to be
> transported: a simple qdel would not be possible, as that kills the
> job outright.
>
> If anybody has thought about a scenario like this and is prepared to
> share their solutions or attempts at it, I would be grateful to hear
> from them.
> ------------------------------------------
> Dr E L  Heck
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
> United Kingdom
> e-mail: lydia.heck at durham.ac.uk
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
