[GE users] actively draining a queue to allow big parallel job ....

Reuti reuti at staff.uni-marburg.de
Thu Jul 12 19:36:48 BST 2007


Am 12.07.2007 um 18:35 schrieb SLIM H.A.:

> I read from the reply to a previous question that checkpointing  
> with the
> migr_command will not suspend a job:
>
>> As you would like the normal jobs to be checkpointed instead
>> of suspended, you could setup a checkpointing environment in
>> SGE with "application-level" interface. The to be defined
>> "migr_command"- script  in this setup (as it's aware which
>> job it belongs to) can easily write the necessary stop-file.
>> So all small jobs have to specify to run with this
>> checkpointing environment.
>>
>> http://gridengine.sunsource.net/howto/checkpointing.html
>> section "The application-level interface".
>>
>> Be aware, that SGE will in this case neither kills the normal
>> job, nor suspends it. This is up to your script now! The

Okay, should read: "...neither kill the normal *process* on the node,  
nor suspend it".

>
> The last paragraph seems to contradict to what the HowTo page says  
> about
>
> The application-level interface:
>
> ", the "migr_command" procedure will be executed if you suspend the  
> job
> or the queue which the job is running in "
>
> This suggests that the migr_command is executed _after_ the job is
> suspended (by the user or gridengine).
> Is suspending always done by sending the STOP signal (17)?

A job is an entry in SGE which runs an application/process on a node.

If you now suspend a job, SGE will send a SIGSTOP to the complete process group on the node.
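For illustration, the node-side effect of a plain suspend can be mimicked with a stand-in process. The qmod commands in the comments are the real SGE interface; the sleep process below merely stands in for the job:

```shell
# Stand-in for SGE's plain suspend: "qmod -sj <jobid>" makes the execd
# send SIGSTOP to the job's whole process group, "qmod -usj <jobid>"
# sends SIGCONT. A single background sleep stands in for the job here.
sleep 60 &
pid=$!
kill -STOP "$pid"                      # what suspend does on the node
sleep 1                                # give the signal time to land
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state after SIGSTOP: $state"     # a leading T means stopped
kill -CONT "$pid"                      # what unsuspend does
kill -TERM "$pid"                      # clean up the demo process
```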

Unless you use the checkpointing interface with application-level checkpointing: then suspending a job won't send any signal anywhere; instead your defined migration script will be executed. This script can then tell the application (i.e. the process on the node) to do a checkpoint to a file, send any signal to the application, kill the application, kill the process group of the application... whatever you like. But you must do it on your own in the script; otherwise the application/process will simply continue to run on the node.
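A migration script along these lines could do that. Everything here is a sketch: the stop-file name and the checkpoint directory layout are conventions invented for the example, and it assumes the migr_command is defined in the checkpointing environment with pseudo variables, e.g. `/path/migr_command.sh $job_id $job_name $queue` (see checkpoint(5)), so they arrive as positional arguments:

```shell
# Sketch of a migr_command for an application-level checkpointing
# environment. The stop-file name and directory layout are assumptions
# of this sketch, not SGE conventions. Wrapped as a function so the
# demo below can exercise it.
request_checkpoint() {
    job_id=$1; job_name=$2; queue=$3
    ckpt_dir=${CKPT_DIR:-/shared/checkpoints}/$job_id
    mkdir -p "$ckpt_dir"
    # Tell the application to checkpoint and exit, via the stop-file
    # it polls for. SGE itself does nothing further: if the process
    # group must be killed right away, this script has to do it too.
    touch "$ckpt_dir/STOP"
    echo "requested checkpoint of job $job_id ($job_name) in $queue"
}

# Demo against a temporary directory instead of the shared filesystem:
CKPT_DIR=$(mktemp -d)
request_checkpoint 4711 small_sim normal.q
```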

Really suspending the application is worthless in this setup: a) it would block resources on the node, and b) the job might get rescheduled to a completely different node anyway.
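Putting the pieces together, the checkpointing environment might be configured roughly like this. Names and paths are made up for the sketch; "when xs" is meant to trigger migration on suspend and on execd shutdown, but check the checkpoint(5) man page of your SGE version before relying on it:

```
# Checkpoint environment (create with: qconf -ackpt app_ckpt)
ckpt_name          app_ckpt
interface          APPLICATION-LEVEL
ckpt_command       none
migr_command       /usr/local/sge/scripts/migr_command.sh $job_id $job_name $queue
restart_command    none
clean_command      none
ckpt_dir           /shared/checkpoints
signal             none
when               xs

# Small jobs then request it at submission time, e.g.:
#   qsub -ckpt app_ckpt -pe normal 4 small_job.sh
# while the big jobs go to big.q, to which normal.q is subordinated.
```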

-- Reuti


> Thanks
>
> Henk
>
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 24 June 2007 16:20
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] actively draining a queue to allow
>> big parallel job ....
>>
>> Lydia,
>>
>> checkpointing a parallel application is always tricky, but if
>> your application support this, it's great.
>>
>> But instead of writing a cron job, I would suggest of having
>> two parallel queues: one for the normal jobs normal.q, one
>> for the big ones big.q (which should start immediately). The
>> normal.q is subordinated to big.q, hence will be suspended if
>> the big job starts to run.
>>
>> As you would like the normal jobs to be checkpointed instead
>> of suspended, you could setup a checkpointing environment in
>> SGE with "application-level" interface. The to be defined
>> "migr_command"- script  in this setup (as it's aware which
>> job it belongs to) can easily write the necessary stop-file.
>> So all small jobs have to specify to run with this
>> checkpointing environment.
>>
>> http://gridengine.sunsource.net/howto/checkpointing.html
>> section "The application-level interface".
>>
>> Be aware, that SGE will in this case neither kills the normal
>> job, nor suspends it. This is up to your script now! The
>> Howto assumes to run the jobs local on a node, so all interim
>> files must be copied to a common checkpoint directory to
>> migrate, and reused when the job starts again. Hence to be
>> copied from this shared location to a different node's
>> $TMPDIR. If your aplication is using a shared CWD anyway,
>> this might not be necessary.
>>
>> If you have more than one core per node, this might lead to
>> the situation, that too many jobs are stopped first. After
>> the big jobs started to run, some of these stopped smaller
>> jobs might start in the cluster again with a different
>> distribution schema. This depends of the allocation rule of
>> the PE for the normal and big jobs. Maybe it would be good,
>> to have always a fixed allocation rule like 2 or 4 (or at
>> least $fill_up).
>>
>> -- Reuti
>>
>>
>> Am 24.06.2007 um 16:27 schrieb Lydia Heck:
>>
>>> I am investigating a way to allow big parallel jobs (say
>> 128 cpu jobs)
>>> a sensible chance to get time on a cluster which only runs parallel
>>> jobs.
>>>
>>> I would like to avoid that half the cluster is empty when
>> queues are
>>> drained of existing smaller runs. I would like to avoid to
>> kill jobs
>>> in mid- flow, before they can sensibly stop, and lose hundreds of
>>> hours of cpu time.
>>>
>>> What I would like to do is the following - and maybe somebody out
>>> there has already done something similar. In order to
>> understand the
>>> story better here is a short introduction to they type of
>> codes which
>>> are run:
>>>
>>> The codes are "checkpointing" frequently by writing re-start files,
>>> where the frequency is determined by the user in an input file.
>>>
>>> The codes can be stopped by creating an empty file in a specific
>>> directory, of which the name is predefined either in the
>> code or in an
>>> input file.
>>>
>>> What I would like to do is to check periodically using a
>> cron job, if
>>> a "big" job has been submitted. If a big job has been submitted I
>>> assume that one parameter - -pe big 128  has been set. I
>> then prepare
>>> the parallel environment to have 128 cpus, subtracting the
>> number of
>>> cpus from the "normal"
>>> parallel
>>> environment. So far so good!.
>>>
>>> Then I would like to "send" a signal to a set of running
>> jobs, with a
>>> total of
>>>> = 128 cpus, to write their restart files and terminate sensibly.
>>>
>>> There is of course the way to find out from where the job has been
>>> started, to tell the users to make sure that their program
>> tests for
>>> the "stop" file to be listed in the CWD directory. But that is
>>> somewhat fraught with problems. One can easily envisage
>> that two jobs
>>> are started from that directory and both jobs would see the
>> stop file.
>>> So the user would have to make sure that a unique stop file
>> is looked
>>> for, which could of course depend on the  PID of the master
>> process in
>>> an MPI job.
>>> Again there is problem, as the PID on one system is unique,
>> but with
>>> hundreds of systems the same PID for two different jobs
>> could happen.
>>> So the user would have to test it against the JOB_ID from
>> grid engine
>>> if that is possible.
>>>
>>> It would be neater, if a sig handling call could be
>> introduced to the
>>> codes as a matter of course. However the signal would have to be
>>> transported: A simple
>>> qdel would not be possible as that kills the job outright.
>>>
>>> If anybody has thought of a scenario like this, and was prepared to
>>> share there solutions or attempts to it, I would be
>> grateful to hear
>>> from them.
>>>
>>>
>>>
>>> ------------------------------------------
>>> Dr E L  Heck
>>>
>>> University of Durham
>>> Institute for Computational Cosmology
>>> Ogden Centre
>>> Department of Physics
>>> South Road
>>>
>>> DURHAM, DH1 3LE
>>> United Kingdom
>>>
>>> e-mail: lydia.heck at durham.ac.uk
>>>
>>> Tel.: + 44 191 - 334 3628
>>> Fax.: + 44 191 - 334 3645
>>> ___________________________________________
>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



