[GE users] Scheduling of long jobs

Andy Schwierskott andy.schwierskott at sun.com
Wed Aug 25 10:08:22 BST 2004


Patrice,

> That being said, in general subordinate queues are great for short vs. long
> jobs, and suspending the long jobs for short jobs to run then get out.
>
> But what is the better strategy when you have jobs that run for
> weeks/months, and it's many of the users submitting these types of jobs?
> (better than manually changing the maxjobs per user parameter relative to
> available queues at a given time)

Typically, sites with long-running jobs that simply do not fit well into a
standard policy scheme break their jobs up into chain jobs: after a certain
period of runtime the job exits and resubmits itself. SGE's "qresub" command
helps to simplify this. This is a typical mainframe operating mode.
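
As a rough illustration of such a chain job, here is a minimal sketch of one
link in the chain. It is not a tested recipe. The work command and the
exit-status convention (0 = all done, 10 = time slice used up, state saved,
more work remains) are hypothetical, and calling "qresub $JOB_ID" from inside
the job assumes that the execution host is allowed to submit jobs.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Sketch of one link in a job chain. Hypothetical convention: the work
# command exits 0 when everything is finished and 10 when it stopped at
# its time limit with work remaining (after saving its own state).

# run_link WORK_CMD RESUBMIT_CMD
# Runs one slice of work, resubmits if more work remains, and prints the
# action taken: "done", "resubmit" or "fail".
run_link() {
    work_cmd=$1
    resubmit_cmd=$2
    $work_cmd
    status=$?
    case "$status" in
        0)  echo done ;;
        10) $resubmit_cmd && echo resubmit ;;
        *)  echo fail; return 1 ;;
    esac
}

# Inside a real job the call might look like this (./do_work is a placeholder):
#   run_link ./do_work "qresub $JOB_ID"
```

Passing the resubmit command as an argument keeps the decision logic easy to
try out with "echo" in place of the real command before using it on a cluster.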

This can be automated by submitting these jobs as checkpoint jobs, where a
suspension of the job can be configured to cause a migration of the job. I'm
implying here that in real life such long-running jobs must be checkpointable
by some means anyway - probably no one seriously expects that there will be
no downtime of a server for such long periods. The risk of losing several
weeks of compute time is simply too high.
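
For the checkpointing variant, the job script has to decide on every start
whether to begin from scratch or to resume. A minimal sketch follows, assuming
that Grid Engine sets RESTARTED=1 in the environment of a job that was
restarted or migrated (please check the qsub man page of your version); the
checkpoint file name and the work command are hypothetical.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# start_mode RESTARTED_FLAG CKPT_FILE
# Prints "resume" when the job was restarted and a checkpoint file exists,
# otherwise "fresh".
start_mode() {
    if [ "${1:-0}" = "1" ] && [ -f "$2" ]; then
        echo resume
    else
        echo fresh
    fi
}

# In a real job (file name and work command are placeholders):
#   mode=$(start_mode "$RESTARTED" state.ckpt)
#   ./do_work --mode "$mode"
```

Such a script would be submitted with qsub's "-ckpt" option, naming a
checkpointing environment that the administrator has configured.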

Andy

> -----Original Message-----
> From: Patrice Seyed [mailto:apseyed at bu.edu]
> Sent: Thursday, August 19, 2004 7:17 PM
> To: 'users at gridengine.sunsource.net'
> Subject: RE: [GE users] Scheduling of long jobs
>
> Interesting Reuti, but like you said a job in long00b stays there even if a
> long00b is open. I'm not sure if this method is "smarter" than mine, but
> with your method you don't have a possible scenario where 3 jobs are running
> over 2 cpus. Even though that can occur with mine, it can for no more than
> 2 hours (hard limit on express queues), unless there are jobs waiting to
> get into an express queue.
>
> I agree it would be nice to be able to suspend a slot instead of a queue.
> The current setup is more attuned for single cpu jobs, and also for my
> cluster making a queue for each single cpu doesn't seem feasible.
>
> Hmm..
> Patrice
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, August 19, 2004 4:47 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] Scheduling of long jobs
>
> Hi Patrice,
>
>> I am using SGEE 5.3, and what I have done so far is deploy the concept of
>> "express" queues, where jobs submitted to these queues, once there are two
>> or more of them, suspend jobs in the regular queues until there is 1 or
>> fewer left in the express queue, with a 2 hour limit in the express queue.
>> Since my machines are dual-CPU I could not do hierarchical queues, in terms
>> of walltime, unless I made a queue for each job slot. Also I am aware of the
>> max jobs per user limit; this can help, but since it is a hard limit it also
>> can restrict when queues are open.
>
> yes, this is also the way I solved it. But you will need only three queues
> per node, and adjust them in a way that unnecessary suspends are avoided:
>
> $ qconf -sq long00a
> qname                long00a
> hostname             node00
> seq_no               11
> ..
> slots                1
> ..
>
> $ qconf -sq long00b
> qname                long00b
> hostname             node00
> seq_no               21
> ..
> slots                1
> ..
>
> $ qconf -sq short00
> qname                short00
> hostname             node00
> seq_no               51
> ..
> slots                2
> ..
> subordinate_list     long00a=2, long00b=1
> ..
>
> When you also set "queue_sort_method seqno", the long jobs will go to
> long..a first, but in case of a short job the long..b will be suspended
> first. Yes, it's not perfect, because no job will change from queue long..b
> to long..a when the job in long..a finishes.
>
> On the other hand, also with four queues per host, you would have the same
> load on a machine, whether there are 2 long or (1 long + 1 short) jobs
> running, and the scheduler will select one machine for you for your new
> short job.
>
> Maybe it would be an enhancement to SGE if you could specify not to suspend
> the whole subordinated queue, but only as many slots as are taken in the
> superordinated queue. The next enhancement would be to make a round robin
> over all the used slots in the subordinated queue, so that they share the
> remaining slot over time, e.g. to switch between the running jobs there
> every 5 minutes. Do you think it's worth entering in Issuezilla?
>
> Cheers - Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net


Regards,
Andy Schwierskott

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel:     +49 941 3075-200  (x60200)
N1 Grid Engine Engineering  Support: +49 941 3075-250  (x60250)
Sun Microsystems GmbH       Fax:     +49 941 3075-222  (x60222)
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
