[GE users] Scheduling of long jobs

Patrice Seyed apseyed at bu.edu
Wed Aug 25 17:43:31 BST 2004


Andy,

So basically you're saying to impose a restriction, or make it desirable,
for users to break their jobs up into "chain" jobs? By this do you mean
creating fewer queues with unlimited real time/wall time? I'm trying to
think of the best mechanism to impose or encourage this (assuming it is
possible for their jobs).

I can check whether this is feasible. I know that for some users a
checkpointing mechanism is more or less built in: if their job were killed,
the output file from the job would still exist, and they could resubmit it
to continue where it left off.
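
As a rough illustration of that kind of built-in checkpointing, here is a
minimal Python sketch. It assumes the job writes one result line per
completed step to its output file; the function names and the file layout
are hypothetical, not taken from any user's actual code, and a half-written
last line from a killed job is not handled.

```python
from pathlib import Path


def completed_steps(output_path):
    """Count the result lines left behind by an earlier (possibly killed) run."""
    path = Path(output_path)
    if not path.exists():
        return 0
    with path.open() as fh:
        return sum(1 for line in fh if line.strip())


def run_from_last_step(output_path, total_steps, compute):
    """Append one line per step, skipping steps that are already on disk.

    Returns the step index at which this run started.
    """
    start = completed_steps(output_path)
    with open(output_path, "a") as fh:
        for step in range(start, total_steps):
            fh.write(f"{step} {compute(step)}\n")
            fh.flush()  # keep the file usable as a checkpoint if we are killed
    return start
```

Resubmitting the same job script then simply continues from the first step
that is missing in the output file.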

-Patrice

-----Original Message-----
From: Andy Schwierskott [mailto:andy.schwierskott at sun.com] 
Sent: Wednesday, August 25, 2004 5:08 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Scheduling of long jobs

Patrice,

> That being said, in general subordinate queues are great for short vs.
> long jobs, suspending the long jobs so that short jobs can run and then
> get out.
>
> But what is the better strategy when you have jobs that run for
> weeks/months, and many of the users are submitting these types of jobs?
> (Better than manually changing the maxjobs-per-user parameter relative to
> the available queues at a given time.)

Typically, sites with such long-running jobs, which simply do not fit well
into a standard policy scheme, break them up into chain jobs: after a
certain period of runtime the job exits and resubmits itself. SGE's
"qresub" command helps to simplify this. This is a typical mainframe
operating mode.
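
A minimal Python sketch of one link in such a chain, under stated
assumptions: the work can be split into numbered steps, progress is kept in
a small JSON state file, and resubmission is done by calling plain "qsub"
on the job script. The names run_link, state_path and resubmit_with_qsub
are hypothetical, and this is not how qresub itself works.

```python
import json
import os
import subprocess
import time


def load_state(path):
    """Read the saved progress, or start from step 0 on the first link."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0}


def save_state(path, state):
    """Write the state to a temp file and rename it, so a kill cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def resubmit_with_qsub(script):
    """Assumption: the next link is submitted by running qsub on the job script."""
    subprocess.run(["qsub", script], check=True)


def run_link(state_path, total_steps, budget_seconds, do_step, resubmit,
             clock=time.monotonic):
    """Run steps until the time budget is used up, then resubmit and exit.

    At least one step runs per link, so the chain always makes progress.
    """
    state = load_state(state_path)
    deadline = clock() + budget_seconds
    while state["next_step"] < total_steps:
        do_step(state["next_step"])
        state["next_step"] += 1
        save_state(state_path, state)
        if state["next_step"] < total_steps and clock() >= deadline:
            resubmit()
            return "resubmitted"
    return "finished"
```

In a real job script the last argument could be something like
lambda: resubmit_with_qsub("job.sh"), with budget_seconds chosen safely
below the queue's wall-time limit.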

This can be automated by submitting these jobs as checkpointing jobs, where
a suspension of the job can be configured to cause a migration of the job.
I'm assuming here that in real life such long-running jobs must be
checkpointable by some means anyway - probably no one seriously expects
that a server will have no downtime over such long periods. The risk of
losing several weeks of compute time is quite high.

Andy

> -----Original Message-----
> From: Patrice Seyed [mailto:apseyed at bu.edu]
> Sent: Thursday, August 19, 2004 7:17 PM
> To: 'users at gridengine.sunsource.net'
> Subject: RE: [GE users] Scheduling of long jobs
>
> Interesting, Reuti, but like you said, a job in long00b stays there even
> if long00a is open. I'm not sure if this method is "smarter" than mine.
> With your method there is no possible scenario where 3 jobs are running
> on 2 CPUs; that can occur with mine, but for no more than 2 hours (the
> hard limit on express queues), unless there are jobs waiting to get into
> an express queue.
>
> I agree it would be nice to be able to suspend a slot instead of a queue.
> The current setup is more attuned to single-CPU jobs, and for my cluster
> making a queue for each single CPU doesn't seem feasible.
>
> Hmm..
> Patrice
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Thursday, August 19, 2004 4:47 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] Scheduling of long jobs
>
> Hi Patrice,
>
>> I am using SGEE 5.3, and what I have done so far is deploy the concept
>> of "express" queues: when two or more jobs are running in an express
>> queue, jobs in the regular queues are suspended until there is 1 or
>> fewer in the express queue, and jobs in the express queue have a 2 hour
>> limit. Since my machines are dual-CPU, I could not do hierarchical
>> queues in terms of walltime unless I made a queue for each job slot.
>> Also, I am aware of the max jobs per user limit; this can help, but
>> since it is a hard limit it can also be restrictive when queues are open.
>
> Yes, this is also the way I solved it. But you will need only three
> queues per node, adjusted in a way that avoids unnecessary suspends:
>
> $ qconf -sq long00a
> qname                long00a
> hostname             node00
> seq_no               11
> ..
> slots                1
> ..
>
> $ qconf -sq long00b
> qname                long00b
> hostname             node00
> seq_no               21
> ..
> slots                1
> ..
>
> $ qconf -sq short00
> qname                short00
> hostname             node00
> seq_no               51
> ..
> slots                2
> ..
> subordinate_list     long00a=2, long00b=1
> ..
>
> When you also set "queue_sort_method seqno", the long jobs will go to
> long..a first, but when a short job arrives, long..b will be suspended
> first. Yes, it's not perfect, because no job will move from queue long..b
> to long..a when the job in long..a finishes.
>
> On the other hand, even with four queues per host, you would have the
> same load on a machine whether there are 2 long or (1 long + 1 short)
> jobs running, and the scheduler will select a machine for your new short
> job.
>
> Maybe it would be an enhancement to SGE if you could specify not to
> suspend the whole subordinated queue, but only as many slots as are
> taken in the superordinated queue. The next enhancement would be a round
> robin over all the used slots in the subordinated queue, so that they
> share the remaining slot over time, e.g. by switching between the running
> jobs there every 5 minutes. Do you think it's worth entering in
> Issuezilla?
>
> Cheers - Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
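
For reference, the effect of the "subordinate_list long00a=2, long00b=1"
setting in the quoted short00 queue can be modelled in a few lines of
Python. This is only a sketch of the threshold rule (a subordinated queue
is suspended once that many slots of short00 are in use), not Grid Engine
code; the function names are hypothetical, and it assumes every entry
carries an explicit "=threshold".

```python
def parse_subordinate_list(value):
    """Turn 'long00a=2, long00b=1' into {'long00a': 2, 'long00b': 1}."""
    thresholds = {}
    for entry in value.split(","):
        name, _, threshold = entry.strip().partition("=")
        thresholds[name] = int(threshold)
    return thresholds


def suspended_queues(subordinate_list, used_slots):
    """Return the subordinated queues that are suspended for a given
    number of occupied slots in the superordinated queue."""
    thresholds = parse_subordinate_list(subordinate_list)
    return sorted(q for q, t in thresholds.items() if used_slots >= t)


for used in range(3):
    print(used, suspended_queues("long00a=2, long00b=1", used))
```

With one short job running only long00b is suspended; long00a follows once
both slots of short00 are taken.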


Regards,
Andy Schwierskott

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andy Schwierskott           Tel:     +49 941 3075-200  (x60200)
N1 Grid Engine Engineering  Support: +49 941 3075-250  (x60250)
Sun Microsystems GmbH       Fax:     +49 941 3075-222  (x60222)
Dr.-Leo-Ritter-Str. 7       mailto:andy.schwierskott at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
