[GE users] How to configure SGE

Reuti reuti at staff.uni-marburg.de
Thu Aug 3 20:06:33 BST 2006


Hi,

On 03.08.2006 at 17:18, Yoshio Tanaka wrote:

>
> Hello Reuti,
>
> Thanks for your comments.  Let me give answers to your questions and
> comments.
>
> reuti> > Would someone give advice and/or recommendations on how to
> reuti> > configure SGE to satisfy the following requirements?
> reuti> >
> reuti> > - Each user is able to execute at least 24 simultaneous jobs
> reuti> >   if nodes are available.
> reuti> >
> reuti> > - Even if a user is executing 24 jobs, he/she is allowed to
> reuti> >   execute more jobs if nodes are available.
> reuti>
> reuti> Why just 24? Do you have enough nodes that you could give each
> reuti> user 24 machines/slots as his primary machines/slots, and later
> reuti> on use a type of secondary queue for each one?
>
> Unfortunately, we only have 32 nodes shared by a few users.  Each user
> may submit several hundred short (about 10 minutes) jobs.
> Therefore, we cannot give each user 24 machines/slots as his primary
> machines/slots.

Okay.

> reuti> > - If a user (user A) is executing more than 24 jobs and
> reuti> >   another user (user B) submits a new job, user A's excess
> reuti> >   jobs will be killed and user B's jobs will be activated.
> reuti> >   User A's killed jobs will be re-submitted to the queue.
> reuti>
> reuti> To achieve this, you could combine a subordinate queue (to
> reuti> suspend the jobs) with the checkpointing feature, where a
> reuti> suspend will reschedule a job. But it's not a good setup if
> reuti> nodes are hard-wired to users already, as mentioned before.
>
> We actually considered using a subordinate queue, but we did not
> choose this option since
> - we would not like to provide two queues for users,
> - in our understanding, all jobs in a subordinate queue will be
>   suspended if the number of jobs submitted to the primary queue
>   exceeds the limit.

Correct.

> reuti> A simple fair-share setup isn't working for you - are the jobs
> reuti> running a long time?
>
> Since each job is short, fair share may be a good choice.
> However, does fair share support suspension and requeue?

Please have a look in the admin manual:

http://docs.sun.com/app/docs/doc/817-5677?a=load

See page 133 for the setup. There will be no suspension or
rescheduling, but since the jobs are short, as you said, user B's
newly submitted jobs will start as soon as any of user A's jobs end
(even though user A submitted hundreds of jobs before user B did).
The idea is that all users have the same number of jobs running in
the cluster. The only way to force this to happen sooner is to
reschedule a job by hand with qmod -rj.
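
To make the idea concrete, here is a small toy model in Python. It is
only a sketch under my own assumptions: it is not SGE's ticket
algorithm, the function name simulate() and the submission times are
made up for illustration, and the "fair share" rule is reduced to
"give each free slot to the user with the fewest running jobs". The
32 slots and the 10-minute job length are taken from this thread.

```python
SLOTS = 32          # nodes in the cluster (from the thread)
JOB_MINUTES = 10    # approximate job length (from the thread)


def simulate(submissions, until):
    """Toy equal-share dispatcher without suspension or preemption.

    submissions: list of (submit_minute, user, number_of_jobs)
    Returns {minute: {user: running_jobs}} for minutes 0..until.
    """
    pending = {}    # user -> number of waiting jobs
    running = []    # list of (end_minute, user)
    history = {}
    for minute in range(until + 1):
        # 1. Finished jobs release their slots.
        running = [(end, u) for end, u in running if end > minute]
        # 2. Newly submitted jobs join the pending list.
        for when, user, count in submissions:
            if when == minute:
                pending[user] = pending.get(user, 0) + count
        # 3. Fill free slots, favouring the user with the fewest running jobs.
        while len(running) < SLOTS:
            waiting = [u for u, n in pending.items() if n > 0]
            if not waiting:
                break
            counts = {u: sum(1 for _, r in running if r == u) for u in waiting}
            user = min(waiting, key=lambda u: (counts[u], u))
            pending[user] -= 1
            running.append((minute + JOB_MINUTES, user))
        history[minute] = {u: sum(1 for _, r in running if r == u)
                           for u in pending}
    return history


# User A fills the cluster in two batches; user B arrives at minute 5.
h = simulate([(0, "A", 16), (3, "A", 300), (5, "B", 100)], until=20)
print(h[5])    # {'A': 32, 'B': 0}   B has to wait, nothing is suspended
print(h[10])   # {'A': 16, 'B': 16}  A's first batch ended, B gets those slots
```

User B waits only until the first of user A's jobs end, which with
10-minute jobs is never long, and from then on both users hold half of
the cluster.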

One additional hint: you could also limit the number of running jobs
per user in the cluster by setting maxujobs to 24 in the scheduler
configuration ("qconf -msconf") as an additional limit, but then no
user will ever have more than 24 jobs running at the same time (and
nodes may sit idle).
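
The effect of such a cap on the 32 nodes can be checked with a few
lines of Python. Again this is only a sketch: running_with_cap() is a
hypothetical helper of mine, not anything SGE provides, and it simply
hands out slots one at a time to the user with the fewest running
jobs while honouring a per-user limit like maxujobs.

```python
SLOTS = 32  # nodes in the cluster (from the thread)


def running_with_cap(pending, cap=None, slots=SLOTS):
    """Distribute slots among users, optionally capped per user.

    pending: {user: number_of_submitted_jobs}
    cap:     per-user limit on running jobs (None means no limit)
    Returns ({user: running_jobs}, idle_slots).
    """
    running = {user: 0 for user in pending}
    free = slots
    while free > 0:
        # A user is eligible while he has jobs left and is below the cap.
        eligible = [u for u in pending
                    if running[u] < pending[u]
                    and (cap is None or running[u] < cap)]
        if not eligible:
            break
        user = min(eligible, key=lambda u: (running[u], u))
        running[user] += 1
        free -= 1
    return running, free


print(running_with_cap({"A": 300}, cap=24))            # ({'A': 24}, 8)
print(running_with_cap({"A": 300}))                    # ({'A': 32}, 0)
print(running_with_cap({"A": 300, "B": 100}, cap=24))  # ({'A': 16, 'B': 16}, 0)
```

With only one active user, the cap of 24 leaves 8 of the 32 nodes
idle, which is the drawback mentioned above; as soon as a second user
has enough jobs, the cap no longer matters.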

Cheers - Reuti
