[GE users] scheduler crashes on applying qrs for queued job?

SLIM H.A. h.a.slim at durham.ac.uk
Mon Sep 17 17:04:17 BST 2007


Reuti

The problem I want to solve is this. Part of the cluster was financed
by a small group of users, but this subcluster may also be used by
others. The owner-users should get access at short notice and/or with
higher priority than the other users.
One option is to give the owner-users unlimited CPU time and restrict
it for others. Other users would then be guaranteed access for a
limited amount of time when the subcluster is not in use by the owners.
I agree I could create additional queues to manage this, but I would
like to keep things simple for the users and use a single queue. Our
system is quite heterogeneous with respect to nodes, with 3 types of
interconnect and various serial and parallel queues. A resource quota
set (rqs) looked like a simple solution (the example that crashes the
scheduler is merely a simple test).

Regards

Henk

> 
> Resource quotas are targeting consumable and fixed complexes for now.
> Making h_rt consumable is not really an option, but it would be
> possible to define two queues on @pvmhosts with different user_lists
> set. The time you set there (in the queue definition) for h_rt in one
> of them will also be enforced. Maybe you also have to limit the total
> slot count in the exechost definition, as you now have (at least) two
> queues per machine.
>
> But anyway: crashing the scheduler is of course a bug; there should
> be an error message instead. I suggest filing an issue for it.
>
> -- Reuti
>  
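
For reference, the two-queue setup suggested above might look roughly
like the fragment below. This is only a sketch: the queue names owner.q
and guest.q, the ACL name "owners" and the slot count of 4 are
assumptions made for illustration and do not come from the actual
cluster. Only the attributes that would differ between the two queues
are shown, in the layout printed by qconf -sq. The lines starting with
# are annotations for the reader, not part of the configuration.

```
# owner.q: only members of the (hypothetical) ACL "owners" may use it
qname        owner.q
hostlist     @pvmhosts
user_lists   owners
xuser_lists  NONE
h_rt         INFINITY
slots        4

# guest.q: everyone except the owners, at most 10 minutes wall clock
qname        guest.q
hostlist     @pvmhosts
user_lists   NONE
xuser_lists  owners
h_rt         00:10:00
slots        4

# exec host definition (edited with qconf -me <hostname>): cap the
# total number of slots across both queues on each machine
complex_values  slots=4
```

Users would still submit without naming a queue; the access lists alone
should decide which of the two queues a job is eligible for, so the
extra queue stays mostly invisible to them.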

> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 17 September 2007 16:12
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] scheduler crashes on applying qrs for queued job?
> 
> Am 17.09.2007 um 16:33 schrieb SLIM H.A.:
> 
> > The following problem occurred with 6.1u2 on OpenSUSE 10:
> >
> > I have a single resource quota set with a limit set for a single user
> > dcl0has:
> >
> > {
> >    name         ham3
> >    description  "Restrict wall clock time"
> >    enabled      TRUE
> >    limit        users dcl0has hosts @pvmhosts to h_rt=600
> > }
> >
> > When user dcl0has submits a job the scheduler crashes. The job remains
> > in the qw state and after the scheduler has been restarted it will try
> > to schedule it but crashes again.
> >
> > Replacing the user by an acl, like
> >
> >    limit        users testproject hosts @pvmhosts to h_rt=600
> >
> > where qconf -su testproject gives
> > name    testproject
> > type    ACL
> > fshare  0
> > oticket 0
> > entries dcl0has, ... and more users
> >
> > does work for user dcl0has although the wall clock time limit of 10 
> > minutes is not enforced. I assume h_rt can be used in rq sets?
> >
> > I haven't found anything relevant in the message files (yet).
> > Is this a known problem?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
