[GE users] qlicserver behavior for suspended jobs

olesen Mark.Olesen at emconTechnologies.com
Fri Nov 20 11:52:42 GMT 2009


> > Oh no!
> > Are you really sure you want to treat the available/used licenses as a load?
> >
> > There is some background here:
> >   http://olesenm.github.com/flex-grid/background.html
> > including a link to a pdf that (I hope) might make you reconsider if you really want
> > to use load sensors for managing consumable resources. I certainly would not wish to.

> Correct me if I'm wrong, please: flex-grid tries to make sure jobs
> don't get queued if there are
> no licenses available. This is important if you're really constrained
> on compute resources,
> so it would be wasteful to run a job that has to sit and wait for a license.
> 
> I have a different situation; this thread suggests others have that case as well.
> 
> I have more CPUs than licenses.

I don't see much difference in what you describe.
With many CPUs and few licenses, you still wish to avoid jobs starting
when there aren't enough licenses.

If you use a plain load sensor in your situation (many CPUs and few
licenses), you will get what I termed a "crash condition" in the pdf
presentation
(http://olesenm.github.com/flex-grid/doc/SGE-WS2007-FlexLM-Integration-MarkOlesen.pdf),
since calling it a race condition is really much too mild.

Let's see what could happen in your case if you use a plain load sensor.
To illustrate things, we'll deliberately make it quite extreme.
Say you have lots of CPUs (1000) and several waiting jobs (500) but
relatively few licenses available (currently 0, for whatever reason).

The load sensor reports 0 licenses.
The cluster is empty (1000 slots available), but the 500 jobs are
waiting correctly for a license, since you specified '-l somelicense=1'.
At some point, your rare license becomes available.
At the next reporting interval, the load sensor will report that 1 license
is available.
The resource conditions (slots and license) are now satisfied and the
scheduler dispatches the 500 jobs.
A single job will win the race and get the license. The other 499 jobs
will either fail miserably, or perhaps queue up their request to FlexLM
and wait internally for a license.
In either case you have a serious problem. Either almost every job
fails, or you jam up the cluster with jobs that aren't doing anything
except waiting internally for a license. If the second is the behaviour
you really want, then there is no point bothering with having GridEngine
manage the resource for you.
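
To make the numbers above concrete, here is a toy model in Python. It is
only a sketch of the reasoning, not GridEngine scheduler code: the two
function names are made up for illustration, and the only assumption is
that a value reported by a load sensor is not decremented while jobs are
being dispatched, whereas a consumable is booked per job.

```python
def dispatch_with_load_value(reported_free, waiting_jobs, free_slots):
    """Scheduler pass when the license count is only a reported load value.

    The reported value is not decremented as jobs are dispatched, so every
    waiting job that asks for one license sees the same stale count.
    """
    started = 0
    for _ in range(waiting_jobs):
        if free_slots - started > 0 and reported_free >= 1:
            started += 1
    return started


def dispatch_with_consumable(free_licenses, waiting_jobs, free_slots):
    """Scheduler pass when the license is booked as a consumable."""
    started = 0
    for _ in range(waiting_jobs):
        if free_slots - started > 0 and free_licenses >= 1:
            free_licenses -= 1  # the license is debited for each started job
            started += 1
    return started


if __name__ == "__main__":
    # 1000 slots, 500 waiting jobs, 1 license just became available
    n = dispatch_with_load_value(1, 500, 1000)
    print("load value only:", n, "started,", n - 1, "without a license")
    n = dispatch_with_consumable(1, 500, 1000)
    print("consumable:", n, "started")
```

With the plain load value all 500 jobs start and 499 of them have no
license; with per-job booking only one job starts.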


In any case, the best idea is to create a new complex (e.g., 'dummytest')
and report its value via a load sensor as you planned. Instead of
querying a license server, the load sensor would just read the value
from a file that you can adjust by hand. For a test job, you can use a
simple 'sleep 1000' with the appropriate qsub -l dummytest=1 parameter.
Play with it a bit and see if you get reasonable behaviour for various
usage scenarios.
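
A minimal sketch of such a test load sensor, written in Python, could look
like the following. It follows the usual load sensor convention: wait for
a line on stdin, exit on "quit", otherwise print a begin/end block with
host:name:value lines. The file path is a hypothetical placeholder, and
reporting under the host name "global" is an assumption that only fits if
you treat 'dummytest' as a cluster-wide value. The complex itself still
has to be defined and the script registered as a load sensor in your
configuration.

```python
#!/usr/bin/env python3
import sys

COMPLEX_NAME = "dummytest"            # the test complex from the text
VALUE_FILE = "/tmp/dummytest.value"   # hypothetical path, edit by hand


def read_value(path):
    """Return the integer stored in the file, or 0 if it is missing/bad."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return 0


def format_report(value, host="global", name=COMPLEX_NAME):
    """Build one load report block in the begin/end format."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Each line received on stdin is a request for a fresh report.
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(read_value(VALUE_FILE)))
        stdout.flush()  # the report must not sit in a buffer


if __name__ == "__main__":
    main()
```

Changing the number in the file by hand then stands in for licenses
coming and going, so you can watch how the waiting 'sleep 1000' jobs
get dispatched.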

/mark

