[GE users] qlicserver behavior for suspended jobs

olesen Mark.Olesen at emconTechnologies.com
Thu Nov 19 15:49:04 GMT 2009


Hi,

> Mark,
> 
> Could you take a look at the following thread and see if it makes sense in my situation?
> 
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=199539
> 
> I'm wondering if it points to either an SGE configuration or maybe a possible implementation in qlicserver.

Yeah, I saw what Reuti wrote. Indirectly, this answers one of your
original questions: GridEngine considers the consumables to be bound to
your suspended job, even if your application doesn't.

As Reuti points out, there are two problems:
1:
> The problem is, that the complex is increased after the suspension is
> issued. So you can suspend by hand and it's working*, but it can't be
> used for subordination.

You'd either need to have enough licenses to kick your job out, or force
job suspension manually. This probably isn't the real problem in your
case.

The second problem is the real problem.
2:
> The way back is more complex. When you resume the job, you first have 
> to check in a "qmod -usj ..." wrapper whether enough resources are 
> still free, and maybe decrease it already at that time. When the 
> resume script runs it's too late, and another may have just slipped in 
> and you are getting out of sync and a negative total count.


I don't see how you are going to get around this problem. If you
artificially increase the total number of licenses (to compensate for
the fact that your suspended job is no longer using them), another job
can squeeze in once the suspending job has finished. When your suspended
job then resumes, it will probably crash because someone else has its
licenses.
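
To make the race concrete, here is a minimal Python sketch of the bookkeeping. The LicenseLedger class is purely hypothetical (it is not part of GridEngine or qlicserver) and only assumes that "check whether enough licenses are free" and "take them" must happen as one step, as Reuti describes for the resume wrapper.

```python
import threading


class LicenseLedger:
    """Hypothetical in-memory count of free licenses (illustration only)."""

    def __init__(self, free):
        self._free = free
        self._lock = threading.Lock()

    def release(self, count):
        # A job was suspended and its application handed its licenses back.
        with self._lock:
            self._free += count

    def try_reserve(self, count):
        # Check and decrement in one locked step, so that no other job can
        # slip in between the check and the decrement.
        with self._lock:
            if self._free < count:
                return False
            self._free -= count
            return True

    @property
    def free(self):
        with self._lock:
            return self._free


ledger = LicenseLedger(free=0)
ledger.release(2)                 # job A is suspended, 2 licenses come back
got_b = ledger.try_reserve(2)     # a new job B squeezes in and takes them
got_a = ledger.try_reserve(2)     # job A's resume is refused, count stays >= 0
```

With the check done this way, the resume is refused instead of the count going negative. It does not solve the real problem, though: job A still cannot resume until job B gives the licenses back.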

The only chance you *might* have is to add an extra license check to
the prolog for new jobs. It should check whether the available licenses
really are available or are actually reserved for a suspended job that
is waiting to resume, and exit 99 accordingly. But this is all getting
pretty hairy to implement, since you are short-circuiting much of the
GridEngine logic.
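
A rough sketch of the decision such a prolog check would have to make, in Python. The function name and its three inputs are assumptions for illustration; where the numbers come from (the license server, your own record of suspended jobs) is exactly the hairy part and is not shown.

```python
# Exit status that asks for the job to be put back in the queue,
# as suggested in the paragraph above.
EXIT_RESCHEDULE = 99
EXIT_OK = 0


def prolog_exit_code(requested, free_now, reserved_for_suspended):
    """Return the exit status a hypothetical prolog check would use.

    requested:              licenses the new job asks for
    free_now:               licenses the license server reports as free
    reserved_for_suspended: licenses that suspended jobs will need back
    """
    # Licenses that are free but earmarked for a suspended job do not count.
    usable = max(0, free_now - reserved_for_suspended)
    if requested <= usable:
        return EXIT_OK
    return EXIT_RESCHEDULE
```

A real prolog script would end with something like sys.exit(prolog_exit_code(...)). For example, with 3 licenses free and 2 of them reserved for a suspended job, a new job asking for 2 would get exit status 99.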

---

Perhaps if you turn the problem around it might work.
Say you tried defining a resource 'lic' that is derived from the
resources 'licFloat' and 'licSusp'. The resource 'licFloat' is gathered
from FlexLM and 'licSusp' tracks the licenses that should be attached to
a suspended job. Maybe you can somehow get it working based on that type
of logic. How exactly is a bit beyond me though - sorry.
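
One possible reading of that idea, as a Python sketch. Everything here is an assumption: the tracker class is hypothetical, and "derived" is taken to mean lic = licFloat - licSusp (never below zero), which the message does not actually specify.

```python
class SuspendedLicenseTracker:
    """Hypothetical record of licenses owed to suspended jobs ('licSusp')."""

    def __init__(self):
        self._held = {}  # job id -> number of licenses to hand back on resume

    def suspend(self, job_id, count):
        # Remember how many licenses the job will need when it resumes.
        self._held[job_id] = count

    def resume(self, job_id):
        # Forget the job and return how many licenses it is taking back.
        return self._held.pop(job_id, 0)

    @property
    def lic_susp(self):
        return sum(self._held.values())


def derive_lic(lic_float, lic_susp):
    # Assumed derivation: what FlexLM reports as free, minus what suspended
    # jobs are owed, clamped so the result never goes negative.
    return max(0, lic_float - lic_susp)


tracker = SuspendedLicenseTracker()
tracker.suspend("1001", 2)                        # job 1001 suspended with 2 licenses
lic_while_suspended = derive_lic(5, tracker.lic_susp)
tracker.resume("1001")                            # job 1001 takes its licenses back
lic_after_resume = derive_lic(5, tracker.lic_susp)
```

The value 5 stands in for whatever FlexLM reports at that moment; after a real resume that number would itself drop as the job checks its licenses out again.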

/mark


