[GE users] SGE 6.5 Scheduler Query--Reference

veerendra_n veerendra at yashasvi.co.in
Mon Feb 23 14:29:00 GMT 2009


Hi Reuti,

I will check the following, as per your advice:
1. Whether the application can release the license
2. Whether I am able to checkpoint it
If both of these work, how do I ensure that the running job is suspended
after 5 min and the job waiting in the queue is given precedence? Can you
shed some light on this?


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 23 February 2009 19:52
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference

Am 23.02.2009 um 13:19 schrieb veerendra_n:

> Yes, the license is monitored by FLEXlm (lmgrd).
> We can work on the 5 min interval; I do understand it's short.
> But what configuration do I need to make to get this working?

As I wrote: check whether you can suspend your application by hand
and trigger it to give back the license. Otherwise all endeavors are
useless. But even if this is working: the next advanced step would be
to checkpoint your application. This is nothing which is related to
SGE. When all this is working outside of SGE, then we can incorporate
it.

-- Reuti
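
A minimal sketch of the manual check suggested above, assuming a Linux host and Python 3. It stops a process with SIGSTOP (the default suspend signal in SGE) and continues it with SIGCONT. The function names and the pause_seconds parameter are made up for illustration and are not part of SGE. Whether the license is really given back during the pause has to be checked by hand on the license server, for example with FLEXlm's "lmutil lmstat -a".

```python
import os
import signal
import time


def suspend(pid: int) -> None:
    """Stop the process without killing it."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def try_suspend(pid: int, pause_seconds: float) -> None:
    """Suspend a process for a while, then resume it.

    While the process is stopped, look at the license server by hand
    to see whether the license was returned.
    """
    suspend(pid)
    time.sleep(pause_seconds)
    resume(pid)
```

Calling try_suspend(12345, 60) on the PID of a running tool (12345 is a placeholder) leaves one minute to inspect the license usage before the tool continues.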


> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 23 February 2009 17:43
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference
>
> Hi,
>
> Am 22.02.2009 um 06:44 schrieb veerendra_n:
>
>> The jobs that are run are typical ASIC design jobs (layout and
>> verification jobs), and each of these jobs requires a license. I'm not
>> sure if I have stated the requirement clearly.
>>
>> 1. We will have a short queue (short.q) configured with 5 slots and a
>> time limit of 5 min (a soft limit).
>> 2. If we have run 5 jobs, all 5 jobs will have occupied the queue. Now
>> if we fire a 6th job, one of the jobs which has taken more than 5 min
>> should be put on hold and the 6th job should be executed.
>>
>> How can this be done automatically? As suggested by you, how do I use a
>> co-scheduler? Do you have a sample of how to implement checkpointing?
>
> This is far from being trivial. It would be best to have someone at
> your location look into it. 5 minutes looks like a short turnaround.
> How long do your jobs usually run?
>
> The first thing to check is whether your application can be
> triggered to be put to sleep and give a license back at all (I assume
> this is counted by FLEXlm or the like).
>
> -- Reuti
>
>
>> I need some help....
>>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 22 February 2009 01:33
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference : Mr.Sanjeev Patil
>>
>> Veerendra,
>>
>> Am 18.02.2009 um 17:32 schrieb veerendra_n:
>>
>>> Hi All,
>>>
>>> I will be very grateful if you can help me clear up some Sun
>>> Grid Engine 6.2 queries.
>>>
>>> Query:
>>> My current query is regarding the hard run time limit and soft run
>>> time limit set on jobs:
>>> Let's say we create many queues. One of the queues is for 1 min
>>> jobs (i.e. jobs that should take about 1 min to complete, short
>>> jobs). Now let's say 5 jobs can run simultaneously and all the
>>> slots are occupied. Ideally the jobs should have taken around 1 min,
>>> but for some reason they actually run longer. If a sixth
>>> job is queued, it should get priority since it is a short job.
>>> The way it should be done is that the oldest job is suspended, one
>>> license is freed up, the sixth job (just submitted) runs, and the
>>> oldest one goes back in the queue for the license. So when the sixth
>>> job is over, the oldest job can get the license again (it is in
>>> the queue and will be processed based on the queue).
>>>
>>> I tried to test the above in my lab set up and the following is
>>> what I found :
>>>
>>> Hard Run Time setting:
>>> I configured a queue with Hard Run Time set to 3 minutes and
>>> tried to execute a job which takes more than 3 minutes.
>>> I found that the job got killed once the 3 minute interval was
>>> completed.
>>> (As per the Sun Grid Engine documentation, a SIGKILL signal is sent
>>> and the job gets killed.)
>>>
>>> Soft Run Time setting:
>>> I configured a queue with Soft Run Time set to 3 minutes and
>>> Notify Interval to 60 sec and tried executing the same job.
>>> The job got killed.
>>> (Again as per the documentation, a SIGUSR1 signal is sent as a warning
>>> after 3 minutes and a SIGKILL signal is sent to kill the job once
>>> the Notify Interval is over.)
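
The soft limit warning can be caught by the job itself. The following is a minimal sketch, assuming the job is a Python 3 script on a POSIX host and that the warning arrives as SIGUSR1, as the SGE queue configuration documentation describes for the soft run time limit. SoftLimitFlag, run_steps and save_state are hypothetical names, and splitting the work into restartable steps is an assumption about the application, not something SGE provides.

```python
import signal


class SoftLimitFlag:
    """Remembers that the soft limit warning has arrived."""

    def __init__(self) -> None:
        self.warned = False

    def handle(self, signum, frame) -> None:
        # Only set a flag here; the main loop reacts to it.
        self.warned = True


def run_steps(steps, flag, save_state):
    """Run the steps in order and stop early once the job was warned.

    Returns the index of the next step that still has to run.
    """
    for index, step in enumerate(steps):
        if flag.warned:
            save_state(index)  # record where a later run should resume
            return index
        step()
    return len(steps)


flag = SoftLimitFlag()
signal.signal(signal.SIGUSR1, flag.handle)
```

The job then has the notify interval to save its progress and exit on its own before SIGKILL is sent.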
>>>
>>> But as per the problem statement, I don't want the jobs to be
>>> killed; they should be suspended and rejoin the queue, and each job
>>> should resume once it gets a slot.
>>
>> This is not implemented in SGE. Once a job was allowed to start, it
>> is supposed to run up to its end. It might be suspended, but it will
>> still occupy the granted resources. This is not only a problem of
>> SGE, but also of your application: you would have to instruct it to
>> release its license temporarily.
>>
>> You could use a co-scheduler, which would check the waiting and
>> running jobs. When it discovers that another job should run, it has
>> to a) put a running job on hold (to prevent its immediate restart),
>> and b) reschedule the job. When there is no waiting job left, the
>> held (and rescheduled) one could be released and would restart
>> again. When I write restart, I mean it in exactly this way: without
>> any checkpointing, your job will always restart from the beginning.
>>
>> -- Reuti
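
The decision logic of such a co-scheduler could be sketched as follows. This is only an illustration in Python 3 under several assumptions: the Job class, pick_victim, plan and release are hypothetical names, the lists of running and waiting jobs would have to be obtained separately (for example by parsing qstat output), and rescheduling with "qmod -rj" needs manager rights and a rerunnable job or queue. The functions only build the qhold, qmod and qrls command lines; they do not run them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    runtime_s: int  # seconds the job has been running


def pick_victim(running: List[Job], limit_s: int) -> Optional[Job]:
    """Return the longest-running job above the limit, or None."""
    over = [job for job in running if job.runtime_s > limit_s]
    return max(over, key=lambda job: job.runtime_s) if over else None


def plan(running: List[Job], waiting_ids: List[int],
         limit_s: int = 300, slots: int = 5) -> List[List[str]]:
    """Build the commands for one pass of the co-scheduler."""
    if not waiting_ids or len(running) < slots:
        return []  # nothing is waiting, or a slot is still free
    victim = pick_victim(running, limit_s)
    if victim is None:
        return []  # no running job has exceeded the limit yet
    jid = str(victim.job_id)
    # a) hold the job so it does not restart at once, b) reschedule it
    return [["qhold", jid], ["qmod", "-rj", jid]]


def release(held_ids: List[int]) -> List[List[str]]:
    """Build the commands that release held jobs once nothing waits."""
    return [["qrls", str(jid)] for jid in held_ids]
```

Each returned command is an argument list that could be passed to subprocess.run. Without checkpointing, a job released this way starts again from the beginning.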
>>
>>
>>> How can I achieve this? Is it possible to write a script that
>>> reschedules the job to resume rather than killing it?
>>> I found that using the Qalter option (in the QMON GUI), I could
>>> reschedule the job manually, but this is not a solution for systems
>>> in a real-time environment.
>>> Is it possible through scripting, or is there any other option in
>>> the QMON GUI which can solve this problem?
>>> Please help me solve this issue.
>>>
>>> Will be waiting for your reply.
>>>
>>> Veerendra
>>>
>>>
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=111323
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].
>>
>
