[GE users] SGE 6.5 Scheduler Query--Reference

reuti reuti at staff.uni-marburg.de
Mon Feb 23 18:35:58 GMT 2009


Hi,

On 23.02.2009 at 15:29, veerendra_n wrote:

> I will check the following as per your advice:
> 1. Whether the application can release the license
> 2. Whether I'm able to checkpoint
> I'm curious to know: if the above tasks work, how do I ensure that the

These are the prerequisites. If both are working, you can submit the
jobs requesting a so-called checkpointing environment in SGE. It will
then be triggered when a co-scheduler suspends the job under certain
conditions. The suspend action can be adjusted in the checkpointing
environment to trigger a migration of the job, i.e. to put it back
into the waiting state, which frees up a slot including the license.
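
As a rough illustration of the selection step such a co-scheduler needs, here is a minimal Python sketch. It is not part of SGE: the names `RunningJob` and `pick_jobs_to_migrate` and the 300 s default are made up for this example. Collecting the job list (e.g. by parsing `qstat` output) and actually suspending the chosen jobs are left out. The sketch only decides which running jobs have exceeded the soft limit and should make room for waiting jobs, longest-running first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical record for one running job (not an SGE data structure)."""
    job_id: int
    runtime_s: int  # seconds since the job started


def pick_jobs_to_migrate(running: List[RunningJob],
                         waiting_count: int,
                         free_slots: int,
                         soft_limit_s: int = 300) -> List[int]:
    """Return the ids of running jobs to migrate, longest-running first."""
    # Number of waiting jobs that cannot start because no slot is free.
    needed = max(0, waiting_count - free_slots)
    # Only jobs that ran longer than the soft limit are candidates.
    overdue = [job for job in running if job.runtime_s > soft_limit_s]
    overdue.sort(key=lambda job: job.runtime_s, reverse=True)
    return [job.job_id for job in overdue[:needed]]


if __name__ == "__main__":
    running = [RunningJob(101, 420), RunningJob(102, 90), RunningJob(103, 610),
               RunningJob(104, 30), RunningJob(105, 301)]
    # One job is waiting and all five slots are busy.
    print(pick_jobs_to_migrate(running, waiting_count=1, free_slots=0))
```

With one waiting job and no free slot, the example prints `[103]`, the job that has been running longest beyond the 5 minute limit. Jobs below the limit are never selected, so a waiting job simply has to wait if nobody is overdue.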

-- Reuti


> running job is suspended after 5 min and the job in the queue is given
> precedence? Can you shed some light on this?
>
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 23 February 2009 19:52
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference
>
> On 23.02.2009 at 13:19, veerendra_n wrote:
>
>> Yes, the license is monitored by FLEXlm (lmgrd).
>> We can work on the 5 min interval; I do understand it's short.
>> But what configuration do I need to make to get this working?
>
> As I wrote: check whether you can suspend your application by hand
> and trigger it to give back the license. Otherwise all endeavors are
> useless. But even if this is working, the next advanced step would be
> to checkpoint your application. This is not related to SGE. When all
> this is working outside of SGE, then we can incorporate it.
>
> -- Reuti
>
>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 23 February 2009 17:43
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference
>>
>> Hi,
>>
>> On 22.02.2009 at 06:44, veerendra_n wrote:
>>
>>> The jobs that are run are typical ASIC design jobs (layout and
>>> verification jobs), and each of these jobs requires a license. I'm
>>> not sure if I have stated the requirement clearly.
>>>
>>> 1. We will have a short queue (short.q) configured with 5 slots and
>>>    a soft time limit of 5 min.
>>> 2. If we run 5 jobs, all 5 slots of the queue are occupied. If we
>>>    now submit a 6th job and any of the running jobs has taken more
>>>    than 5 min, that job should be put on hold and the 6th job should
>>>    be executed.
>>>
>>> How can this be done automatically? As you suggested, how do I use a
>>> co-scheduler? Do you have a sample of how to implement checkpointing?
>>
>> This is far from trivial. It would be best to have someone at your
>> location look into it. 5 minutes looks like a short turnaround.
>> How long do your jobs usually run?
>>
>> The first thing to check is whether your application can be
>> triggered to go to sleep and give a license back at all (I assume
>> this is counted by something like FLEXlm or the like).
>>
>> -- Reuti
>>
>>
>>> I need some help....
>>>
>>> -----Original Message-----
>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: 22 February 2009 01:33
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] SGE 6.5 Scheduler Query--Reference : Mr.Sanjeev Patil
>>>
>>> Veerendra,
>>>
>>> On 18.02.2009 at 17:32, veerendra_n wrote:
>>>
>>>> Hi All,
>>>>
>>>> I will be very grateful if you can help me clear some of the Sun
>>>> Grid Engine 6.2 queries.
>>>>
>>>> Query:
>>>> My current query is regarding the hard run time limit and soft run
>>>> time limit set on jobs:
>>>> Let's say we create many queues. One of the queues is for 1 min
>>>> jobs (i.e. jobs that should take about 1 min to complete, short
>>>> jobs). Now let's say 5 jobs can run simultaneously and all the
>>>> slots are occupied. Ideally the jobs should take around 1 min,
>>>> but for some reason the jobs actually run longer. If a sixth
>>>> job is queued, it should get priority since it is a short job.
>>>> The way it should be done is that the oldest job is suspended, one
>>>> license is freed up, the sixth job (just submitted) runs, and the
>>>> oldest one goes back into the queue for the license. So when the
>>>> sixth job is over, the oldest job can get the license again (it is
>>>> in the queue and will be processed based on the queue).
>>>>
>>>> I tried to test the above in my lab set up and the following is
>>>> what I found :
>>>>
>>>> Hard run time setting:
>>>> I configured a queue with the hard run time set to 3 minutes and
>>>> tried to execute a job which takes more than 3 minutes.
>>>> I found that the job got killed once the 3 minute interval was
>>>> over.
>>>> (As per the Sun Grid Engine documentation, a SIGKILL signal is sent
>>>> and the job gets killed.)
>>>>
>>>> Soft run time setting:
>>>> I configured a queue with the soft run time set to 3 minutes and
>>>> the notify interval set to 60 sec and tried executing the same job.
>>>> The job got killed.
>>>> (Again as per the documentation, a SIGUSR1 signal is sent as a
>>>> warning after 3 minutes and a SIGKILL signal is sent to kill the job
>>>> once the notify interval is over.)
>>>>
>>>> But as per the problem statement, I don't want the jobs to be
>>>> killed. Instead they should be suspended, rejoin the queue, and
>>>> resume once they get a slot.
>>>
>>> This is not implemented in SGE. Once a job is allowed to start, it
>>> is supposed to run to its end. It might be suspended, but it will
>>> still occupy the granted resources. This is not only a problem of
>>> SGE, but also of your application: you would have to instruct it to
>>> release its license temporarily.
>>>
>>> You could use a co-scheduler, which would check the waiting and
>>> running jobs. When it discovers that another job should run, it has
>>> to a) put a running job on hold (to prevent its immediate restart),
>>> and b) reschedule the job. When there is no other waiting job left,
>>> the held (and rescheduled) one could be released and would restart
>>> again. When I write restart, I mean it in exactly this way: without
>>> any checkpointing, your job will always restart from the beginning.
>>>
>>> -- Reuti
>>>
>>>
>>>> How can I achieve this? Is it possible to write a script that
>>>> reschedules the job to resume rather than killing it?
>>>> I found that using the Qalter option (in the QMON GUI), I could
>>>> reschedule the job manually, but this is not a solution for systems
>>>> in a real-time environment.
>>>> Is it possible through scripting, or is there any other option in
>>>> the QMON GUI which can solve this problem?
>>>> Please help me solve this issue.
>>>>
>>>> Will be waiting for your reply.
>>>>
>>>> Veerendra

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=112833

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


