[GE users] SGE 6.5 Scheduler Query--Reference : Mr.Sanjeev Patil

reuti reuti at staff.uni-marburg.de
Sat Feb 21 20:02:58 GMT 2009


Am 18.02.2009 um 17:32 schrieb veerendra_n:

> Hi All,
> I would be very grateful if you could help me with some questions
> about Sun Grid Engine 6.2.
> Query:
> My current query is regarding the hard run time limit and soft run
> time limit set on jobs:
>       Let's say we create many queues. One of the queues is for 1 min
> jobs (i.e. jobs that should take about 1 min to complete, short
> jobs). Now let's say 5 jobs can run simultaneously and all the
> slots are occupied. Ideally the jobs should have taken around 1 min,
> but for some reason they are actually running longer. If a sixth
> job is queued, it should get priority since it is a short job.
> The way it should be done is that the oldest job is suspended, one
> license is freed up, the sixth job (just submitted) runs, and the
> oldest one goes back in the queue for the license. So when the sixth
> job is over, the oldest job can get the license again (it is in
> the queue and will be processed in queue order).
> I tried to test the above in my lab setup and the following is
> what I found:
> Hard Run Time setting:
>      I configured a queue with Hard Run Time set to 3 minutes and
> tried to execute a job which takes more than 3 minutes.
> I found that the job got killed once the 3 minute interval had
> elapsed.
> (As per the Sun Grid Engine documentation, a SIGKILL signal is sent
> and the job gets killed.)
> Soft Run Time Setting:
>     I configured a queue with Soft Run Time set to 3 minutes and
> Notify Interval to 60 sec and tried executing the same job.
> The job got killed.
> (Again as per the documentation, a SIGUSR1 signal is sent as a
> warning after 3 minutes and a SIGKILL signal is sent to kill the job
> once the Notify Interval is over.)
> But as per the problem statement, I don't want the jobs to be
> killed. They should be suspended, rejoin the queue, and resume
> once they get a slot.

This is not implemented in SGE. Once a job was allowed to start, it  
is supposed to run up to its end. It might be suspended, but it will  
still occupy the granted resources. This is not only a problem of  
SGE, but also of your application: you would have to instruct it to  
release its license temporarily.

You could use a co-scheduler, which would check the waiting and  
running jobs. When it discovers that another job should run, it has  
to a) put a running job on hold (to prevent its immediate restart),  
and b) reschedule the job. When there is no waiting job left, the  
held (and rescheduled) one could be released and would restart.  
When I write restart, I mean it in exactly this way: without  
any checkpointing, your job will always restart from the beginning.
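
A minimal sketch of the decision logic such a co-scheduler might use is
given here, in Python. It only builds the command lines and does not
poll the cluster itself. The function names and the way the job lists
are obtained are assumptions for illustration. qhold, qmod -rj and qrls
are the standard SGE commands, but rescheduling with qmod -rj normally
needs manager privileges and a job that is marked rerunnable, so treat
this as a starting point rather than a tested setup.

```python
import subprocess


def plan_preemption(running, pending):
    """Commands to free one slot when a job is waiting.

    running: IDs of jobs running in the short queue, oldest first.
    pending: IDs of jobs waiting for a slot, NOT counting jobs that
             this co-scheduler has already put on hold.
    Both lists are assumed to be collected elsewhere (e.g. from qstat).
    """
    if not pending or not running:
        return []
    victim = str(running[0])  # oldest running job
    return [
        ["qhold", victim],        # a) hold it so it is not restarted at once
        ["qmod", "-rj", victim],  # b) reschedule it, which frees its slot
    ]


def plan_release(held, pending):
    """Commands to release the held jobs once nothing else is waiting."""
    if pending:
        return []
    return [["qrls", str(job)] for job in held]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical job IDs: 101 is the oldest running job, 200 is waiting.
    run(plan_preemption([101, 102, 103], [200]))
    run(plan_release([101], []))
```

Run as is, the script only prints the qhold, qmod and qrls lines for the
hypothetical job IDs. A real co-scheduler would call these functions
from a polling loop and remember which jobs it has put on hold.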

-- Reuti

> How can I achieve this? Is it possible to write a script that
> reschedules the job to resume rather than killing it?
> I found that using the Qalter option (in the QMON GUI), I could
> reschedule the job manually, but this is not a solution for
> production systems.
> Is it possible through scripting, or is there any other option in
> the QMON GUI which can solve this problem?
> Please help me solve this issue.
> I will be waiting for your reply.
> Veerendra
> Yashasvi Information Solutions Pvt.Ltd
> #418 , 17th Main , 10th Cross
> JP Nagar , 2nd Phase
> Bangalore - 560 078
> Mobile : +91-9972520661
> Email    :  veerendra at yashasvi.co.in

