[GE users] Temporary removing jobs from the queues

dangruhn Dan.Gruhn at groupw.com
Tue Jul 7 15:18:44 BST 2009


mad wrote:
> On Jul 7, 2009, at 8:57 AM, dangruhn wrote:
>
>   
>> Margaret,
>>
>> mad wrote:
>>     
>>> I need to free up some slots on our system.  One user has submitted
>>> two jobs which are taking up all the resources.   I would like to
>>> "suspend" one of her jobs to allow use of the cluster by other users.
>>>
>>>
>>> I have tried suspend and hold through qmon.  However, the slots are
>>> still occupied.
>>>
>>>  qstat -g c
>>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>> -------------------------------------------------------------------------------
>>> all.q                             0.98     72      0     72      0      0
>>>
>>>
>>> and I cannot qlogin
>>>
>>>  qlogin
>>> Your job 13522 ("QLOGIN") has been submitted
>>> waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
>>>
>>>
>>> Your "qlogin" request could not be scheduled, try again later.
>>>
>>> I do not want to kill the job.  How can I free up some of the slots?
>>>
>>>       
>> One possibility is to either suspend or hold (I can't remember which one
>> is the best) and then restart the job.
>>     
>
> Are you using "reschedule" to restart the job?  Resume just takes the hold
> or suspend status off the job.  I am still looking at qmon.  Restarting is
> better than killing especially if the user is not currently available.
>   
Yes, reschedule from the qmon/Job Control/Running Jobs tab.
>   
>> This will put the job back in
>> pending but it won't be eligible for execution until the suspend/hold is
>> released.
>>
>> The down side is that this job will be starting over from scratch. Is
>> this okay or is that what you meant by saying you don't want to kill the
>> job?
>>     
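For anyone who prefers the command line to qmon, the hold-then-reschedule steps above might look roughly like the sketch below. This is only an illustration, not something I have run on your cluster: the function names and the DRY_RUN switch are made up for this example, it assumes an operator hold is accepted on a running job (as the qmon steps imply), and qmod -rj may be refused unless the job or its queue is marked rerunnable. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: command-line counterpart of the qmon hold + reschedule steps.
# DRY_RUN is a hypothetical safety switch; it defaults to 1, so the commands
# are printed instead of executed. Set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hold the job first so it stays pending, then reschedule it to free its slots.
hold_and_reschedule() {
    job_id="$1"
    run qhold -h o "$job_id"   # operator hold; needs operator or manager rights
    run qmod -rj "$job_id"     # reschedule the running job (it starts over)
}

# Later, release the hold so the scheduler may start the job again.
release_job() {
    run qrls -h o "$1"
}

hold_and_reschedule 13512
```

Once other users have their slots, release_job 13512 would print (or, with DRY_RUN=0, run) the matching qrls command.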
>
>
>   
>>> Also how do I hold the user's jobs waiting on the queue so that I can
>>> release them in a manner that keeps some of the slots open for other
>>> users?
>>>
>>> ----------------------------------------------------------------------------
>>> all.q@compute-0-8.local        BIP   4/4       4.00     lx26-amd64
>>>   13512 0.25000 user1_SOLVER user1        s     07/06/2009 21:08:09     4
>>> ----------------------------------------------------------------------------
>>>
>>> Although this job is "suspended", it is still running on compute-0-8
>>> and taking up four CPUs.
>>>
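On holding the user's waiting jobs and letting them out a few at a time, one possible approach is sketched below. It is untested here and rests on assumptions: it uses the -u option of qhold to place an operator hold on all of one user's jobs (a hold on a job that is already running should have no effect unless that job is rescheduled), and the helper names, the DRY_RUN switch, and the job id 13514 are invented for illustration. As written it only prints the commands.

```shell
#!/bin/sh
# Sketch only: hold all of one user's jobs, then release chosen jobs later.
# DRY_RUN is a hypothetical safety switch; it defaults to 1, so the commands
# are printed instead of executed. Set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Place an operator hold on every job owned by the given user.
hold_user_jobs() {
    run qhold -h o -u "$1"
}

# Release the operator hold on the listed job ids, one at a time.
release_jobs() {
    for job_id in "$@"; do
        run qrls -h o "$job_id"
    done
}

hold_user_jobs user1
release_jobs 13514   # 13514 is a made-up pending job id
```

Releasing only as many jobs as you want running at once leaves the remaining slots free for other users; qstat -u user1 shows which of the jobs are still held.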
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206003
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>       
>> -- 
>> Dan Gruhn
>> Group W Inc.
>> 8315 Lee Hwy, Suite 303
>> Fairfax, VA, 22031
>> PH: (703) 752-5831
>> FX: (703) 752-5851
>>

-- 
Dan Gruhn
Group W Inc.
8315 Lee Hwy, Suite 303
Fairfax, VA, 22031
PH: (703) 752-5831
FX: (703) 752-5851

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206019

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


