[GE users] Temporarily removing jobs from the queues

mad margaret_Doll at brown.edu
Tue Jul 7 14:11:21 BST 2009


On Jul 7, 2009, at 8:57 AM, dangruhn wrote:

> Margaret,
>
> mad wrote:
>> I need to free up some slots on our system.  One user has submitted
>> two jobs which are taking up all the resources.   I would like to
>> "suspend" one of her jobs to allow use of the cluster by other users.
>>
>>
>> I have tried suspend and hold through qmon.  However, the slots are
>> still occupied.
>>
>>  qstat -g c
>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>> -------------------------------------------------------------------------------
>> all.q                             0.98     72      0     72      0      0
>>
>>
>> and I cannot qlogin
>>
>>  qlogin
>> Your job 13522 ("QLOGIN") has been submitted
>> waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
>>
>>
>> Your "qlogin" request could not be scheduled, try again later.
>>
>> I do not want to kill the job.  How can I free up some of the slots?
>>
> One possibility is to either suspend or hold (I can't remember which one
> is the best) and then restart the job.

Are you using "reschedule" to restart the job?  Resume just takes the hold
or suspend status off the job.  I am still looking at qmon.  Restarting is
better than killing, especially if the user is not currently available.
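
For reference, the three operations being compared here map onto qmod
roughly as sketched below. This is a sketch under assumptions, not a tested
recipe for this cluster: the helper names (run, suspend_job, resume_job,
reschedule_job) and the DRY_RUN switch are made up for illustration, and by
default the script only prints the commands instead of running them. As far
as I read qmod(1), rescheduling needs manager rights and only works if the
job or its queue is flagged as rerunnable.

```shell
#!/bin/sh
# Sketch only: prints the Grid Engine commands by default.
# Set DRY_RUN=0 to actually execute them (hypothetical switch, not part of SGE).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run: show the command that would be executed
    else
        "$@"           # real run: execute the command as given
    fi
}

# Suspend: stops the job's processes, but the job stays on its host
# and keeps its slots.
suspend_job()    { run qmod -sj "$1"; }

# Resume (unsuspend): only clears the suspend state; nothing is restarted.
resume_job()     { run qmod -usj "$1"; }

# Reschedule: sends the running job back to pending, which frees its slots.
# The job starts over from scratch unless it does its own checkpointing.
reschedule_job() { run qmod -rj "$1"; }
```

With the job ID from the qstat output in this thread, "suspend_job 13512"
prints "qmod -sj 13512" in dry-run mode. After a real reschedule,
"qstat -g c" should show a non-zero AVAIL count if the slots were released.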

> This will put the job back in pending, but it won't be eligible for
> execution until the suspend/hold is released.
>
> The downside is that this job will be starting over from scratch. Is
> this okay, or is that what you meant by saying you don't want to kill the
> job?


>> Also how do I hold the user's jobs waiting on the queue so that I can
>> release them in a manner that keeps some of the slots open for other
>> users?
>>
>> ----------------------------------------------------------------------------
>> all.q@compute-0-8.local        BIP   4/4       4.00     lx26-amd64
>>   13512 0.25000 user1_SOLVER user1        s     07/06/2009 21:08:09     4
>> ----------------------------------------------------------------------------
>>
>> Although this job is "suspended", it is still running on compute-0-8
>> and taking up four CPUs.
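
On the question of holding the user's waiting jobs: the "s" state above
only means the processes are stopped, so the four slots stay allocated. A
hold, by contrast, affects scheduling of pending jobs. A minimal sketch of
holding everything a user has queued and then releasing jobs one at a time
follows. The helper names and the DRY_RUN switch are hypothetical, and the
choice of an operator hold (-h o) is an assumption: it is meant to keep the
user from releasing the hold herself, and it requires operator or manager
rights.

```shell
#!/bin/sh
# Sketch only: prints the Grid Engine commands by default.
# Set DRY_RUN=0 to actually execute them (hypothetical switch, not part of SGE).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run: show the command that would be executed
    else
        "$@"           # real run: execute the command as given
    fi
}

# Place an operator hold on all jobs belonging to one user.
# Pending jobs with a hold are not eligible for scheduling.
hold_user_jobs() { run qhold -h o -u "$1"; }

# Remove the operator hold from a single job so it can be scheduled again.
release_job()    { run qrls -h o "$1"; }
```

For example, "hold_user_jobs user1" prints "qhold -h o -u user1" in dry-run
mode. Releasing held jobs one by one with release_job lets you decide how
many slots the user takes up at a time.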
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206003
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>
> -- 
> Dan Gruhn
> Group W Inc.
> 8315 Lee Hwy, Suite 303
> Fairfax, VA, 22031
> PH: (703) 752-5831
> FX: (703) 752-5851
>



