[GE users] Fwd: subnode with empty slots but jobs in queue

reuti reuti at staff.uni-marburg.de
Mon Dec 6 18:04:49 GMT 2010


On 06.12.2010 at 18:47, jlforrest wrote:

> On 12/6/2010 1:38 AM, reuti wrote:
> 
>>> I have a subnode that is currently using 7 out of its 8 slots.  I
>>> have jobs waiting in the queue, but they will not start processing.
>>> Everything was working fine a couple weeks ago, and then it just
>>> stopped.
>> 
>> The load_thresholds parameter can also be set to NONE when the number
>> of slots equals the number of cores.
>> 
>> Did you define or request any memory or other resource? Is there any
>> resource quota in place?
>> 
>> Are the waiting jobs serial ones?
> 
> I have a similar problem with SGE 6.2u4. I have a node
> with 48 cores which will only run 30 jobs. Here is the
> relevant output from qconf:
> 
> ---
> hostlist              @allhosts
> seq_no                0
> load_thresholds       NONE
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi mpich orte
> rerun                 FALSE
> slots                 1,[compute-0-0.local=4],[compute-0-1.local=4], \
>                       [compute-0-2.local=4],[compute-0-3.local=4], \
>                       [compute-0-5.local=4],[compute-0-4.local=4], \
>                       [compute-0-6.local=4],[compute-0-7.local=48], \
>                       [compute-0-8.local=48]
> ---
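
As an aside, the slots value above reads as a default of 1 with per-host overrides in square brackets. A minimal sketch of unpacking it, assuming the continuation backslashes have already been joined into one line (the helper name `parse_slots` is made up for illustration and is not part of SGE):

```python
import re


def parse_slots(value):
    """Return (default, {host: slots}) from a queue 'slots' attribute value."""
    # Drop whitespace left over from joined continuation lines.
    compact = "".join(value.split())
    # Each bracketed entry is a per-host override of the form [host=N].
    overrides = {host: int(n)
                 for host, n in re.findall(r"\[([^=\]]+)=(\d+)\]", compact)}
    # Whatever precedes the first bracketed entry is the default.
    default = int(compact.split(",[", 1)[0])
    return default, overrides


default, per_host = parse_slots(
    "1,[compute-0-0.local=4],[compute-0-7.local=48],[compute-0-8.local=48]")
print(per_host.get("compute-0-7.local", default))  # 48
print(per_host.get("compute-0-9.local", default))  # 1 (falls back to default)
```

So compute-0-7 is configured for 48 slots, and any host without its own entry gets 1.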
> 
> Right now compute-0-8 is down, although qstat still shows
> some jobs for it. (Why would this happen?)

SGE assumes there is a network problem, so the jobs stay listed. You will have to use `qdel -f ...` to get rid of these jobs.


> The qstat output for compute-0-7 shows
> 
> all.q@compute-0-7.local        BIP   0/48/48        29.05    lx26-amd64

So, all 48 out of 48 seem to be used up.
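
For reference, the third column of that line is reserved/used/total slots. A minimal sketch of reading it, assuming that three-part layout (the function name `parse_slot_field` is made up for illustration):

```python
def parse_slot_field(field):
    """Split a qstat slot column such as '0/48/48' into named counts.

    Assumes the three-part reserved/used/total layout; anything else
    raises ValueError rather than guessing.
    """
    parts = field.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected resv/used/total, got {field!r}")
    reserved, used, total = (int(p) for p in parts)
    return {"reserved": reserved, "used": used, "total": total,
            "free": total - used}


slots = parse_slot_field("0/48/48")
print(slots["free"])  # 0
```

With no free slots on the queue instance, no further jobs will be dispatched there, whatever `ps` shows on the node itself.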


> and then it shows 48 serial jobs underneath! Yet ssh-ing to
> compute-0-7 and running ps clearly shows only 29 jobs running.

What is `qstat -g t -l h=compute-0-7.local -s r` showing?

-- Reuti


> All the jobs in this cluster are serial jobs. Any idea why
> I can't run 18 more jobs on compute-0-7? I restarted the
> qmaster but it didn't make any difference.
> 
> Cordially,
> 
> 
> -- 
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=302517
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=302521
