[GE users] Problems with PEs and resource quotas

mdsteeves mdsteeves at gmail.com
Mon Dec 13 20:27:05 GMT 2010

We're running SGE 6.2u4 on RHEL5.4.

We've set up Olesen's FLEXlm license integration to help users run jobs 
on the cluster that require FLEXlm licenses. We would also like to set 
up a resource quota so that no single user can lock up all of the 
licenses when launching jobs:

    name         moe_limit
    description  limit everyone to no more than 20 moe license
    enabled      TRUE
    limit        users {*} to moe=20
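
For reference, the rule above can be written out in the brace-delimited file form that `qconf -Arqs` reads. This is only a sketch: the helper name `print_rqs` is hypothetical, and quoting the description follows the style of the sge_resource_quota(5) examples rather than anything taken from our configuration. Note that `users {*}` applies the limit to each user separately, whereas `users *` would cap all users combined.

```shell
# Hypothetical helper: print a per-user resource quota set for one consumable.
# $1 = rule set name, $2 = complex (license) name, $3 = per-user limit
print_rqs() {
    printf '{\n'
    printf '   name         %s\n' "$1"
    printf '   description  "limit everyone to no more than %s %s licenses"\n' "$3" "$2"
    printf '   enabled      TRUE\n'
    # {*} expands to every user, so the limit applies to each user separately
    printf '   limit        users {*} to %s=%s\n' "$2" "$3"
    printf '}\n'
}

print_rqs moe_limit moe 20
# On a live cluster the output could be saved to a file and loaded with:
#   qconf -Arqs <file>
```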

For some reason, though, jobs that use a PE and also request certain 
resources with the "-l" switch get stuck in the qw state, and the 
scheduling message references the resource quota:

scheduling info:            queue instance "mpi.q@compute-1-25.local" dropped because it is disabled
                            queue instance "himem.q@compute-0-11.local" dropped because it is disabled
                            queue instance "mpi.q@compute-1-26.local" dropped because it is full
                            cannot run in queue "himem.q" because it is not contained in its hard queue list (-q)
                            cannot run because it exceeds limit "steevmi1/////" in rule "moe_limit/1"
                            cannot run in PE "orte" because it only offers 0 slots
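
When the scheduling info is long, the quota-related lines can be picked out with a small filter. The sketch below assumes each message arrives on a single line, as `qstat -j` normally prints it; the function name `rqs_rules_hit` and the job id in the usage comment are hypothetical.

```shell
# Hypothetical helper: read qstat -j style text on stdin and print each
# quota rule named in an "exceeds limit" message, followed by the limit key.
rqs_rules_hit() {
    sed -n 's/.*exceeds limit "\([^"]*\)" in rule "\([^"]*\)".*/\2 \1/p' | sort -u
}

# Example using the message from this post:
printf '%s\n' \
    'cannot run because it exceeds limit "steevmi1/////" in rule "moe_limit/1"' \
    'cannot run in PE "orte" because it only offers 0 slots' | rqs_rules_hit

# Possible use on a live cluster (12345 is a placeholder job id):
#   qstat -j 12345 | rqs_rules_hit
```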

For testing, I've been using the following script:


#$ -S /bin/ksh
#$ -j y
#$ -cwd
#$ -q mpi.q
#$ -pe orte 8
#$ -N mdsTest
##  The following all work:
##  #$ -l h_cpu=1
##  #$ -l mem_total=5G
##  #$ -l arch=lx26-amd64
##  #$ -l moe=1
##  None of the following work; each causes the job to hang in the qw state:
##  #$ -l q=mpi.q
##  #$ -l hostname="compute-0-2"
##  #$ -l 

sleep 300
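
To try each "-l" request in isolation without editing the script every time, the request can be passed on the qsub command line instead. The sketch below only prints the commands rather than submitting anything; `print_test_matrix` and the file name `mdsTest.sh` are hypothetical, and it assumes the script keeps its embedded `-q` and `-pe` directives.

```shell
# Hypothetical helper: print (not run) one qsub command per resource request.
# $1 = job script; remaining arguments = resource requests for -l
print_test_matrix() {
    script=$1
    shift
    for req in "$@"; do
        printf 'qsub -l %s %s\n' "$req" "$script"
    done
}

# One request that works and two that leave the job in qw:
print_test_matrix mdsTest.sh moe=1 q=mpi.q hostname=compute-0-2
```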

Even switching from "-q mpi.q" to "-masterq mpi.q" doesn't help any. If 
we disable the resource quota rule, then the jobs run without any 
problems. Is there something that we're missing?
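
For anyone reproducing this, the rule can be switched off by editing its `enabled` line. The following is a minimal sketch of doing that non-interactively, assuming the rule set text has the same layout as shown earlier; `disable_rqs` and `rqs.tmp` are hypothetical names.

```shell
# Hypothetical helper: flip "enabled TRUE" to "enabled FALSE" in rule-set text
# read from stdin, leaving every other line untouched.
disable_rqs() {
    sed 's/^\([[:space:]]*enabled[[:space:]]\{1,\}\)TRUE[[:space:]]*$/\1FALSE/'
}

# Possible use on a live cluster (rqs.tmp is a placeholder file name):
#   qconf -srqs moe_limit | disable_rqs > rqs.tmp
#   qconf -Mrqs rqs.tmp moe_limit
```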

Michael Steeves (mdsteeves at gmail.com)


