[GE users] Resource Quotas causing problems for a single user

m0zes adam.tygart at gmail.com
Sat Sep 5 02:03:49 BST 2009

Hello everyone,

I seem to be having some issues with a recent change to my cluster
setup. Previously, the cluster had one queue (batch.q), with a time
restriction set up for @somenodes. I have now modified the setup to
include batch.q, long.q, highmem.q, and long-highmem.q. I have been
attempting to restrict usage across queues using resource quotas. I
made this complex change last Monday, and it worked until 4:00 today.
The resource quotas that I have been using are:

   name         max_slots_per_user
   description  "Set the maximum number of slots a user can utilize at once"
   enabled      TRUE
   limit        users {*} to slots=700

   name         max_slots_per_host
   description  NONE
   enabled      TRUE
   limit        hosts {@titans} to slots=16
   limit        hosts {@brutes-small} to slots=4
   limit        hosts {@brutes-large} to slots=8
   limit        hosts {@scouts} to slots=8
   limit        hosts {@rogues} to slots=8
   limit        hosts {@fiends} to slots=4

   name         max_slots_per_queue
   description  NONE
   enabled      TRUE
   limit        queues batch.q to slots=1000
   limit        queues test.q to slots=1000
   limit        queues special.q to slots=1000
   limit        queues long-highmem.q to slots=600
   limit        queues highmem.q to slots=350
   limit        queues long.q to slots=250

   name         max_mem_per_host
   description  NONE
   enabled      TRUE
   limit        hosts {@titans} to memory=64G
   limit        hosts {@brutes-small} to memory=16G
   limit        hosts {@brutes-large} to memory=32G
   limit        hosts {@scouts} to memory=8G
   limit        hosts {@rogues} to memory=8G
   limit        hosts {@fiends} to memory=8G
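
For anyone who wants to read the rules programmatically, here is a minimal
sketch that splits one "limit" line from the listing above into its filter and
its resource limit. The function name parse_limit and the dictionary keys are
hypothetical, and the pattern only covers the simple "limit <kind> <target> to
<resource>=<value>" shape used in this post, not the full resource quota
syntax.

```python
import re

# Matches only the simple form used in this post:
#   limit <kind> <target> to <resource>=<value>
LIMIT_RE = re.compile(r"^limit\s+(\w+)\s+(\S+)\s+to\s+(\w+)=(\S+)$")


def parse_limit(line):
    """Split one 'limit' line into filter kind, target, resource and value."""
    match = LIMIT_RE.match(line.strip())
    if match is None:
        raise ValueError("not a simple limit line: %r" % line)
    kind, target, resource, value = match.groups()
    return {"kind": kind, "target": target, "resource": resource, "value": value}


if __name__ == "__main__":
    # Lines copied from the rule sets above.
    for text in (
        "   limit        users {*} to slots=700",
        "   limit        hosts {@titans} to memory=64G",
        "   limit        queues long.q to slots=250",
    ):
        print(parse_limit(text))
```

The values are kept as strings (for example "64G") so that nothing is assumed
about how Grid Engine interprets the unit suffixes.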

Now when user1 submits a job, the job won't get executed. qstat -j
$jobnum gives this output:

cannot run because it exceeds limit "user1/////" in rule "max_slots_per_user/1"
cannot run because it exceeds limit "user1/////" in rule "max_slots_per_user/1"
cannot run in PE "single" because it only offers 0 slots

This should be impossible, as qquota -u \* shows that user1 is not using
any of his slot quota:

resource quota rule    limit                    filter
max_slots_per_user/1   slots=4/700              users user2
max_slots_per_user/1   slots=58/700             users user3
max_slots_per_user/1   slots=16/700             users user4
max_slots_per_host/1   slots=2/16               hosts titan5
max_slots_per_host/1   slots=1/16               hosts titan8
max_slots_per_host/4   slots=3/8                hosts scout62
max_slots_per_host/4   slots=8/8                hosts scout74
max_slots_per_host/4   slots=8/8                hosts scout78
max_slots_per_host/4   slots=6/8                hosts scout70
max_slots_per_host/4   slots=4/8                hosts scout65
max_slots_per_host/4   slots=8/8                hosts scout69
max_slots_per_host/4   slots=8/8                hosts scout63
max_slots_per_host/4   slots=8/8                hosts scout77
max_slots_per_host/4   slots=6/8                hosts scout55
max_slots_per_host/4   slots=8/8                hosts scout73
max_slots_per_host/4   slots=8/8                hosts scout76
max_slots_per_queue/1  slots=78/1000            queues batch.q
max_mem_per_host/1     memory=12.000G/64.00     hosts titan5
max_mem_per_host/1     memory=6.000G/64.000     hosts titan8
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout62
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout74
max_mem_per_host/4     memory=2.000G/8.000G     hosts scout78
max_mem_per_host/4     memory=6.000G/8.000G     hosts scout70
max_mem_per_host/4     memory=4.000G/8.000G     hosts scout65
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout69
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout63
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout77
max_mem_per_host/4     memory=6.000G/8.000G     hosts scout55
max_mem_per_host/4     memory=8.000G/8.000G     hosts scout73
max_mem_per_host/4     memory=2.000G/8.000G     hosts scout76
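
As a cross-check on the listing above, here is a minimal sketch that parses
qquota-style lines and pulls out the entries for one user. The names
parse_qquota and entries_for_user are hypothetical, the sample is a few lines
copied from the output above, and the pattern assumes the
"rule/N resource=used/limit filter value" layout seen in this post rather than
every form qquota can print.

```python
import re

# A few lines copied from the qquota output above.
SAMPLE = """\
resource quota rule limit                filter
max_slots_per_user/1 slots=4/700          users user2
max_slots_per_user/1 slots=58/700         users user3
max_slots_per_user/1 slots=16/700         users user4
max_slots_per_host/1 slots=2/16           hosts titan5
max_slots_per_queue/1 slots=78/1000        queues batch.q
"""

# rule_set/rule_number  resource=used/limit  filter_kind filter_value
LINE_RE = re.compile(r"^(\S+)/(\d+)\s+(\w+)=(\S+?)/(\S+)\s+(\w+)\s+(\S+)$")


def parse_qquota(text):
    """Return one dict per usage line; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # header or blank line
        rule_set, rule_no, resource, used, limit, kind, value = match.groups()
        entries.append({
            "rule": "%s/%s" % (rule_set, rule_no),
            "resource": resource,
            "used": used,
            "limit": limit,
            "filter_kind": kind,
            "filter_value": value,
        })
    return entries


def entries_for_user(entries, user):
    """Select the entries whose filter is 'users <user>'."""
    return [e for e in entries
            if e["filter_kind"] == "users" and e["filter_value"] == user]


if __name__ == "__main__":
    parsed = parse_qquota(SAMPLE)
    print("user1:", entries_for_user(parsed, "user1"))
    print("user3:", entries_for_user(parsed, "user3"))
```

With the sample above, user1 yields an empty list, which matches what I see by
eye: there is no max_slots_per_user line for user1 at all.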

The next line of the qstat -j output is odd to me, too:
cannot run in PE "single" because it only offers 0 slots

Again, this shouldn't happen, as the PE "single" contains 10000 slots
(10 times the number of cores in the cluster).

I have tried restarting qmaster, but it didn't seem to have any effect.
I cannot restart the execd services on the nodes at the moment, as some
of them are still loaded.

Does anyone have any thoughts about this lengthy/complex setup?


