[GE users] Jobs running but not using resources

Hugo Hernandez-Mora
Thu Oct 30 21:51:53 GMT 2008

Hello all,
We have been seeing strange behavior on our cluster since last weekend.  Most of the jobs running on the cluster (we have 300+ Sun Fire V20z and 80 Sun Fire X2200 nodes, with 3,500+ available slots) are not using resources as expected: the processes belonging to most of them show 0 CPU usage.  We have set the following resource limits:

   name         memory_usage
   description  Limit the memory used for all users (per machine type)
   enabled      TRUE
   limit        users {*} hosts {@v20zHosts} to mem_total=7g
   limit        users {*} hosts {@x2200Hosts} to mem_total=15g
   limit        users {*} to swap_total=10g

   name         sysadm_rule
   description  Restrict user user1 to use only 50 slots in queue0.q queue
   enabled      TRUE
   limit        users {user1} queues queue0.q to slots=50

   name         max_per_queue
   description  Limit the maximum allowed cluster queue slots per user
   enabled      TRUE
   limit        users {*} queues short.q to slots=672
   limit        users {*} queues medium.q to slots=192
   limit        users {*} queues long.q to slots=111
   limit        users {*} queues special.q to slots=1810
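
To make the rule sets above easier to inspect, here is a minimal Python sketch that groups the lines into one record per rule set and pulls out the per-user slot limits. It assumes the text is in the name/description/enabled/limit layout pasted above (braces around each set, if present, are ignored), and that `users {*}` in braces means the limit applies to each user separately. The function names `parse_rule_sets`, `parse_limit` and `per_user_slot_limits` are made up for this illustration and are not part of Grid Engine.

```python
# Rule sets copied from the post; everything else here is illustrative.
RQS_TEXT = """\
name         memory_usage
description  Limit the memory used for all users (per machine type)
enabled      TRUE
limit        users {*} hosts {@v20zHosts} to mem_total=7g
limit        users {*} hosts {@x2200Hosts} to mem_total=15g
limit        users {*} to swap_total=10g
name         sysadm_rule
description  Restrict user user1 to use only 50 slots in queue0.q queue
enabled      TRUE
limit        users {user1} queues queue0.q to slots=50
name         max_per_queue
description  Limit the maximum allowed cluster queue slots per user
enabled      TRUE
limit        users {*} queues short.q to slots=672
limit        users {*} queues medium.q to slots=192
limit        users {*} queues long.q to slots=111
limit        users {*} queues special.q to slots=1810
"""


def parse_limit(value):
    """Split 'users {*} queues short.q to slots=672' into filters and resource."""
    scope, _, resource = value.rpartition(" to ")
    tokens = scope.split()
    # Filters come as keyword/value pairs: users ..., hosts ..., queues ...
    filters = dict(zip(tokens[0::2], tokens[1::2]))
    res_name, _, res_value = resource.partition("=")
    return {"filters": filters, "resource": res_name, "value": res_value}


def parse_rule_sets(text):
    """Return one dict per rule set; a 'name' line starts a new set."""
    rule_sets = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line in ("{", "}"):
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "name":
            rule_sets.append({"name": value, "limits": []})
        elif key == "limit":
            rule_sets[-1]["limits"].append(parse_limit(value))
        else:
            rule_sets[-1][key] = value
    return rule_sets


def per_user_slot_limits(rule_sets):
    """Collect 'users {*} queues Q to slots=N' limits from enabled rule sets."""
    limits = {}
    for rs in rule_sets:
        if rs.get("enabled") != "TRUE":
            continue
        for lim in rs["limits"]:
            f = lim["filters"]
            if lim["resource"] == "slots" and f.get("users") == "{*}" and "queues" in f:
                limits[f["queues"]] = int(lim["value"])
    return limits


rule_sets = parse_rule_sets(RQS_TEXT)
for rs in rule_sets:
    print(rs["name"], len(rs["limits"]), "limit(s)")
print(per_user_slot_limits(rule_sets))
```

The sketch only reads the text; it does not model how the scheduler evaluates the rules.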

The last rule set, max_per_queue, keeps any single user from occupying all the available slots in a queue and so monopolizing the resources of the cluster.   The total number of slots per queue is:

myhost> qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
long.q                            0.48    185      0    240     41     32
medium.q                          0.48      5     59    330    230     40
special.q                        0.57    134   1741   2190     10    325
short.q                           0.48    986      4   1140     24    142
queue0.q                          3.14    185      0    185    185      0
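
As a rough check of how the per-user limits relate to these totals, the following Python sketch computes what fraction of each queue a single user may occupy. It assumes the numeric columns are CQLOAD, USED, AVAIL, TOTAL, aoACDS, cdsuE in that order (the layout `qstat -g c` printed before the reserved-slots column was added), so TOTAL is the fourth number on each line. `queue_totals` and `user_share` are hypothetical helper names used only here.

```python
# Summary lines copied from the post.
QSTAT_G_C = """\
long.q                            0.48    185      0    240     41     32
medium.q                          0.48      5     59    330    230     40
special.q                        0.57    134   1741   2190     10    325
short.q                           0.48    986      4   1140     24    142
queue0.q                          3.14    185      0    185    185      0
"""

# Per-user slot limits from the max_per_queue rule set.
PER_USER_LIMIT = {"short.q": 672, "medium.q": 192, "long.q": 111, "special.q": 1810}


def queue_totals(text):
    """Map queue name -> TOTAL slots (assumed to be the fourth numeric column)."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines, headers or separators
        try:
            totals[fields[0]] = int(fields[4])
        except ValueError:
            continue  # not a data line
    return totals


def user_share(limits, totals):
    """Fraction of each queue's total slots that one user is allowed to hold."""
    return {q: limits[q] / totals[q] for q in limits if q in totals}


totals = queue_totals(QSTAT_G_C)
shares = user_share(PER_USER_LIMIT, totals)
for queue, share in sorted(shares.items()):
    print(f"{queue:10s} limit {PER_USER_LIMIT[queue]:5d} of {totals[queue]:5d} slots ({share:.1%})")
```

queue0.q is left out because it has no `users {*}` slot limit; only user1 is restricted there.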

We have not made any changes to our configuration.  Has anyone experienced a similar problem, or can you give me some hints about what to check?  Any help will be greatly appreciated.
