[GE users] Jobs running but not using resources

Hugo Hernandez-Mora hugo.hernandez at loni.ucla.edu
Thu Oct 30 21:51:53 GMT 2008



Hello all,
We have been seeing strange behavior in our cluster since last weekend.  Most of the jobs running on the cluster (we have 300+ Sun Fire V20z and 80 Sun Fire X2200 nodes, with 3,500+ available slots) are not using resources as expected.  In fact, most of them are using no resources at all (0 CPU for the associated processes).  We have set the following resource limits:

{
   name         memory_usage
   description  Limit the memory used for all users (per machine type)
   enabled      TRUE
   limit        users {*} hosts {@v20zHosts} to mem_total=7g
   limit        users {*} hosts {@x2200Hosts} to mem_total=15g
   limit        users {*} to swap_total=10g
}
{
   name         sysadm_rule
   description  Restrict user user1 to use only 50 slots in queue0.q queue
   enabled      TRUE
   limit        users {user1} queues queue0.q to slots=50
}
{
   name         max_per_queue
   description  Limit the maximum allowed cluster queue slots per user
   enabled      TRUE
   limit        users {*} queues short.q to slots=672
   limit        users {*} queues medium.q to slots=192
   limit        users {*} queues long.q to slots=111
   limit        users {*} queues special.q to slots=1810
}

The last rule set, max_per_queue, keeps any single user from taking all the available slots in a queue and so monopolizing the cluster's resources.  The total number of slots per queue is:

myhost> qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
long.q                            0.48    185      0    240     41     32
medium.q                          0.48      5     59    330    230     40
special.q                        0.57    134   1741   2190     10    325
short.q                           0.48    986      4   1140     24    142
queue0.q                          3.14    185      0    185    185      0
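
To make the relation between the per-user limits and the queue totals easier to read, here is a minimal Python sketch that parses the two listings above and prints, for each queue, what share of the total slots one user may hold. The helper names (parse_limits, parse_qstat) are made up for this example and are not part of Grid Engine. It also prints the aoACDS and cdsuE counts which, as far as I understand the qstat output, count slots in queue instances that are in one of those states; treat that reading as an assumption.

```python
import re

# Per-user slot limits copied from the max_per_queue rule set in the post.
RQS_TEXT = """\
limit        users {*} queues short.q to slots=672
limit        users {*} queues medium.q to slots=192
limit        users {*} queues long.q to slots=111
limit        users {*} queues special.q to slots=1810
"""

# `qstat -g c` output copied from the post.
QSTAT_TEXT = """\
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
long.q                            0.48    185      0    240     41     32
medium.q                          0.48      5     59    330    230     40
special.q                        0.57    134   1741   2190     10    325
short.q                           0.48    986      4   1140     24    142
queue0.q                          3.14    185      0    185    185      0
"""


def parse_limits(text):
    """Hypothetical helper: return {queue: slots} from 'queues Q to slots=N' lines."""
    pattern = re.compile(r"queues\s+(\S+)\s+to\s+slots=(\d+)")
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


def parse_qstat(text):
    """Hypothetical helper: return {queue: {column: value}} from the summary table."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly 7 fields; the header has 8, the dashed rule has 1.
        if len(fields) != 7:
            continue
        name, load, used, avail, total, ao, cd = fields
        rows[name] = {
            "CQLOAD": float(load),
            "USED": int(used),
            "AVAIL": int(avail),
            "TOTAL": int(total),
            "aoACDS": int(ao),
            "cdsuE": int(cd),
        }
    return rows


limits = parse_limits(RQS_TEXT)
queues = parse_qstat(QSTAT_TEXT)

# Share of each queue's total slots that a single user is allowed to occupy.
share = {q: limits[q] / queues[q]["TOTAL"] for q in limits if q in queues}

for name, row in queues.items():
    limit = limits.get(name)
    limit_txt = "no per-user limit in max_per_queue" if limit is None else (
        f"per-user limit {limit} ({share[name]:.0%} of total)")
    print(f"{name:10s} used={row['USED']:5d} total={row['TOTAL']:5d} "
          f"aoACDS={row['aoACDS']:4d} cdsuE={row['cdsuE']:4d}  {limit_txt}")
```

On a live cluster the same parsing could be fed from the real command output instead of the pasted text, but the numbers here are only the ones shown in this message.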

We have not made any changes to our configuration.  Has anyone experienced a similar problem, or can you give me some hints about what to check?  Any help will be greatly appreciated.
Thanks in advance,

-Hugo

--
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.hernandez at loni.ucla.edu
--

"If your efforts were met with indifference, do not be discouraged,
for the sun puts on a marvelous show every morning
while most people are still asleep"



