[GE users] load/suspend Thresholds problem

Rene Salmon rsalmon at tulane.edu
Wed Sep 22 18:40:42 BST 2004


Hi,

I am running SGE 6.0u1 on AMD64.  Here is the setup:
I have two dual-processor machines, compute-0-0 and crash;
both are execution hosts.
I have two cluster queues, "all.q" and "qtest",
each with a queue instance on "compute-0-0.local" and on "crash.local".

This is what it looks like:
all.q@compute-0-0.local
all.q@crash.local
qtest@compute-0-0.local
qtest@crash.local


Both the cluster queues and the queue instances have

load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE


But for some reason the hosts do not stop accepting jobs once the threshold
is reached. 


The load on each host is about 3.64 and each host is running
about 3 jobs, but they still keep accepting more jobs.  The hosts do
not start rejecting jobs even after the system load is above 1.75.

>qstat

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     25 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 all.q@compute-0-0.local            1
     26 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 all.q@compute-0-0.local            1
     24 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 all.q@crash.local                  1
     27 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 all.q@crash.local                  1
     29 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 qtest@compute-0-0.local            1
     28 0.56000 MyJob      rsalmon      r     09/22/2004 12:08:37 qtest@crash.local                  1
     30 0.56000 MyJob      rsalmon      r     09/22/2004 12:10:52 qtest@crash.local                  1


>qhost

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-0-0             lx24-amd64      2  1.52    1.9G  174.1M  996.2M     0.0
crash                   lx24-amd64      2  1.69    1.9G  262.8M    5.3G     0.0


qhost reports a load of about 1.69  but the actual load on the system 
is 3.64 (from uptime).  Any ideas?
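
One thing that may be worth checking is the arithmetic behind the threshold. The sketch below assumes that np_load_avg is the load average divided by the number of processors, that the LOAD column of qhost is the raw (not normalized) load average, and that the threshold counts as exceeded when np_load_avg is greater than or equal to the configured value. The function names are hypothetical and only illustrate the calculation; they are not part of SGE.

```python
# Hypothetical helpers illustrating how a normalized load threshold
# would be evaluated. Assumption: np_load_avg = load_avg / processors.

def np_load_avg(load_avg: float, ncpu: int) -> float:
    """Return the load average normalized by processor count."""
    return load_avg / ncpu


def exceeds_threshold(load_avg: float, ncpu: int, threshold: float) -> bool:
    """True if the normalized load reaches the threshold (assumed '>=')."""
    return np_load_avg(load_avg, ncpu) >= threshold


if __name__ == "__main__":
    THRESHOLD = 1.75  # load_thresholds np_load_avg=1.75 from the queue config
    NCPU = 2          # both hosts are dual-processor machines

    # Values taken from the post: qhost LOAD for crash, and uptime.
    for source, load in [("qhost", 1.69), ("uptime", 3.64)]:
        normalized = np_load_avg(load, NCPU)
        state = "exceeded" if exceeds_threshold(load, NCPU, THRESHOLD) else "not exceeded"
        print(f"{source}: load_avg={load} np_load_avg={normalized:.3f} threshold {state}")
```

Under these assumptions, the value that qhost reports stays well below the threshold on a two-processor host, while the value from uptime would be above it, so the gap between the two readings matters more than the threshold setting itself.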


This was working fine when I only had one cluster queue, "all.q".
The problem started after I added the second cluster queue, "qtest".

Thank you for any help
Rene




