[GE users] Load average problem again (addition)

Anand S Bisen vmlinuz at abisen.com
Sun Aug 15 03:01:21 BST 2004

This is my regular queue that I am using for my application.

qname                arjun001.normal.q
hostname             arjun001.xyz.xyz
seq_no               6800
load_thresholds      np_load_avg=24.75
suspend_thresholds   NONE
nsuspend             1
suspend_interval     00:05:00
priority             0
min_cpu_interval     00:05:00
processors           UNDEFINED
qtype                BATCH INTERACTIVE PARALLEL 
rerun                FALSE
slots                50
tmpdir               /tmp
shell                /bin/csh
shell_start_mode     NONE
prolog               NONE
epilog               NONE
starter_method       NONE
suspend_method       NONE
resume_method        NONE
terminate_method     NONE
notify               00:00:60
owner_list           NONE
user_lists           NONE
xuser_lists          NONE
subordinate_list     NONE
complex_list         xyzcluster
complex_values       xyzclustername=arjun
projects             NONE
xprojects            NONE
calendar             NONE
initial_state        default
fshare               0
oticket              0
s_rt                 INFINITY
h_rt                 INFINITY
s_cpu                INFINITY
h_cpu                INFINITY
s_fsize              INFINITY
h_fsize              INFINITY
s_data               INFINITY
h_data               INFINITY
s_stack              INFINITY
h_stack              INFINITY
s_core               INFINITY
h_core               INFINITY
s_rss                INFINITY
h_rss                INFINITY
s_vmem               INFINITY
h_vmem               INFINITY 

-----Original Message-----
From: Anand S Bisen [mailto:vmlinuz at abisen.com] 
Sent: Saturday, August 14, 2004 8:59 PM
To: users at gridengine.sunsource.net
Subject: [GE users] Load average problem again 


I have a problem setting up SGE properly. We run SGEE_5.3 on a 40 node
cluster of dual Pentium 4 Xeon machines. The cluster runs bioinformatics
applications built from perl scripts that call each other and wait for each
other to finish. Hence at any given time many of the executing scripts are
actually just waiting, and this inflates the load average artificially. If I
raise the load_thresholds value on my queues, the load rises to that value
and is then limited only by the total number of slots. So on a dual
processor machine, if I bump the number of slots up to 50, the load of the
whole system goes up to 50, yet the system stays very responsive and the
CPUs are only 50% used, 50% idle. On my linux 2.4.19x based boxes
np_load_avg does not seem to be the right parameter for measuring the load.
How should I set up my queues for this particular application? What should
the number of slots and the load threshold be?
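
As a rough illustration of the numbers above, here is a minimal Python sketch of how a normalized load value relates to the raw load average on a dual-CPU node. It assumes np_load_avg is the raw load average divided by the processor count and that the 5-minute figure is the one used; the function names and the sample /proc/loadavg line are made up for the example and are not SGE code.

```python
def np_load_avg(loadavg_line: str, num_procs: int, field: int = 1) -> float:
    """Return one /proc/loadavg figure divided by the processor count.

    field selects the 1-, 5- or 15-minute average (index 0, 1 or 2).
    Using the 5-minute value by default is an assumption of this sketch.
    """
    raw = float(loadavg_line.split()[field])
    return raw / num_procs


def accepts_new_jobs(loadavg_line: str, num_procs: int, threshold: float) -> bool:
    """True while the normalized load does not exceed the threshold."""
    return np_load_avg(loadavg_line, num_procs) <= threshold


# Made-up sample in /proc/loadavg format, not a measurement from the cluster.
sample = "49.20 48.75 47.10 51/312 20417"

print(np_load_avg(sample, 2))              # 24.375
print(accepts_new_jobs(sample, 2, 24.75))  # True: threshold from the queue above
print(accepts_new_jobs(sample, 2, 1.75))   # False: threshold from the older thread
```

With two processors, a threshold of np_load_avg=24.75 corresponds to a raw load of 49.5, which fits the observation that the load climbs to about 50 once 50 slots are available.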



-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Tuesday, August 10, 2004 5:31 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Calculation of load average accurately


In my opinion, the load_threshold is most useful on an SMP machine with e.g.
64 CPUs where you know that not all parallel programs are running in
parallel all the time. Then you could create one queue with about 72 slots
and set the load_threshold to 64.

When you have dual machines, the setup with one queue and two slots is just
okay, and you could delete the entry for load_threshold from the queue
definition. If you want to have more than one queue (for reasons of
organization of the setup) and limit the total number of jobs on one machine
to the number of CPUs (i.e. 2), you could create a complex cpu_slots and set
it for all nodes to two.

#name            shortcut   type   value           relop requestable consumable
cpu_slots        cu         INT    0               <=    NO          YES

##--- # starts a comment but comments are not saved across edits

For each node:

complex_values             cpu_slots=2

This way, there will always be a limit of two jobs on each machine. I hope
this is what you want to achieve.
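
To make the effect of such a consumable concrete, here is a small Python sketch of the bookkeeping. It is only a model of the idea, not how SGE implements it: the Host class, the queue names and the assumption that every job consumes exactly one cpu_slots unit are all hypothetical.

```python
class Host:
    """Models one consumable (cpu_slots) attached to an execution host."""

    def __init__(self, name: str, cpu_slots: int) -> None:
        self.name = name
        self.free = cpu_slots   # mirrors complex_values cpu_slots=2
        self.running = []       # queue names of the jobs currently running

    def try_start(self, queue: str, per_job: int = 1) -> bool:
        """Start a job from `queue` if enough cpu_slots remain on the host."""
        if per_job > self.free:
            return False
        self.free -= per_job
        self.running.append(queue)
        return True

    def finish(self, queue: str, per_job: int = 1) -> None:
        """Release the cpu_slots held by a finished job from `queue`."""
        self.running.remove(queue)
        self.free += per_job


# Three hypothetical queues on the same dual-CPU node compete for two units.
node = Host("arjun001", cpu_slots=2)
results = [node.try_start(q) for q in ("normal.q", "short.q", "long.q")]
print(results)  # [True, True, False]
```

The point is that the limit is counted per host, so it holds no matter how many queues or slots are defined on that machine.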

Cheers - Reuti

>What should be the correct way to define the load average in the sun 
>grid engine 5.3ee. Currently on my cluster that consists of 64 node all 
>Pentium 4 3.2 GHz processors we are using np_load_average as the method 
>for load formula and the threshold that is set as of now is 1.75.
>what should be the load formula (np_load_average) what should be the 
>adjustment ?? 0.50 load threshold np_load_Average 1.75 and new jobs are 
>not submitted to the queue if the np_load_Average is > 1.75 on any of the
>where as if i log on my compute nodes i see that the nodes are very 
>the cpu's are mostly idle since the jobs only starts and use 10-20% of 
>each CPU. And when i locally execute programs to create artificial load 
>the load average goes to 5 and even 7 and that is when i see my node a
>little busy.

BTW: The load adjustment creates artificial load, so that the load average
is higher immediately after a job starts; this keeps another job from being
scheduled to the same machine right away. The adjustment decays over time
(until the load average reflects the actual usage of the machine), and you
set this up in the scheduler configuration. It could also be removed with
the above setup:

job_load_adjustments       NONE
load_adjustment_decay_time 0:0:00
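
For illustration, the decaying adjustment can be sketched in Python as follows. This assumes the adjustment shrinks linearly from its full value at job start to zero at the end of the decay time; the function name is made up, the 0.50 adjustment is the value quoted below, and the 450 second decay time is only an example value.

```python
def adjusted_load(measured: float, adjustment: float,
                  job_age_s: float, decay_s: float) -> float:
    """Measured load plus the part of the adjustment that has not decayed.

    Linear decay is an assumption of this sketch. A decay time of zero
    means no adjustment is applied at all.
    """
    if decay_s <= 0 or job_age_s >= decay_s:
        return measured
    remaining = 1.0 - max(job_age_s, 0.0) / decay_s
    return measured + adjustment * remaining


# A node reporting np_load_avg 0.20, one job just started, adjustment 0.50.
for age in (0, 225, 450):
    print(age, round(adjusted_load(0.20, 0.50, age, 450), 3))
# 0 0.7
# 225 0.45
# 450 0.2
```

With the adjustment removed (decay time 0), the scheduler would see only the measured load, which is fine when a consumable already caps the number of jobs per host.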

>Another thing that i noticed after which i saw the under utilization of 
>my cluster is that once i do a channel bonding (that is teaming up two 
>NIC cards to act as one) the load average on my linux boxes jumped to 
>1.0 1.0 1.0 as minimum when there is no processes running and i see the 
>cpu's as 100% free. But this affected the number of jobs that were 
>being submitted to the node because sun grid engine thought that the node 
>is already loaded. So my question is is there any other way to evaluate 
>the load on a node or how should i go about setting a right threshold for 
>a dual Pentium IV (3.2 GHz) what is set to 1.75 right now.

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

