[GE users] Load average problem again

Anand S Bisen vmlinuz at abisen.com
Sun Aug 15 02:59:08 BST 2004


Hello,

I have a problem setting up SGE properly. We run SGEE 5.3 on our 40-node
cluster of dual Pentium 4 Xeon machines. The cluster runs bioinformatics
applications built from Perl scripts that call each other and wait for each
other to finish. Hence, at any given time, many running scripts are actually
just waiting, and this increases the load average artificially.

If I increase the load_threshold on my queues, the load rises to that
threshold, and beyond that it is limited by the total number of slots. So if
I bump the number of slots on a dual-processor machine up to 50, the load of
the whole system goes up to 50, yet the system stays very responsive and the
CPUs are only 50% used and 50% idle.

It seems that on my Linux 2.4.19x-based boxes np_load_average is not the
right parameter for measuring the load. How should I set up my queues for
this particular application? What should the number of slots and the load
threshold be?

Thanks

Anand


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Tuesday, August 10, 2004 5:31 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Calculation of load average accurately

Hi,

In my opinion, load_threshold is most useful on an SMP machine with, e.g.,
64 CPUs, where you know that not all parallel programs are running in
parallel all the time. Then you could create one queue with about 72 slots
and set the load_threshold to 64.
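
To illustrate how the two limits interact, here is a minimal sketch in
Python. It is only a model of the idea, not Grid Engine code: the function
name and its parameters are hypothetical, and it assumes the queue stops
accepting jobs once the load reaches the threshold.

```python
# Hypothetical model of the two limits on one queue instance; not SGE code.

def can_start_job(used_slots, total_slots, load_avg, load_threshold):
    """Return True if one more job may start on this queue instance."""
    if used_slots >= total_slots:
        # Slot limit reached: no free slot, whatever the load is.
        return False
    if load_avg >= load_threshold:
        # Assumed behaviour: at or above the threshold, no new jobs.
        return False
    return True

# 64-CPU SMP machine, one queue with 72 slots and load_threshold 64.
print(can_start_job(60, 72, 48.0, 64.0))  # free slots, load below threshold
print(can_start_job(66, 72, 64.5, 64.0))  # free slots, but load too high
print(can_start_job(72, 72, 40.0, 64.0))  # load is fine, but no slots left
```

With 72 slots, the machine can be oversubscribed while some jobs are idle,
and the threshold of 64 stops further jobs once all CPUs are really busy.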

When you have dual-CPU machines, a setup with one queue and two slots is
just fine, and you could delete the load_threshold entry from the queue
definition. If you want more than one queue (to organize the setup) and
still limit the total number of jobs on one machine to the number of CPUs
(i.e. 2), you could create a complex cpu_slots and set it to two for all
nodes.


#name      shortcut  type  value  relop  requestable  consumable  default
#-------------------------------------------------------------------------
cpu_slots  cu        INT   0      <=     NO           YES         1
##--- # starts a comment but comments are not saved across edits ---------


For each node:

complex_values             cpu_slots=2


This way, there will always be a limit of two jobs on each machine. I hope
this is what you want to achieve.
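
The bookkeeping behind such a consumable can be sketched roughly as follows
in Python. This is only an illustration of the counting, not Grid Engine
code: the class Host and its methods are hypothetical names, and it assumes
each job consumes the default amount of 1.

```python
# Hypothetical bookkeeping for a per-host consumable; not SGE code.

class Host:
    def __init__(self, name, cpu_slots=2):
        self.name = name
        # Corresponds to "complex_values cpu_slots=2" on the node.
        self.free = cpu_slots

    def try_start(self, request=1):
        """Start a job if enough of the consumable is left (default: 1)."""
        if request > self.free:
            return False
        self.free -= request
        return True

    def finish(self, request=1):
        """Give the consumed amount back when a job ends."""
        self.free += request

node = Host("node01")
# Jobs may arrive through different queues; all of them draw from the
# same per-host counter.
results = [node.try_start() for _ in range(3)]
print(results)            # [True, True, False]
node.finish()
print(node.try_start())   # True
```

The point is that the counter belongs to the host, not to a queue, so the
limit of two holds no matter how many queues are defined on the node.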


Cheers - Reuti



>What is the correct way to define the load average in Sun Grid Engine
>5.3 EE? Currently, on my cluster of 64 nodes, all with dual Pentium 4
>3.2 GHz processors, we use np_load_average as the load formula, and the
>threshold is set to 1.75 as of now.
>
>What should the load formula (np_load_average) be, and what should the
>adjustment be? Right now the adjustment is 0.50 and the load threshold is
>np_load_average 1.75, and new jobs are not submitted to the queue if
>np_load_average is above 1.75 on any of the nodes. Whereas if I log on to
>my compute nodes, I see that the nodes are quite free and the CPUs are
>mostly idle, since the jobs only use 10-20% of each CPU. And when I run
>programs locally to create artificial load, the load average goes to 5 and
>even 7, and only then do I see my node get a little busy.

BTW: Load adjustment creates artificial load, so that the load average is
higher immediately after a job starts, to keep another job from being
scheduled to the same machine. It decays over a time that you set up in the
scheduler configuration (until the load average reflects the actual usage
of the machine). This could also be removed with the above setup:

job_load_adjustments       NONE
load_adjustment_decay_time 0:0:00
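
For illustration, the effect of the adjustment can be sketched as follows in
Python. This is a simplified model, not Grid Engine code: the function name
is hypothetical, the decay is assumed to be linear, and the decay time of
450 seconds is only an example value.

```python
# Hypothetical model of a decaying load adjustment; not SGE code.

def adjusted_load(measured, adjustment, seconds_since_start, decay_time):
    """Load as seen by the scheduler shortly after a job was started."""
    if decay_time <= 0 or seconds_since_start >= decay_time:
        # No adjustment configured, or it has fully decayed.
        return measured
    # Assumed linear decay from 100% at job start down to 0%.
    remaining = 1.0 - seconds_since_start / decay_time
    return measured + adjustment * remaining

# Measured load 0.2, adjustment 0.5, example decay time of 450 seconds.
for t in (0, 225, 450):
    print(t, round(adjusted_load(0.2, 0.5, t, 450), 3))

# With a decay time of 0, as in the settings above, nothing is added.
print(adjusted_load(0.2, 0.5, 0, 0))
```

With the adjustment switched off, the scheduler only sees the measured load,
which is fine when the number of jobs per node is already limited otherwise.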


>Another thing I noticed, after which I saw the underutilization of my
>cluster: once I set up channel bonding (that is, teaming two NICs to act
>as one), the load average on my Linux boxes jumped to a minimum of 1.0 1.0
>1.0, even when no processes are running and I see the CPUs as 100% free.
>But this affected the number of jobs being submitted to the node, because
>Sun Grid Engine thought the node was already loaded.
>
>So my question is: is there any other way to evaluate the load on a node,
>or how should I go about setting the right threshold for a dual Pentium 4
>(3.2 GHz)? It is set to 1.75 right now.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



