[GE users] Knowing the historic number of jobs in a queue

Reuti reuti at staff.uni-marburg.de
Fri May 26 20:00:47 BST 2006


On 26.05.2006 at 20:35, Brett_W_Grant at raytheon.com wrote:

> Is there a way to know the number of jobs that are sitting in the
> queue?  My IT dept says that we are only running at 50%.  I think,
> but cannot prove, that we have had jobs sitting in the queue during
> this time.  From my point of view, the cluster is at 100% if jobs
> are waiting to run.  I can understand IT saying that its computers
> weren't busy, but I would like to show that from our standpoint, we
> were waiting for the computer and not that the computers were
> sitting idle.
> I can use qacct to find info about jobs that ran, but not about
> jobs that were queued up.  Is there some other file, or method, that
> will let me know what was queued?  The only other thing that I can
> think of is to put some form of qstat command in a cron job, but I
> was hoping that there was a file somewhere that would help me out.

It depends on what the IT dept means by 100%. If it's just the load of
all machines summed up and scaled to the number of CPUs in the
cluster, they may be right, even if you have waiting jobs. This can
happen with parallel applications which do not run in parallel all
the time, but have serial steps from time to time.

I don't know whether this applies to your usage of the cluster, but
in short: we therefore oversubscribe the nodes, and have one serial
slot (nice 19) and one parallel slot (nice 0) per core. If a parallel
job wants to run, it has preference, and during the other times the
serial job will run. With this setup you can come close to 100% all
the time.
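
As an illustration only, such an oversubscribed setup might look like
the two queue excerpts below. The queue names serial.q and parallel.q
and the dual-CPU nodes (slots 2) are assumptions, not the actual
configuration; the "priority" attribute of a queue is the nice value
its jobs run with.

```text
# qconf -sq serial.q   (excerpt, hypothetical queue name)
qname      serial.q
slots      2
priority   19

# qconf -sq parallel.q (excerpt, hypothetical queue name)
qname      parallel.q
slots      2
priority   0
```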

To check the load in the cluster, you could adjust this small script
to your environment, i.e. the names of the nodes:

$ cat cload

qhost | awk ' BEGIN                   { load = 0; cpus = 0 }

               # $3 = NCPU, $4 = LOAD; a host counts at most up to its CPU count
               /^node[0-9][0-9]/       { if ($4 != "-")
                                            { load += ($4 < $3) ? $4 : $3 }
                                         cpus += $3 }

               END                     { if (cpus > 0)
                                            printf "Total Cluster load: %.1f (%.2f%%)\n", load, load * 100 / cpus } '

(The "CPUs" are 100% busy if the load equals their count in a machine
or is higher, as the load is just the number of waiting processes
eligible to run [okay, with the newer Linux kernels processes in
state D are also included in this sum; hopefully this will not
influence the average too much].)

-- Reuti
