[GE users] Knowing the historic number of jobs in a que

Brett_W_Grant at raytheon.com Brett_W_Grant at raytheon.com
Fri May 26 20:28:03 BST 2006


What they are measuring is a little confusing, but I believe that they
add up the CPU seconds used by all of the jobs in one week and divide
that by the number of CPU seconds available in a week.  Now that I think
about it, I have no idea whether they use user time or wall-clock time.
I am also not sure how they adjust for the number of slots, computers,
etc.

We don't run any parallel jobs.  We have two different styles of job.
In the first, we create a grid of points to test, and each point takes a
couple of minutes to run.  In the second, a wrapper program tries to
intelligently "pick" the points that we simulate.  With the first
(shotgun) approach, they bump the slots up to 2.5x the number of
processors in the machine.  That seems to work fairly well.  With the
wrapper style, I find that if we have more slots than processors, we can
keep the user time above 90%, but a job that would normally take an hour
takes 3 hours.
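
As a rough sketch of the weekly figure described in this message, assuming
it really is "CPU seconds used, divided by CPU seconds available" (the
formula IT actually uses is not known here). The function name and the
sample numbers are made up for illustration:

```sh
#!/bin/sh
# Hypothetical helper: weekly utilization in percent.
#   $1 = CPU seconds used by all jobs in the week (sample value)
#   $2 = number of CPUs/slots counted as available (assumption)
weekly_utilization() {
    awk -v used="$1" -v slots="$2" 'BEGIN {
        week = 7 * 24 * 3600          # 604800 seconds in one week
        printf "%.1f\n", used * 100 / (slots * week)
    }'
}

# Example: 8 CPUs that accumulated 2419200 CPU seconds in one week
weekly_utilization 2419200 8          # prints 50.0
```

A figure like this says nothing about how long jobs waited before they
started, or about how long each job took in wall-clock terms.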

I guess that I need to come up with some way to prove that the total
wall-clock time for my project is more important than showing that
processor user time is always greater than 90%.


Thanks for the replies, I will try some of them out.

Brett Grant





From: Reuti <reuti at staff.uni-marburg.de>
Date: 05/26/2006 12:00 PM
Reply-To: users at gridengine.sunsource.net
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Knowing the historic number of jobs in a que


Hi,

On 26.05.2006 at 20:35, Brett_W_Grant at raytheon.com wrote:

>
> Is there a way to know the number of jobs that are sitting in the
> queue?  My IT dept says that we are only running at 50%.  I think,
> but cannot prove, that we have had jobs sitting in the queue during
> this time.  From my point of view, the cluster is at 100% if jobs
> are waiting to run.  I can understand IT saying that its computers
> weren't busy, but I would like to show that from our standpoint, we
> were waiting for the computers and not that the computers were
> sitting idle.
>
> I can use qacct to find info about jobs that ran, but not about
> jobs that were queued up.  Is there some other file or method that
> will let me know what was queued?  The only other thing that I can
> think of is to put some form of qstat command in a cron job, but I was
> hoping that there was a file somewhere that would help me out.
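
A minimal sketch of the cron idea from the quoted question, assuming that
`qstat -s p -u '*'` lists the pending jobs of all users below two header
lines. The function name, the sample job lines, the log path and the
sampling approach are made up for illustration:

```sh
#!/bin/sh
# Hypothetical helper: count pending jobs from qstat output on stdin.
# Assumes two header lines (column titles, dashed rule) before the jobs;
# empty input (no pending jobs) counts as 0.
count_pending() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Try it on made-up sample output instead of a live cluster:
count_pending <<'EOF'
job-ID  prior   name   user   state submit/start at     queue  slots
---------------------------------------------------------------------
    101 0.55500 sim_a  brett  qw    05/26/2006 10:00:00            1
    102 0.55500 sim_b  brett  qw    05/26/2006 10:00:05            1
EOF

# In a script run from cron, one might append a timestamped sample
# (log path is an assumption; cron also needs the Grid Engine environment):
# echo "$(date '+%Y-%m-%d %H:%M') $(qstat -s p -u '*' | count_pending)" >> "$HOME/pending.log"
```

A log like this would show when jobs were waiting even though the machines
were not fully loaded. A pending array job may appear as a single line, so
the count is a lower bound on waiting tasks.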

It depends on what the IT dept means by 100%. If it's just the load of
all machines summed up and scaled to the number of CPUs in the
cluster, they may be right, even if you have waiting jobs. This can
happen with parallel applications that do not run in parallel all the
time, but have serial steps from time to time.

I don't know whether this applies to your usage of the cluster, but
for that reason we oversubscribe the nodes and have one
serial slot (nice 19) and one parallel slot (nice 0) per core. If a
parallel job wants to run, it has preference, and the rest of the
time the serial job runs. With this setup you can come close to
100% all the time.

To check the load in the cluster, you could adjust this small script
to your environment, i.e. the names of the nodes:

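$ cat cload
#!/bin/sh
# Sum the load of all nodes, capping each node at its CPU count,
# and print the total as a share of all CPUs in the cluster.

qhost | awk ' BEGIN                   { load = 0; cpus = 0 }

               # $3 = NCPU, $4 = LOAD; "-" means no load was reported
               /^node[0-9][0-9]/       { if ($4 != "-")
                                            { load += ($4 < $3) ? $4 : $3 }
                                         cpus += $3
                                       }

               END                     { if (cpus > 0)
                                            printf "Total Cluster load: %.1f (%.2f%%)\n", load, load * 100 / cpus } '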

(The "CPUs" are 100% busy if the load equals their count in a machine or
is higher, since the load is just the number of processes eligible to
run [okay, with the newer Linux kernels processes in state D are also
included in this sum - hopefully that will not influence the average too
much].)
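
To illustrate the capping, here is a small sketch that applies the same
rule to made-up qhost-style lines read from stdin, so it can be tried
without a cluster. The function name, host names and values are
assumptions for illustration only:

```sh
#!/bin/sh
# Hypothetical helper: capped load sum over qhost-style lines on stdin.
# Column 3 is taken as the CPU count and column 4 as the load.
capped_load() {
    awk '/^node[0-9][0-9]/ { if ($4 != "-") load += ($4 < $3) ? $4 : $3
                             cpus += $3 }
         END               { if (cpus > 0)
                                 printf "%.1f %.2f\n", load, load * 100 / cpus }'
}

# node01 is overloaded (counted as 2.0, not 2.5); node03 reports no load.
capped_load <<'EOF'
node01 lx24-amd64 2 2.50 2.0G 1.0G 4.0G 0.0
node02 lx24-amd64 2 0.50 2.0G 0.5G 4.0G 0.0
node03 lx24-amd64 2 -    2.0G -    4.0G -
EOF
# prints: 2.5 41.67
```

The node without a load value still counts towards the CPU total, so it
lowers the percentage.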

-- Reuti
