[GE users] Accounting
mark.westwood at ohmsurveys.com
Tue Apr 19 10:12:40 BST 2005
We're currently on GE v5.3, planning to move to v6.x real soon, so my
experience may be only indirectly useful.
We run a 70 processor Opteron cluster, the o/s is Linux. When I took
over as cluster manager I (after giving myself the grand title 'cluster
manager') took on the task of producing useful (to 'management') reports
from the GE accounting file. I did what I guess you are doing - load
the accounting file into a home-brewed MySQL database for further
processing and report extraction.
Currently the process is a bit less automated than I would like, but
since we're upgrading soon I don't propose to change until the new
version is installed. I report monthly, so the couple of hours effort I
have to put in is not too onerous and it helps me to keep an eye on the
data. Here's what I do:
- Load the accounting file into a Tasks table; this table has one row
for each completed task. During this operation I massage the data a
bit, doing things like translating Unix timestamps into MySQL date-time values.
- From the Tasks table create a Jobs table which has one row for each
job - so for parallel jobs the Jobs table holds one row which aggregates
the data for the individual tasks. We have a policy in operation that
all jobs submitted to the cluster must have a project number attached
(ie grid engine rejects jobs without project numbers), so from the Jobs
table I can produce a number of useful summary reports - use by user and
by project, for whatever time period I'm interested in.
- Like you I'm interested in time series of cluster usage. Each month I
produce a time series of hour-by-hour CPU usage, that is a series of
numbers which shows, for each hour in a month, how many (out of 70)
processors were in use. First thing is to create (or delete the old
version of and re-create) a temporary table, called Hours, which will
eventually have one row for each hour in the month. In principle my
script will produce time series of usage for whatever period and
whatever interval I want, eg minute-by-minute for the last 3 weeks.
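A minimal sketch of the first two steps, written in Python rather than the shell and SQL described above. The field names (job_id, owner, project, slots, start, end) are hypothetical, not the real accounting or table columns. The aggregation rule (sum the slots, take the earliest start and the latest end) is an assumption about what "aggregates the data for the individual tasks" means.

```python
from datetime import datetime, timezone


def to_mysql_datetime(unix_ts):
    """Format a Unix timestamp (seconds) as 'YYYY-MM-DD HH:MM:SS', in UTC."""
    moment = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def aggregate_jobs(tasks):
    """Collapse task rows into one row per job, keyed by job_id.

    Assumed rule: a job's CPU count is the sum of its tasks' slots, and it
    spans from the earliest task start to the latest task end.
    """
    jobs = {}
    for task in tasks:
        job = jobs.get(task["job_id"])
        if job is None:
            # First task seen for this job: start a new Jobs row.
            jobs[task["job_id"]] = {
                "job_id": task["job_id"],
                "owner": task["owner"],
                "project": task["project"],
                "cpus": task["slots"],
                "start": task["start"],
                "end": task["end"],
            }
        else:
            # Further tasks of the same (parallel) job: fold them in.
            job["cpus"] += task["slots"]
            job["start"] = min(job["start"], task["start"])
            job["end"] = max(job["end"], task["end"])
    return jobs
```

Summaries by user or by project then reduce to grouping the resulting rows on "owner" or "project".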
Now things are going to get very kludgy, so hang on tight:
-- I have a shell script which executes a loop once for each hour in the month;
-- for each hour, the script executes an SQL command to select all Jobs
which ran for at least part of the hour;
-- for each Job selected, calculate how many CPU-seconds (product of
number of CPUs and number of seconds) the Job consumed in the hour.
This requires checking whether the job execution time overlapped the
start and end of the hour in question to calculate the time the job ran
during that hour.
-- for each hour, sum the CPU-seconds consumed by all jobs which ran for
some time in that hour. Divide this sum by the number of seconds in the
hour (3600) and take the nearest integer to that result. This number,
the average number of CPUs in use during the hour in question, is an
integer between 0 and 70 (the hour offers 70 * 3600 CPU-seconds in
all). It is written into the temporary Hours table.
-- and so on for each hour in the month.
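The loop above can be sketched in a few lines of Python. This is an illustration of the overlap arithmetic only, not the shell script itself. The function names and the (cpus, start, end) tuple layout are assumptions, with times given as Unix timestamps in seconds.

```python
def overlap_seconds(job_start, job_end, win_start, win_end):
    """Seconds for which a job was running inside the window [win_start, win_end)."""
    return max(0, min(job_end, win_end) - max(job_start, win_start))


def cpus_in_use(jobs, period_start, n_intervals, interval=3600):
    """Return one integer per interval: the average number of CPUs busy.

    jobs is an iterable of (cpus, start, end) tuples. Dividing the summed
    CPU-seconds by the interval length gives a CPU count; dividing by
    (total CPUs * interval) would instead give a utilisation fraction
    between 0 and 1.
    """
    jobs = list(jobs)
    series = []
    for i in range(n_intervals):
        win_start = period_start + i * interval
        win_end = win_start + interval
        cpu_seconds = sum(
            cpus * overlap_seconds(start, end, win_start, win_end)
            for cpus, start, end in jobs
        )
        series.append(round(cpu_seconds / interval))
    return series


# Example: a 4-CPU job running for two hours and a 2-CPU job running
# for the second half of the first hour.
example_jobs = [(4, 0, 7200), (2, 1800, 3600)]
print(cpus_in_use(example_jobs, period_start=0, n_intervals=3))  # [5, 4, 0]
```

Changing interval to 60 gives a minute-by-minute series. Looping over every job for every interval is slow for large accounting files, which is why selecting only the overlapping jobs in SQL first, as described above, is worthwhile.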
It ain't pretty but it works, and only takes a few minutes on my PC.
The result is the time series I wanted, which I can then import into,
say, Excel (although I also use Matlab here) for graphing. I then have
a nice graph of cluster usage for the month which I can include in my
reports. I also use Excel / Matlab for graphing the other data I extract
from the database.
Management then have the ammunition to wonder out loud why they bought
so many processors to be under-used, but that's another story.
If you want a copy of the shell script, email me directly.
Shaila Parashar wrote:
> We have a cluster consisting of 4 nodes running SGE 6.0u3 . The number
> of CPUs on these nodes are 24, 12, 8 and 4 . We have been running this
> cluster since Aug 2004 and are now in need of some statistics. I read
> the mailing lists and did get some ideas about the statistics. I also
> imported the accounting file into MySQL . I did manage to get statistics
> on the jobs - as average CPU time used, average waiting time, etc. But
> we need statistics similar to the following :-
> Number of CPUs used versus time. Also number of CPUs used on each of
> the hosts vs time.
> Basically we need plots vs time.
> I wanted to know if this is possible from the SGE accounting files and
> if so how ?
> I would appreciate it if you can suggest of any other values that we
> can plot against time.
> Any ideas/suggestions on how to get these values ( if possible ) will be
> really appreciated.
The Technology Centre
Offshore Technology Park
+44 (0)870 429 6586