[GE users] Monitoring Softwares...

Andy Schwierskott andy.schwierskott at sun.com
Mon Mar 21 08:52:38 GMT 2005


Hi,

there is a much more tunable and elegant way to get the values into a database
(this mechanism is used by Sun's ARCo, which Charu mentioned in his reply).

Via the cluster config parameter "reporting_params" (sge_conf(5)), the
accounting entries and arbitrary load values can be written to a flat ASCII
file. See reporting(5) for more information about the file format.

ARCo uses a small Java program, "dbwriter", which writes the data to a Postgres
or Oracle database, from where the web frontend of ARCo lets you define
queries. See the N1 Grid Engine documentation on docs.sun.com for
examples.

It's up to dbwriter to purge old values from the reporting file. This means
that this mechanism helps prevent the loss of monitoring data if dbwriter or
the database is not up: the records stay in the file until dbwriter has
processed them.
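
To illustrate why consumer-side purging protects the data, here is a minimal
Python sketch of the general "load first, delete afterwards" pattern. It is
not dbwriter's actual implementation: the function name, the ".processing"
suffix, the raw_records table and the use of sqlite3 (standing in for
Postgres or Oracle) are all assumptions made for the example.

```python
import os
import sqlite3


def drain_reporting_file(path, conn):
    """Load every line of `path` into the database, then delete the file.

    The file is removed only after the transaction commits, so a crash or a
    database outage leaves the data on disk for the next attempt.
    """
    work = path + ".processing"        # hypothetical name for the claimed file
    if not os.path.exists(work):
        if not os.path.exists(path):
            return 0                   # nothing to do
        os.rename(path, work)          # claim the file before reading it
    with open(work) as fh:
        rows = [(line.rstrip("\n"),) for line in fh if line.strip()]
    with conn:                         # commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS raw_records (line TEXT)")
        conn.executemany("INSERT INTO raw_records (line) VALUES (?)", rows)
    os.remove(work)                    # purge only after a successful commit
    return len(rows)
```

If the insert fails, the claimed file is still there and the next run picks
it up again, which is the property described above.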

Andy

> Sriram
>
> FWIW ...
>
> We simply load the accounting file into a MySQL database, currently once a 
> month but that's only because I only do the management reports once a month, 
> there's no good reason why we couldn't do it more or less frequently.
>
> Once the accounting data is in MySQL it's very easy to get out the sort of 
> information you want.  I have some SQL scripts which prepare tables of data 
> for the mgmt reports, and use Matlab for the graphics - but Excel would do 
> just as well.
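
A rough Python sketch of such a load step, for illustration only. It uses
sqlite3 instead of MySQL so that it runs standalone, and it is not Mark's
script. The field order is the one I recall from accounting(5) and should be
checked against the man page for your version; the table name "jobs", the
column name "grp" (GROUP is an SQL keyword) and the function name are made up
for the example.

```python
import sqlite3

# Leading fields of an accounting record, in the order recalled from
# accounting(5); verify against your installation before relying on it.
FIELDS = ["qname", "hostname", "grp", "owner", "job_name", "job_number",
          "account", "priority", "submission_time", "start_time", "end_time"]


def load_accounting(lines, conn):
    """Insert colon-delimited accounting records into a `jobs` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (%s)" % ", ".join(FIELDS))
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        parts = line.split(":")
        if len(parts) < len(FIELDS):
            continue                      # skip truncated records
        try:
            times = [int(x) for x in parts[8:11]]   # epoch seconds
        except ValueError:
            continue                      # skip records with bad timestamps
        rows.append(parts[:8] + times)
    placeholders = ", ".join("?" for _ in FIELDS)
    with conn:                            # one transaction for the whole batch
        conn.executemany("INSERT INTO jobs VALUES (%s)" % placeholders, rows)
    return len(rows)
```

With the data loaded, a query such as
"SELECT owner, SUM(end_time - start_time) FROM jobs GROUP BY owner" gives the
run time per user. Note that the naive split on ":" assumes no field contains
a colon itself.
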
>
> I wrote a shell script to read through the jobs in any period and determine 
> how many CPUs were in use at any time.  That was the hardest part of the set 
> up.  The 'logic' is easy enough, but it doesn't translate very easily into 
> SQL.  The script does something like:
>
> - select all jobs whose execution time overlaps with the period of interest 
> which might be a day, a month, a week, even an hour;
>
> - decide the sampling interval; for monthly reports the sampling interval is 
> hourly; but the script can sample second-by-second
>
> - for each sampling interval, for each job (this is a horrible hack but it 
> works) if the job end time is later than the start of the sampling interval, 
> add one to the total number of CPUs in use that hour;
>
> - keep looping until the analysis is complete, file the data and plot a graph 
> of hour-by-hour usage for the month.
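
The steps above can be sketched in Python roughly as follows. This is not
Mark's shell script: the function name and the (start, end) tuple layout are
assumptions, each job is counted as one CPU, and the per-interval test is the
full overlap check (started before the interval ends and ended after it
starts) rather than the end-time test alone.

```python
def cpus_in_use(jobs, period_start, period_end, step):
    """Count running jobs in each sampling interval of `step` seconds.

    `jobs` is an iterable of (start_time, end_time) pairs in epoch seconds.
    Each job counts as one CPU (an assumption; parallel jobs would need
    their slot count instead).
    """
    # Keep only jobs whose execution overlaps the period of interest.
    selected = [(s, e) for s, e in jobs if s < period_end and e > period_start]
    usage = []
    t = period_start
    while t < period_end:
        t_next = min(t + step, period_end)
        # A job is busy in [t, t_next) if it ran at any point in it.
        busy = sum(1 for s, e in selected if s < t_next and e > t)
        usage.append((t, busy))
        t = t_next
    return usage
```

Called with step=3600 over a month this gives the hour-by-hour series for
plotting; step=1 samples second by second, at the cost of one pass over the
selected jobs per sample.
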
>
> Hope this is some use to you.  I guess you could load the data directly into 
> Excel or Matlab or whatever your favourite analysis package is, but a 
> database gives you a lot of flexibility.  I have tables in the database for 
> clients and projects so that my reports can show how much usage we're making 
> of the cluster for each client, each user, that sort of thing.
>
> Regards
> Mark
>
> Sriram Sitaraman wrote:
>> Hi
>> 
>> 	Seems like this question has come up a few times with no real
>> good solution. Is there an "SGE" related monitoring system that
>> consolidates some important values like
>> 
>> 	Machine load
>> 	Machine CPU %
>> 	Mem_Total
>> 	Mem_Free
>> 
>> 	Jobs Submitted
>> 	CPU/ Per User
>> 	Jobs Pending
>> 	Average Turn around Time / Average Wait Time / Average Run Time
>> 	Job based timings
>> 	Idle Jobs
>> 	Job Jobs
>> 
>> We have been working on an interface, but managing the accounting file is
>> very hard, as it grows very fast. Also we are on version 6.0. Currently
>> some of the systems out there seem to be more cluster centric, but not
>> related to SGE. 
>> Any help..
>> 
>> Sriram
