[GE users] Theoretical question about wallclock and qacct

icaci hristo at mc.phys.uni-sofia.bg
Wed Mar 4 00:21:20 GMT 2009


Hi,

As I understand it, the accounting works like this: your parallel job
spans several hosts, and on each host there is an SGE shepherd that
manages the part of the job running on that host. Each shepherd sums the
CPU time of its children and writes it to the accounting file along with
the ru_wallclock, which is measured by the shepherd's own lifetime. The
average CPU utilisation on that host by the given job is then
(sum of children CPU time) / (ru_wallclock * host slots)
Unfortunately shepherds do not record the number of slots allocated on
the host (though it is available in the pe_hostfile at runtime). But
you can sum up the CPU time from all accounting entries for the given  
job and then divide that by the ru_wallclock of the master shepherd  
times the total number of slots (that one gets recorded in the  
accounting file) to get the average CPU utilisation for the parallel  
job as a whole. That makes sense in homogeneous setups where each exec
host defines a value for the 'slots' complex equal to the number of
its CPU cores, since all job slots are allocated for the duration of
the whole job. At least that's how we analyse our cluster usage using a
simple Python script to parse the SGE accounting file and do the  
computations mentioned. It is also possible to parse the output of  
'qacct -j' for the same information, but it's not directly obvious (at
least not to me) which entry is from the master shepherd (probably the
one with the largest ru_wallclock value?). In the accounting file its  
record has NONE in the field before the last one (in 6.2 there are two  
more fields).
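
A minimal Python sketch of the whole-job computation described above. It is not the actual script mentioned; job_utilisation is a made-up name. The field positions are assumptions based on the accounting(5) layout and should be checked against your installation: job_number, ru_wallclock, ru_utime, ru_stime and slots are counted from the left, and the NONE marker is counted from the right (2 for 6.1, 4 for 6.2). CPU time is taken here as ru_utime + ru_stime.

```python
from collections import defaultdict

# Assumed 0-based field positions, counted from the left (see accounting(5)).
JOB_NUMBER, RU_WALLCLOCK, RU_UTIME, RU_STIME, SLOTS = 5, 13, 14, 15, 34


def job_utilisation(lines, pe_taskid_from_end=2):
    """Return {job_number: average CPU utilisation} for parallel jobs.

    pe_taskid_from_end is the position, counted from the end of the record,
    of the field that holds NONE in the master shepherd's entry
    (assumed to be 2 for 6.1 and 4 for 6.2).
    """
    cpu = defaultdict(float)  # CPU seconds summed over all entries of a job
    master = {}               # job -> (ru_wallclock, slots) of the master entry
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(":")
        job = fields[JOB_NUMBER]
        cpu[job] += float(fields[RU_UTIME]) + float(fields[RU_STIME])
        if fields[-pe_taskid_from_end] == "NONE":
            master[job] = (float(fields[RU_WALLCLOCK]), int(fields[SLOTS]))
    result = {}
    for job, (wallclock, slots) in master.items():
        if wallclock > 0 and slots > 0:
            result[job] = cpu[job] / (wallclock * slots)
    return result
```

To try it, pass an open accounting file, e.g. job_utilisation(open(path)), where path points at your cell's accounting file. Array tasks share a job number, so they would need the task number added to the key.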

You can also write a start_proc_args script (or stop_proc_args, or
prolog/epilog) for the OpenMPI PE to copy the pe_hostfile
($PE_HOSTFILE) somewhere safe
for later analysis. Then you can combine the slots allocation from the  
hostfile with the accounting data to extract the exact number of slots  
allocated on each host at a given time. Or you can just use the  
verbose reporting functionality of SGE together with ARCo. You can  
also deploy Ganglia or a similar cluster monitoring tool to get an
overview of the utilisation.
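
A rough Python sketch of how a saved pe_hostfile could be combined with the accounting data to apply the per-host formula above. It assumes the usual pe_hostfile layout of one line per host and queue, with the host name in the first column and the slot count in the second; parse_pe_hostfile and host_utilisation are made-up names.

```python
def parse_pe_hostfile(text):
    """Return {hostname: slots} from the contents of a saved pe_hostfile."""
    slots = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        host, count = parts[0], int(parts[1])
        # A host can be listed more than once (one line per queue), so add up.
        slots[host] = slots.get(host, 0) + count
    return slots


def host_utilisation(cpu_seconds, ru_wallclock, host_slots):
    """(sum of children CPU time) / (ru_wallclock * host slots)."""
    return cpu_seconds / (ru_wallclock * host_slots)
```

For example, a host with 8 slots whose accounting entry shows 14400 CPU seconds over a ru_wallclock of 3600 seconds comes out at 0.5, i.e. 50% utilisation.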

Hope that helps.

-- Hristo

On 03.03.2009, at 22:22, mhanby wrote:

> This test was spawned by one of our grant writers asking "Is there a way
> for me to query grid engine to figure out how utilized the cluster is
> from day to day, week to week or month to month?"
>
> Based on what's in the accounting file, I don't see how that's possible.
> I was thinking along the lines of trying to figure out the theoretical
> max amount of compute time based on the number of slots, and use that
> combined with actual usage to determine what percentage of the max was
> used.
>
> By the way, my accounting file has 5 entries, 2 for the master host and
> 3 for the remaining compute nodes that were used. I must have miscounted
> before, it actually ran on 4 hosts, not 5, which would make sense since
> each host has 8 cores / slots.
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, March 03, 2009 12:49 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Theoretical question about wallclock and qacct
>
> Am 03.03.2009 um 16:14 schrieb mhanby:
>
>> I have a theoretical question regarding the number presented by qacct
>> for WALLCLOCK. I'm running GE 6.1u5 with OpenMPI 1.2.8 compiled on the
>> head node (so OpenMPI should be GE aware).
>>
>> As a test, I ran a 32 slot OpenMPI job that had a total runtime of 60
>> minutes. The WALLCLOCK reported in the email delivered after job
>> completion was 1:00:04 hours.
>>
>> The qacct command for that same job reports 18020 seconds, which
>> translates to ~5:00:20 hours.
>
> Is there only one record in the accounting file for this job? With a
> Tight Integration you should get 6 - one for the jobscript and one
> for each qrsh made.
>
>> The job ran on 5 hosts, so it appears that the WALLCLOCK is only
>> recording the seconds on each host and not each CPU / slot?
>>
>> Is this the way it's supposed to work, or is this a tight vs loose
>> integration thing?
>>
>> I would have expected the WALLCLOCK for the 32 slot job to be
>> ~32:00:00 hours
>
> For 6.2 it's an issue for the new ability to summarize the accounting
> records automatically:
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2787
>
> Although it's still open to discussion whether the walltime should be
> summed up at all. You could even argue that it should be just the time
> passed by, i.e. 1 hr in your case. So upgrading wouldn't help in your
> case anyway.
>
> -- Reuti
>
>
>> Thanks,
>>
>> Mike
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=119640
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=119954

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


