[GE users] wildly inaccurate cpu usage, SGE 6.0u4

Reuti reuti at staff.uni-marburg.de
Thu Jan 24 22:18:57 GMT 2008


Hi,

Am 24.01.2008 um 17:06 schrieb SLIM H.A.:

> The investigations Lydia mentioned were related to a problem I found
> with the wallclock time reported by qacct for parallel jobs with tight
> integration.
>
> I want to report the wallclock time of a job, multiplied by the
> number of slots, as that is the actual time the resources are not
> available for other work. I am not using cpu time, as it only
> reflects the occupied time when a program is efficient, which is
> not always the case.
>
> So I asked qacct to print the parallel environment and number of
> slots for a period of time. These are printed with a heading like
> this:
>
> OWNER   PROJECT   PE   SLOTS   WALLCLOCK   UTIME   STIME   CPU
> etc.
>
> The heading WALLCLOCK is misleading. The time quoted there for a
> tightly integrated job is not the wallclock time of the job, e.g.
> that of the master process, but the sum of the ru_wallclock values
> of all the entries for that particular job in the accounting file.
> The number of entries is the number of nodes used plus one for the
> master. For example, in one case a 32-slot job used 8 nodes with 4
> cores each, which gave 8+1 entries in the accounting file. The
> ru_wallclock for each node

One record for the master job, plus eight for the "qrsh -inherit" invocations.

> was ca. 1120 and the cpu was ca. 4440. qacct reported the CPU value
> correctly as 8*4440, but the WALLCLOCK as 9*1120, which is not what
> I would expect.
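
In other words, qacct's WALLCLOCK here is the sum over all nine
records, 9 * 1120 = 10080 s, whereas the figure you are after is the
number of slots times the master's wallclock, 32 * 1120 = 35840 s.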

Every job, and in addition every tightly integrated "qrsh -inherit",
will create an accounting record - this is what you observe.
Depending on your needs, this might or might not be what you want.
In other circumstances it can be useful to have it this way, to
verify whether the job was really qrsh-ing to all granted nodes for
the whole time or not.
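
For illustration, you can list the records of a single job straight
from the accounting file (a rough sketch; per accounting(5) the file
is colon-separated, with the job number in field 6, the submission
time in field 9 and ru_wallclock in field 14 - 4711 below is just a
placeholder job number):

$ awk -F: '($6 == 4711) {print $1, $2, $9, $14}' /usr/sge/default/common/accounting

The one record with a non-zero submission time is the master job, the
records with submission time 0 are the "qrsh -inherit" invocations.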

If you want to honor only the master job, you can filter out all
"qrsh -inherit" records - their submission time is always zero - by
setting their wallclock time to 0:

$ qacct -pe -f <(awk 'BEGIN {FS=":"; OFS=":"} ($9 == 0) {$14 = 0} {print $0}' /usr/sge/default/common/accounting)
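
If what you actually need is the wallclock multiplied by the slots, a
rough sketch to sum it up directly from the master records only could
be (assuming slots is field 35 of the accounting file in 6.0 - please
check accounting(5) for your version):

$ awk -F: '($9 != 0) {sum += $14 * $35} END {print sum}' /usr/sge/default/common/accounting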

Maybe you can also look into the ARCo tool to formulate more
specialized queries.

HTH - Reuti


> This causes me a bit of trouble as now I have to parse the accounting
> file to get the correct wallclock time.
>
> Does this qualify as a bug?
>
> Henk
>
>> -----Original Message-----
>> From: Lydia Heck [mailto:lydia.heck at durham.ac.uk]
>> Sent: 24 January 2008 13:50
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] wildly inaccurate cpu usage, SGE 6.0u4
>>
>>
>> Hi Aaron,
>>
>> one of my colleagues, Henk Slim, has experienced the same
>> problem and so we started to investigate:
>>
>> It turns out that for a parallel job the wallclock time is
>> calculated per node participating in that job: it runs from when
>> the slot processes of the job start on that node until they all
>> finish, irrespective of the number of slots used on that node for
>> that job.
>>
>> The final wallclock time is then the sum of wallclock_node over
>> all participating nodes, plus the wallclock on the master node.
>>
>> If you then multiply the wallclock time reported by qacct by the
>> number of slots for the job, you get a completely wrong figure for
>> the resources used.
>>
>> Lydia
>>
>>
>> On Thu, 24 Jan 2008 aaron at cs.york.ac.uk wrote:
>>
>>> Dear all,
>>>
>>> I was doing some detailed analysis of the job mix on our system
>>> from the past year, to find out if the resources in demand match
>>> those we provide, so as to inform future purchasing decisions. At
>>> first it looked, from analyses run on the accounting file, as
>>> though they did not.
>>>
>>> On closer analysis, however, it seems that in a very few instances
>>> the cpu time used exceeded the ru_wallclock*slots time by an order
>>> of magnitude or more. Has anyone else seen this, and in what
>>> circumstances? The jobs affected seem to have failed.
>>>
>>> Regards, Aaron Turner
>>>
>>>
>>>
>>>
>>
>> ------------------------------------------
>> Dr E L  Heck
>>
>> University of Durham
>> Institute for Computational Cosmology
>> Ogden Centre
>> Department of Physics
>> South Road
>>
>> DURHAM, DH1 3LE
>> United Kingdom
>>
>> e-mail: lydia.heck at durham.ac.uk
>>
>> Tel.: + 44 191 - 334 3628
>> Fax.: + 44 191 - 334 3645
>> ___________________________________________
>>
>>
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



