[GE users] wildly inaccurate cpu usage, SGE 6.0u4

SLIM H.A. h.a.slim at durham.ac.uk
Thu Jan 24 16:06:25 GMT 2008


The investigations Lydia mentioned relate to a problem I found with
the wallclock time reported by qacct for tightly integrated parallel
jobs.

I want to report the wallclock time of a job multiplied by the number
of slots, i.e. slot-seconds, as that is the actual time for which the
resources are unavailable for other work. I am not using CPU time,
because it only matches that figure when a program runs efficiently,
which is not always the case.

So I asked qacct to print the parallel environment and number of slots
for a period of time. These are printed under a heading like this:

OWNER   PROJECT          PE      SLOTS     WALLCLOCK         UTIME         STIME           CPU
etc.

The heading WALLCLOCK is misleading. For a tightly integrated job, the
time quoted there is not the wallclock time of the job, e.g. that of
the master process, but the sum of the ru_wallclock values of all the
entries for that particular job in the accounting file. The number of
entries is the number of nodes used, plus one for the master.
For example, in one case a 32-slot job ran on 8 nodes with 4 cores
each, which gave 8+1 entries in the accounting file. The ru_wallclock
for each node was about 1120 s and the CPU time about 4440 s. qacct
reported the CPU value correctly as 8*4440, but reported WALLCLOCK as
9*1120, which is not what I would expect.
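
To put numbers on that, using the approximate figures above:

  WALLCLOCK as printed by qacct:  9 * 1120  = 10080 s
  actual wallclock of the job:    1120 s
  slot-seconds I want to charge:  1120 * 32 = 35840 s

Multiplying qacct's WALLCLOCK by the slot count instead gives
10080 * 32 = 322560 slot-seconds, i.e. an overcharge by a factor of 9,
the number of accounting entries.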

This causes me a bit of trouble, because I now have to parse the
accounting file myself to get the correct wallclock time.
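
For what it is worth, a rough Python sketch of that workaround is
below. The field positions are the ones listed in the SGE 6.x
accounting(5) man page (job_number is field 6, ru_wallclock field 14,
slots field 35, pe_taskid field 42, counting from 1); check them
against your installation. It also assumes the master entry of a
tightly integrated job (the one with pe_taskid NONE) carries the job's
full slot count, as it did for the job above.

#!/usr/bin/env python
# Rough sketch: charge each job its master-entry ru_wallclock times
# its slot count, instead of qacct's summed WALLCLOCK column.
import sys

JOB_NUMBER   = 5     # field  6: job_number
RU_WALLCLOCK = 13    # field 14: ru_wallclock
SLOTS        = 34    # field 35: slots
PE_TASKID    = 41    # field 42: pe_taskid

def slot_seconds(acct_path):
    usage = {}
    with open(acct_path) as acct:
        for line in acct:
            if line.startswith('#'):      # skip the comment header
                continue
            f = line.rstrip('\n').split(':')
            # Slave entries of a tightly integrated job carry a real
            # pe_taskid; those are what qacct sums into WALLCLOCK.
            # Keep only the master entry (pe_taskid 'NONE'), whose
            # ru_wallclock is the job's real wallclock. Serial jobs
            # have a single entry, also with 'NONE'.
            if f[PE_TASKID] != 'NONE':
                continue
            usage[f[JOB_NUMBER]] = float(f[RU_WALLCLOCK]) * int(f[SLOTS])
    return usage

if __name__ == '__main__':
    for job, secs in sorted(slot_seconds(sys.argv[1]).items()):
        print('%-10s %12.0f slot-seconds' % (job, secs))

Run it against the accounting file, typically
$SGE_ROOT/default/common/accounting. If ru_wallclock of the master
entry ever looks suspect, end_time - start_time of the same entry
(fields 11 and 10) is an alternative.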

Does this qualify as a bug?

Henk

> -----Original Message-----
> From: Lydia Heck [mailto:lydia.heck at durham.ac.uk] 
> Sent: 24 January 2008 13:50
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] wildly inaccurate cpu usage, SGE 6.0u4
> 
> 
> Hi Aaron,
> 
> one of my colleagues, Henk Slim, has experienced the same 
> problem and so we started to investigate:
> 
> It turns out that for a parallel job the wallclock time is
> calculated per node participating in that job: it runs from when
> the slot processes of the job start on that node to when they all
> finish on that node, irrespective of the number of slots the job
> uses on that node.
> 
> The final wallclock time is then the sum of wallclock_node over all
> participating nodes, plus the wallclock on the master node.
> 
> If you then multiply the wallclock time reported by qacct by the
> number of slots for the job, you get a totally wrong number for the
> resources used.
> 
> Lydia
> 
> 
> On Thu, 24 Jan 2008 aaron at cs.york.ac.uk wrote:
> 
> > Dear all,
> >
> > I was doing some detailed analysis of the job mix on our system
> > from the past year, to find out if the resources offered match
> > those we provide, so as to inform future purchasing decisions. At
> > first, analyses run on the accounting file suggested that they did
> > not match.
> >
> > On closer analysis, however, it seems that in a very few instances
> > the cpu time used exceeded the ru_wallclock*slots time by order(s)
> > of magnitude. Has anyone else seen this, and in what circumstances?
> > The jobs affected seem to have failed.
> >
> > Regards, Aaron Turner
> >
> >
> 
> ------------------------------------------
> Dr E L  Heck
> 
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
> 
> DURHAM, DH1 3LE
> United Kingdom
> 
> e-mail: lydia.heck at durham.ac.uk
> 
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



