[GE users] qacct ambiguities

Ross Dickson Ross.Dickson at dal.ca
Wed Nov 28 19:27:18 GMT 2007


I have two questions about qacct output.

(1) Why do I sometimes see multiple records for a single job when using
"qacct -j <jobid>"? And why do many of these records have a bogus
qsub_time (Wed Dec 31 20:00:00 1969)?
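
One possible reading of that date, offered as a guess and not as anything
confirmed about SGE: it is what a zero timestamp (the Unix epoch) looks
like when printed in a UTC-4 time zone. The UTC-4 offset and the helper
name format_epoch are assumptions made for this sketch only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host prints times in UTC-4. The post does not state this.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=-4))


def format_epoch(seconds, tz=ASSUMED_LOCAL_TZ):
    """Format a Unix timestamp in the same layout that qacct printed."""
    moment = datetime.fromtimestamp(seconds, tz=tz)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# A timestamp of 0 (field never filled in) rendered in the assumed zone.
zero_time = format_epoch(0)
print(zero_time)
```

If the printed string matches the bogus qsub_time, that is consistent with
the field being zero in those records, but it does not explain why.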

(2) How should I interpret the WALLCLOCK and CPU times returned by
qacct? Consider:

 > qacct -d 30 -pe
PE       WALLCLOCK       UTIME     STIME        CPU         MEMORY
==================================================================
NONE        896093      125519        40     131683      70485.870
cre        6362816     5785288   1790762   14617130    9039932.662
mpich     25070095    21679280     11122   21690514    4943072.822
openmp     2586960     5213138      4152    9114528   15434070.616

Looking at the CRE parallel environment, I see that the ratio of CPU to
wallclock time is about 2.3, which suggests to me that the wallclock
time is just end time minus start time, with no slot count factored in.
The OpenMP figures show about 3.5 times as much CPU as wall time, leading
to the same conclusion.

However, for MPICH I see *more* wall time than CPU, which suggests either
  (a) we have a lot of MPICH jobs sitting around idling, or
  (b) the wall time reported for this parallel environment is multiplied 
by the slot count (or summed over slots), contradicting the conclusions 
above, or
  (c) the MPICH CPU total does *not* include all the slots.

If (c) is the case, then we're making a mistake using CPU time in our
usage accounting, aren't we? Are we seriously undercounting the MPI CPU
usage?
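
For reference, the ratios quoted above can be reproduced from the table
with a short script. This is only a sketch of the arithmetic: the numbers
are copied by hand from the qacct output, the dictionary layout and the
function name cpu_to_wall are my own, and nothing here settles which of
(a), (b) or (c) is the right explanation.

```python
# Totals copied from the "qacct -d 30 -pe" output above.
totals = {
    "NONE":   {"wallclock": 896093,   "cpu": 131683},
    "cre":    {"wallclock": 6362816,  "cpu": 14617130},
    "mpich":  {"wallclock": 25070095, "cpu": 21690514},
    "openmp": {"wallclock": 2586960,  "cpu": 9114528},
}


def cpu_to_wall(row):
    """Return CPU time divided by WALLCLOCK time for one PE row."""
    return row["cpu"] / row["wallclock"]


ratios = {pe: cpu_to_wall(row) for pe, row in totals.items()}

for pe, ratio in ratios.items():
    # A ratio above 1 means more CPU time than wall time was recorded.
    note = "CPU > wall" if ratio > 1 else "CPU < wall"
    print(f"{pe:8s} CPU/WALLCLOCK = {ratio:.2f}  ({note})")
```

The cre and openmp rows come out above 1 and the mpich row below 1, which
is the discrepancy I am asking about.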

This example is from SGE 6.0u7 running on Solaris, although I've seen 
similar mysteries on our Linux clusters as well, and on a machine which 
was recently upgraded from 6.0u7 to 6.1u2.


-- 
Ross Dickson         HPC Consultant
ACEnet               http://www.ace-net.ca
+1 902 494 6710      Skype: ross.m.dickson
