[GE users] Arco tool results differ from qacct

John Mc-Nicholas XJ (GU/ETL) john.xj.mc-nicholas at ericsson.com
Wed May 23 17:41:35 BST 2007


Hi Jana/Chansup/Daniel

Thanks for your help so far on this issue.
There seems to be something very weird going on!
I started comparing individual jobs on the arco tool and qacct.

I found a job that gets two very different sets of data depending on where
you look!
Even the job's start/end dates are different! It turns out that job 1327
in qacct corresponds to job 1339 in ARCo!


In qacct:
johnick at seasub1[~]# qacct -j 1327
==============================================================
qname        seashell.q          
hostname     seashell            
group        staff               
owner        etlelby             
project      NONE                
department   ELS                 
jobname      startdelhir11aca5   
jobnumber    1327                
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Wed May 23 08:20:17 2007
start_time   Wed May 23 08:20:49 2007
end_time     Wed May 23 11:00:29 2007
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  0                   
ru_wallclock 9580         
ru_utime     854          
ru_stime     59           
cpu          913          
mem          564.852           
io           0.000             
iow          0.000             
maxvmem      1.054G
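
To compare records like the one above field by field, the qacct output can be turned into a dictionary first. This is only a minimal sketch: the function name `parse_qacct_record` is hypothetical, and it assumes one "name value" pair per line as shown above.

```python
def parse_qacct_record(text):
    """Turn one `qacct -j` record into a dict of field name -> string value."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the ===== separator and the shell prompt line.
        if not line or line.startswith("=") or "qacct -j" in line:
            continue
        # The field name is the first token; the rest of the line is the value.
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record


# A few fields copied from the qacct output for job 1327.
sample = """\
jobnumber    1327
start_time   Wed May 23 08:20:49 2007
end_time     Wed May 23 11:00:29 2007
ru_wallclock 9580
cpu          913
mem          564.852
maxvmem      1.054G
"""

job = parse_qacct_record(sample)
print(job["jobnumber"], job["start_time"], job["ru_wallclock"])
```

All values stay as strings here; converting them to numbers or timestamps is left to whatever does the comparison.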

On Arco:

mem          736.666737
start        2007-05-22 08:49:41.0
id           1327
ju_cpu       1323
hostname     seashell
end time     2007-05-22 11:39:47.0
exit state   0
max vmem     1.26E+09
wallclock    10206


I figured that this data was too wrong to be the same job & sure enough
job 1339 matches the qacct for 1327! (apart from vmem!)

mem          564.852292
start        2007/05/23 08:20
id           1339
ju_cpu       913
hostname     seashell
end time     2007/05/23 11:00
exit state   0
max vmem     1.13E+09
wallclock    9580
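
On the vmem mismatch: the two figures may simply be in different units. Here is a quick check, assuming qacct's "G" suffix means 2^30 bytes and the ARCo column is in plain bytes (both are assumptions, not something confirmed in this thread):

```python
GIB = 1024 ** 3  # assumption: qacct's "G" suffix stands for 2**30 bytes

qacct_maxvmem_g = 1.054      # maxvmem reported by qacct for job 1327
arco_maxvmem_bytes = 1.13e9  # max vmem reported by ARCo for job 1339

# Convert the qacct figure to bytes and compare it with the ARCo figure.
qacct_bytes = qacct_maxvmem_g * GIB
rel_diff = abs(qacct_bytes - arco_maxvmem_bytes) / arco_maxvmem_bytes

print(f"qacct maxvmem in bytes: {qacct_bytes:.3e}")
print(f"relative difference:    {rel_diff:.2%}")
```

Under those assumptions the qacct value comes out at roughly 1.13e9 bytes, which is within rounding of the ARCo value, so the vmem figures may agree after all.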
  

What is going on? Has my entire SQL database been corrupted?
How can I verify this and, more importantly, fix it so that qacct and ARCo
match?
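
One way to verify this across many jobs would be to match records on what the job actually did rather than on its job number. Below is a minimal sketch, assuming both sides have been exported to lists of dicts. The field names `job_id`, `start`, `wallclock` and `cpu` and both function names are hypothetical; the sample data is taken from the records above.

```python
from datetime import datetime


def fingerprint(rec):
    """Key a job by start minute, wallclock seconds and rounded CPU seconds."""
    start = rec["start"].replace(second=0, microsecond=0)
    return (start, rec["wallclock"], round(rec["cpu"]))


def match_by_fingerprint(qacct_jobs, arco_jobs):
    """Return (qacct job id, ARCo job id) pairs for records that look identical."""
    arco_index = {fingerprint(r): r["job_id"] for r in arco_jobs}
    pairs = []
    for rec in qacct_jobs:
        arco_id = arco_index.get(fingerprint(rec))
        if arco_id is not None:
            pairs.append((rec["job_id"], arco_id))
    return pairs


qacct_jobs = [
    {"job_id": 1327, "start": datetime(2007, 5, 23, 8, 20, 49),
     "wallclock": 9580, "cpu": 913},
]
arco_jobs = [
    {"job_id": 1327, "start": datetime(2007, 5, 22, 8, 49, 41),
     "wallclock": 10206, "cpu": 1323},
    {"job_id": 1339, "start": datetime(2007, 5, 23, 8, 20),
     "wallclock": 9580, "cpu": 913},
]

for q_id, a_id in match_by_fingerprint(qacct_jobs, arco_jobs):
    note = "" if q_id == a_id else "  <-- job numbers differ"
    print(f"qacct {q_id} matches ARCo {a_id}{note}")
```

The start time is truncated to the minute because one of the ARCo rows above only shows minutes. If many pairs come back with differing job numbers, that would suggest the two sources number jobs differently rather than that the usage values themselves are corrupted, though that is only a guess at this point.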

Kind Regards

John




 

-----Original Message-----
From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
Sent: 21 May 2007 18:50
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Arco tool results differ from qacct

Hi Chansup,

Hmm, it does not look like that. The table sge_job_usage has fields
ju_failed, ju_exit_status and jobs that have different exit status than
0 are recorded and the view_accounting does not filter those out.

Jana

Chansup Byun wrote:
> Hi Jana,
>
> I could be wrong but if I remember correctly the sge_job_usage table 
> in ARCO  only stores jobs completed successfully.
> However, qacct also stores jobs failed with errors.
>
> Regards,
>
> - Chansup
>
> Jana Olivova wrote:
>> Hi,
>>
>> I don't see anything wrong with the query. You can also use the 
>> predefined Accounting per Department query, which does the same.
>>
>> I checked my setup with MySQL database and I get the same results 
>> with both ARCo and qacct. I don't have any sensible data in my 
>> Postgres db, because I was using the same grid with 3 different 
>> databases. So the only month I can compare is this one:
>>
>> qacct -b 200705010000 -e 200705312359
>> Total System Usage
>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>> ================================================================================================================
>>        889909             2            36           415              0.275              0.000              0.000
>>
>> ARCo Accounting per Department
>>
>> 2007-05-01
>> cpu     mem     io
>> defaultdepartment     415.155821     0.275125999999997     0.0
>>
>>
>> The one explanation for this, of course, would be if the same
>> database is used for more grids and/or (for February) that reporting
>> was not enabled the whole time. Not sure if that is a likely scenario
>> for you.
>>
>> Regards,
>>
>> Jana
>>
>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>> Hi Jana/Daniel
>>>
>>> In this case I queried sge_job_usage, but I have also used the
>>> accounting database.
>>> Does qacct group jobs according to the job's start time? I've done the
>>> same in the SQL query.
>>> So this SQL should total up the memory GBs for all the jobs started
>>> within each month.
>>>
>>>
>>> SQL:
>>> SELECT date_trunc('month', ju_start_time) AS month,
>>>        SUM (ju_mem) AS "mem "
>>> FROM sge_job_usage
>>> WHERE ju_start_time > (current_timestamp - interval '1 year')
>>> GROUP BY month
>>> ORDER BY month;
>>>
>>> Resulting table:
>>> month                   mem
>>> 2007-02-01 00:00:00.0   532138.750717
>>> 2007-03-01 00:00:00.0   5274933.144317
>>> 2007-04-01 00:00:00.0   6884688.555405
>>> 2007-05-01 00:00:00.0   2789895.540273
>>>
>>> Here are the results from the qacct command.
>>> Compare the MEMORY column to the table above.
>>> The results differ by a significant amount. A query on ju_cpu
>>> results in a similar discrepancy.
>>> qacct:
>>> johnick at seasub1[~]# qacct -b 200702010000 -e 200702312359
>>> Total System Usage
>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>> ================================================================================================================
>>>       2433584        289462        131581        854446         567582.583              0.000              0.000
>>> johnick at seasub1[~]# qacct -b 200703010000 -e 200703312359
>>> Total System Usage
>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>> ================================================================================================================
>>>       4753132       1041297         53389       2957120        3923641.991              0.000              0.000
>>> johnick at seasub1[~]# qacct -b 200704010000 -e 200704312359
>>> Total System Usage
>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>> ================================================================================================================
>>>       6118415       2063020        140069       4094226        5743492.079              0.000              0.000
>>> johnick at seasub1[~]# qacct -b 200705010000 -e 200705312359
>>> Total System Usage
>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>> ================================================================================================================
>>>       2746486        983188        156462       1761848        2388992.294              0.000              0.000
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
>>> Sent: 18 May 2007 18:58
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] Arco tool results differ from qacct
>>>
>>> I am having trouble replicating the issue, though. I keep running jobs
>>> (using Maintrunk GE) and the numbers keep matching.
>>>
>>> Jana
>>>
>>> Daniel Templeton wrote:
>>>> It may be worth noting that qacct and ARCo use different source
>>>> data files.  qacct uses the accounting file, and ARCo uses the
>>>> reporting file.  It is not inconceivable that there could be an
>>>> issue such that the qmaster might write different data to the two
>>>> files in some cases.
>>>>
>>>> Just a thought.
>>>>
>>>> Daniel
>>>>
>>>> Jana Olivova wrote:
>>>>> Hi John,
>>>>>
>>>>> I could check on the ARCo side. I have checked my data and they
>>>>> are both the same, except for the rounding that appears in qacct. I do
>>>>> have, however, a very small sample of data. Frankly, I am not sure
>>>>> what would cause this. ARCo only inserts the data that is given to it
>>>>> by the qmaster, in the reporting file.
>>>>>
>>>>> Can you tell me what SQL query you used to obtain the data in
>>>>> ARCo and what database you are using?
>>>>>
>>>>> Jana Olivova
>>>>>
>>>>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>>>>> Hi All
>>>>>>
>>>>>> I am basically having the same problem that Todd Heywood had
>>>>>> earlier in the year.
>>>>>> He gave up on the ARCo tool in the end; I hope I don't have to do
>>>>>> the same.
>>>>>>
>>>>>>> Heywood, Todd wrote:
>>>>>>>> How does ARCo report time and memory? I assumed it would be the
>>>>>>>> same as for qacct, for which it is seconds and Gbytes (according
>>>>>>>> to "man accounting"). But qacct and ARCo are reporting different
>>>>>>>> numbers. Unit conversions don't account for the diffs
>>>>>>
>>>>>> The ARCo tool produces nice graphs and the SQL works fine, but
>>>>>> when I compare it to the output of qacct, it is a completely
>>>>>> different set of results.
>>>>>>
>>>>>> There is some correlation between the data. For example, April's
>>>>>> usage is the highest in both sets of results, and the users with the
>>>>>> most usage also correspond in both sets of data.
>>>>>> But the actual data seems to be randomly out by something on the
>>>>>> order of 20-30%.
>>>>>>
>>>>>> I'm specifically trying to extract grid jobs memory (Gigabyte
>>>>>> seconds) per month
>>>>>> For example the data for April
>>>>>> qacct -b 200704010000 -e 200704312359 MEMORY 5743492.079
>>>>>>
>>>>>> But the output in arco gives.........
>>>>>> 6324866.240448
>>>>>>
>>>>>> Is this a bug in ARCO/GRID ?
>>>>>> What would cause this behaviour?
>>>>>>
>>>>>> The only strange thing I've noticed is that I have 2 dbwriter
>>>>>> processes instead of 1 & 5 postmaster instead of 3.
>>>>>>
>>>>>> sgeadm 1430 1422 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>> sgeadm 1422 1 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>> postgres 1402 1401 0 May 10 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>> postgres 1403 1402 0 May 10 ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>> postgres 1401 1 0 May 10 ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>> postgres 13303 1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>> postgres 9719 1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>
>>>>>> If you've any ideas please get back to me & I'll give you more 
>>>>>> detailed info.
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> John
>>>>>> */ John Mc Nicholas /*
>>>>>>
>>>>>> * STE/SEA Support Engineer *
>>>>>> * BETE Test Plants UK *
>>>>>> E
>>>>>>
>>>>>> Phone: +44 (0) 1483 305458
>>>>>> Email: john.xj.mc-nicholas at ericsson.com
>>>>>> Address: Ericsson, Midleton Gate, Guildford Business Park, 
>>>>>> Guildford, Surrey, GU2 8SG , UK


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



