[GE users] Arco tool results differ from qacct

Jana Olivova Jana.Olivova at Sun.COM
Fri May 25 10:59:35 BST 2007


Hi John,

Can you also send the exact SQL statement that you used to retrieve this 
information from the database?

Thanks

Jana
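
For the per-job comparison quoted below, a small script can search the ARCo rows for the one whose usage figures agree with a given qacct record. This is only a minimal sketch in Python using the numbers quoted in this thread. The dictionary layout and the names `find_matching_arco_job` and `vmem_to_bytes` are made up for illustration and are not part of qacct or ARCo. It also assumes that the "G" suffix in qacct's maxvmem means 2^30 bytes.

```python
import math

# Figures copied from the "qacct -j 1327" output quoted in this thread.
QACCT_JOB = {"jobnumber": 1327, "ru_wallclock": 9580, "cpu": 913,
             "mem": 564.852, "maxvmem": "1.054G"}

# Rows copied from the two ARCo results quoted in this thread.
ARCO_ROWS = [
    {"id": 1327, "wallclock": 10206, "ju_cpu": 1323,
     "mem": 736.666737, "max_vmem": 1.26e9},
    {"id": 1339, "wallclock": 9580, "ju_cpu": 913,
     "mem": 564.852292, "max_vmem": 1.13e9},
]


def vmem_to_bytes(text):
    """Convert a size such as '1.054G' to bytes.

    Assumption: K/M/G are powers of 1024.
    """
    factors = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in factors:
        return float(text[:-1]) * factors[suffix]
    return float(text)


def find_matching_arco_job(qacct_job, arco_rows, rel_tol=0.01):
    """Return the ids of ARCo rows whose wallclock, cpu and mem all
    agree with the qacct record within the relative tolerance."""
    matches = []
    for row in arco_rows:
        same = (
            math.isclose(row["wallclock"], qacct_job["ru_wallclock"], rel_tol=rel_tol)
            and math.isclose(row["ju_cpu"], qacct_job["cpu"], rel_tol=rel_tol)
            and math.isclose(row["mem"], qacct_job["mem"], rel_tol=rel_tol)
        )
        if same:
            matches.append(row["id"])
    return matches


matching_ids = find_matching_arco_job(QACCT_JOB, ARCO_ROWS)
vmem_bytes = vmem_to_bytes(QACCT_JOB["maxvmem"])
print(matching_ids)          # ids of ARCo rows that agree with qacct job 1327
print(f"{vmem_bytes:.3g}")   # qacct maxvmem expressed in bytes
```

With the thread's numbers this prints [1339] and 1.13e+09, so under the 2^30 assumption the max vmem that ARCo shows for job 1339 also agrees with qacct's 1.054G.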

John Mc-Nicholas XJ (GU/ETL) wrote:
> Hi Jana/Chansup/Daniel
>
> Thanks for your help so far on this issue.
> There seems to be something very weird going on!
> I started comparing individual jobs on the arco tool and qacct.
>
> I found a job that gets 2 very different sets of data depending on where
> you look!
> Even the date of job start/end is different! It turns out that job 1327
> on qacct corresponds to job 1339 on ARCO!
>
>
> In qacct:
> johnick at seasub1[~]# qacct -j 1327
> ==============================================================
> qname        seashell.q          
> hostname     seashell            
> group        staff               
> owner        etlelby             
> project      NONE                
> department   ELS                 
> jobname      startdelhir11aca5   
> jobnumber    1327                
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Wed May 23 08:20:17 2007
> start_time   Wed May 23 08:20:49 2007
> end_time     Wed May 23 11:00:29 2007
> granted_pe   NONE                
> slots        1                   
> failed       0    
> exit_status  0                   
> ru_wallclock 9580         
> ru_utime     854          
> ru_stime     59           
> cpu          913          
> mem          564.852           
> io           0.000             
> iow          0.000             
> maxvmem      1.054G
>
> On Arco:
>
> mem         start                  id    ju_cpu  hostname  end time               exit state  max vmem  wallclock
> 736.666737  2007-05-22 08:49:41.0  1327  1323    seashell  2007-05-22 11:39:47.0  0           1.26E+09  10206
>
>
> I figured that this data was too wrong to be the same job & sure enough
> job 1339 matches the qacct for 1327! (apart from vmem!)
>
> mem         start             id    ju_cpu  hostname  end time          exit state  max vmem  wallclock
> 564.852292  2007/05/23 08:20  1339  913     seashell  2007/05/23 11:00  0           1.13E+09  9580
>   
>
> What is going on? Has my entire SQL database been corrupted?
> How can I verify this and, more importantly, fix it so that qacct and ARCo
> match?
>
> Kind Regards
>
> John
>
>
>
>
>  
>
> -----Original Message-----
> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
> Sent: 21 May 2007 18:50
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Arco tool results differ from qacct
>
> Hi Chansup,
>
> Hmm, it does not look like that. The table sge_job_usage has the fields
> ju_failed and ju_exit_status; jobs with an exit status other than 0 are
> recorded, and view_accounting does not filter them out.
>
> Jana
>
> Chansup Byun wrote:
>   
>> Hi Jana,
>>
>> I could be wrong, but if I remember correctly the sge_job_usage table 
>> in ARCo only stores jobs that completed successfully.
>> However, qacct also reports jobs that failed with errors.
>>
>> Regards,
>>
>> - Chansup
>>
>> Jana Olivova wrote:
>>     
>>> Hi,
>>>
>>> I don't see anything wrong with the query. You can also use the 
>>> predefined Accounting per Department query, which does the same.
>>>
>>> I checked my setup with a MySQL database and I get the same results 
>>> with both ARCo and qacct. I don't have any sensible data in my 
>>> Postgres db, because I was using the same grid with 3 different 
>>> databases. So the only month I can compare is this one:
>>>
>>> qacct -b 200705010000 -e 200705312359
>>> Total System Usage
>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>> ================================================================================================================
>>>        889909             2            36           415              0.275              0.000              0.000
>>>
>>> ARCo Accounting per Department
>>>
>>> 2007-05-01
>>> department            cpu            mem                   io
>>> defaultdepartment     415.155821     0.275125999999997     0.0
>>>
>>>
>>> One explanation for this, of course, would be that the same 
>>> database is used for more than one grid and/or (for February) that 
>>> reporting was not enabled the whole time. Not sure if that is a likely 
>>> scenario for you.
>>>
>>> Regards,
>>>
>>> Jana
>>>
>>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>>       
>>>> Hi Jana/Daniel
>>>>
>>>> In this case I use database :sge_job_usage, but I have also used the
>>>>         
>
>   
>>>> accounting database.
>>>> qacct groups jobs according to the jobs start time? I've done the 
>>>> same for the SQL query.
>>>> So this SQL SHOULD TOTAL UP THE MEMORY GBS for all the jobs started 
>>>> within each month.
>>>>
>>>>
>>>> SQL:
>>>> SELECT date_trunc('month', ju_start_time) AS month,
>>>>        SUM (ju_mem) AS "mem "
>>>>   FROM sge_job_usage
>>>>  WHERE ju_start_time > (current_timestamp - interval '1 year')
>>>>  GROUP BY month
>>>>  ORDER BY month;
>>>>
>>>> resulting table
>>>> month                  mem
>>>> 2007-02-01 00:00:00.0  532138.750717
>>>> 2007-03-01 00:00:00.0  5274933.144317
>>>> 2007-04-01 00:00:00.0  6884688.555405
>>>> 2007-05-01 00:00:00.0  2789895.540273
>>>>
>>>> Here are the results from the qacct command. 
>>>> Compare the MEMORY column to the table above.
>>>> The results differ by a significant amount. A query on ju_cpu 
>>>> results in a similar discrepancy.
>>>> qacct:
>>>> johnick at seasub1[~]# qacct -b 200702010000 -e 200702312359
>>>> Total System Usage
>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>> ================================================================================================================
>>>>       2433584        289462        131581        854446         567582.583              0.000              0.000
>>>> johnick at seasub1[~]# qacct -b 200703010000 -e 200703312359
>>>> Total System Usage
>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>> ================================================================================================================
>>>>       4753132       1041297         53389       2957120        3923641.991              0.000              0.000
>>>> johnick at seasub1[~]# qacct -b 200704010000 -e 200704312359
>>>> Total System Usage
>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>> ================================================================================================================
>>>>       6118415       2063020        140069       4094226        5743492.079              0.000              0.000
>>>> johnick at seasub1[~]# qacct -b 200705010000 -e 200705312359
>>>> Total System Usage
>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>> ================================================================================================================
>>>>       2746486        983188        156462       1761848        2388992.294              0.000              0.000
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
>>>> Sent: 18 May 2007 18:58
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Arco tool results differ from qacct
>>>>
>>>> I have a problem replicating the issue, though. I keep running jobs 
>>>> (using Maintrunk GE) and the numbers keep matching.
>>>>
>>>> Jana
>>>>
>>>> Daniel Templeton wrote:
>>>>> It may be worth noting that qacct and ARCo use different source 
>>>>> data files.  qacct uses the accounting file, and ARCo uses the 
>>>>> reporting file.  It is not inconceivable that there could be an 
>>>>> issue such that the qmaster might write different data to the two 
>>>>> files in some cases.
>>>>>
>>>>> Just a thought.
>>>>>
>>>>> Daniel
>>>>>
>>>>> Jana Olivova wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> I could check on the Arco side. I have checked my data and they 
>>>>>> are both the same, except for the rounding that appears in qacct. 
>>>>>> I do have, however, a very small sample of data. Frankly, I am not 
>>>>>> sure what would cause this. Arco only inserts the data that is 
>>>>>> given to it by the qmaster, in the reporting file.
>>>>>>
>>>>>> Can you tell me what SQL query you used to obtain the data in 
>>>>>> ARCo and what database you are using?
>>>>>>
>>>>>> Jana Olivova
>>>>>>
>>>>>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>>>>>> Hi All
>>>>>>>
>>>>>>> I am basically having the same problem that Todd Heywood had 
>>>>>>> earlier in the year.
>>>>>>> He gave up on the Arco tool in the end; I hope I haven't got to do 
>>>>>>> the same.
>>>>>>>
>>>>>>>
>>>>>>> Heywood, Todd wrote:
>>>>>>>> How does ARCo report time and memory? I assumed it would be the same as
>>>>>>>> for qacct, for which it is seconds and Gbytes (according to "man
>>>>>>>> accounting"). But qacct and ARCo are reporting different numbers. Unit
>>>>>>>> conversions don't account for the diffs
>>>>>>>
>>>>>>> The Arco Tool produces nice graphs and the SQL works fine, but 
>>>>>>> when I compare it to the output of QACCT, it is a completely 
>>>>>>> different set of results.
>>>>>>>
>>>>>>> There is some correlation between the data. For example, April's 
>>>>>>> usage is the highest in both sets of results & the users with the 
>>>>>>> most usage also correspond in both sets of data.
>>>>>>> But the actual data seems to be randomly out by something on the 
>>>>>>> order of 20-30%.
>>>>>>> I'm specifically trying to extract grid jobs' memory (gigabyte 
>>>>>>> seconds) per month.
>>>>>>> For example, the data for April:
>>>>>>> qacct -b 200704010000 -e 200704312359 MEMORY 5743492.079
>>>>>>>
>>>>>>> But the output in ARCo gives:
>>>>>>> 6324866.240448
>>>>>>>
>>>>>>> Is this a bug in ARCO/GRID ?
>>>>>>> What would cause this behaviour?
>>>>>>>
>>>>>>> The only strange thing I've noticed is that I have 2 dbwriter 
>>>>>>> processes instead of 1 & 5 postmaster instead of 3.
>>>>>>>
>>>>>>> sgeadm 1430 1422 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>>> sgeadm 1422 1 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>>> postgres 1402 1401 0 May 10 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>> postgres 1403 1402 0 May 10 ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>> postgres 1401 1 0 May 10 ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>> postgres 13303 1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>> postgres 9719 1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>
>>>>>>> If you've any ideas please get back to me & I'll give you more 
>>>>>>> detailed info.
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> John
>>>>>>> John Mc Nicholas
>>>>>>> STE/SEA Support Engineer
>>>>>>> BETE Test Plants UK
>>>>>>>
>>>>>>> Phone: +44 (0) 1483 305458
>>>>>>> Email: john.xj.mc-nicholas at ericsson.com
>>>>>>> Address: Ericsson, Midleton Gate, Guildford Business Park, 
>>>>>>> Guildford, Surrey, GU2 8SG, UK
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>   
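
To put numbers on the monthly discrepancy discussed in the thread, the relative difference between the ARCo SQL totals and the qacct MEMORY totals can be worked out directly. The following is a small Python sketch that uses only the figures quoted above. The variable and function names are illustrative, and the comparison assumes both sets of totals cover the same calendar months.

```python
# Monthly memory totals from the SQL query on sge_job_usage, as quoted above.
arco_mem = {
    "2007-02": 532138.750717,
    "2007-03": 5274933.144317,
    "2007-04": 6884688.555405,
    "2007-05": 2789895.540273,
}

# MEMORY column from the monthly "qacct -b ... -e ..." runs, as quoted above.
qacct_mem = {
    "2007-02": 567582.583,
    "2007-03": 3923641.991,
    "2007-04": 5743492.079,
    "2007-05": 2388992.294,
}


def relative_diff_percent(arco, qacct):
    """Return (arco - qacct) / qacct in percent for every month in qacct."""
    return {month: 100.0 * (arco[month] - qacct[month]) / qacct[month]
            for month in qacct}


diffs = relative_diff_percent(arco_mem, qacct_mem)
for month in sorted(diffs):
    print(f"{month}  {diffs[month]:+.1f}%")
```

For these figures the differences come out at roughly -6%, +34%, +20% and +17% for February to May. February is the only month in which the ARCo total is lower than the qacct total.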


