[GE users] Arco tool results differ from qacct

Jana Olivova Jana.Olivova at Sun.COM
Thu May 31 09:23:52 BST 2007



Hi John,

The ju_id is not the job_id. The ju_id is a foreign key that links the 
sge_job_usage table to the parent table sge_job where it is the j_id 
primary key. The sge_job table contains the j_job_number field which is 
the actual job number.

You'll need to join the two tables to get the correct information from both.

The query you need would look as follows:

SELECT ju_mem AS "mem", ju_start_time AS "start", ju_cpu AS "ju_cpu",
       ju_hostname AS "hostname", ju_end_time AS "end time",
       ju_exit_status AS "exit state", ju_maxvmem AS "max vmem",
       ju_ru_wallclock AS "wallclock", j_job_number AS "id"
FROM sge_job_usage INNER JOIN sge_job ON (ju_id = j_id)
WHERE j_job_number = 1327
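
To make the difference between the two lookups concrete, here is a minimal Python sketch (not ARCo code) that uses an in-memory SQLite database. It assumes two cut-down stand-ins for sge_job and sge_job_usage with only a few columns. The j_id 1339, job number 1327 and the cpu/wallclock figures are taken from the output you posted; job number 1315 is invented purely for illustration.

```python
import sqlite3

# Hypothetical, heavily simplified versions of the two ARCo tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sge_job (j_id INTEGER PRIMARY KEY, j_job_number INTEGER);
    CREATE TABLE sge_job_usage (ju_id INTEGER, ju_cpu REAL, ju_ru_wallclock INTEGER);
    -- Row (1339, 1327) and the usage figures come from the thread;
    -- job number 1315 is made up for the example.
    INSERT INTO sge_job VALUES (1327, 1315), (1339, 1327);
    INSERT INTO sge_job_usage VALUES (1327, 1323, 10206), (1339, 913, 9580);
""")


def usage_by_ju_id(ju_id):
    """Filter on the key column directly, as the original query did."""
    return conn.execute(
        "SELECT ju_cpu, ju_ru_wallclock FROM sge_job_usage WHERE ju_id = ?",
        (ju_id,),
    ).fetchall()


def usage_by_job_number(job_number):
    """Join to sge_job and filter on the real job number."""
    return conn.execute(
        "SELECT ju_cpu, ju_ru_wallclock "
        "FROM sge_job_usage INNER JOIN sge_job ON (ju_id = j_id) "
        "WHERE j_job_number = ?",
        (job_number,),
    ).fetchall()


print(usage_by_ju_id(1327))       # usage of whichever job has key 1327
print(usage_by_job_number(1327))  # usage of the job that qacct calls 1327
```

In this toy data the first call returns the cpu 1323 / wallclock 10206 row, and the second returns the cpu 913 / wallclock 9580 row that matches qacct -j 1327.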

Regards,

Jana Olivova
 

John Mc-Nicholas XJ (GU/ETL) wrote:
> Hi Jana
>  
>  
> Here is the SQL STATEMENT:
> Sql:
> 	
> SELECT ju_mem AS "mem", ju_start_time AS "start", ju_id AS "id", 
> ju_cpu AS "ju_cpu", ju_hostname AS "hostname", ju_end_time AS "end 
> time", ju_exit_status AS "exit state", ju_maxvmem AS "max vmem", 
> ju_ru_wallclock AS "wallclock" FROM sge_job_usage WHERE ju_id = '1327'
>
>  
>  
>  
> Thanks
>  
> John
>  
>  
>
> ------------------------------------------------------------------------
> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
> Sent: 25 May 2007 11:00
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Arco tool results differ from qacct
>
> Hi John,
>
> Can you also send the exact SQL statement that you used to retrieve 
> this information from the database.
>
> Thanks
>
> Jana
>
> John Mc-Nicholas XJ (GU/ETL) wrote:
>> Hi Jana/Chansup/Daniel
>>
>> Thanks for your help so far on this issue.
>> There seems to be something very weird going on!
>> I started comparing individual jobs on the arco tool and qacct.
>>
>> I found a job that gets 2 very different sets of data depending on where
>> you look!
>> Even the date of job start/end is different! It turns out that job 1327
>> on qacct corresponds to job 1339 on ARCO!
>>
>>
>> In qacct:
>> johnick at seasub1[~]# qacct -j 1327
>> ==============================================================
>> qname        seashell.q          
>> hostname     seashell            
>> group        staff               
>> owner        etlelby             
>> project      NONE                
>> department   ELS                 
>> jobname      startdelhir11aca5   
>> jobnumber    1327                
>> taskid       undefined
>> account      sge                 
>> priority     0                   
>> qsub_time    Wed May 23 08:20:17 2007
>> start_time   Wed May 23 08:20:49 2007
>> end_time     Wed May 23 11:00:29 2007
>> granted_pe   NONE                
>> slots        1                   
>> failed       0    
>> exit_status  0                   
>> ru_wallclock 9580         
>> ru_utime     854          
>> ru_stime     59           
>> cpu          913          
>> mem          564.852           
>> io           0.000             
>> iow          0.000             
>> maxvmem      1.054G
>>
>> On Arco:
>>  
>>  
>>
>> mem         736.666737
>> start       2007-05-22 08:49:41.0
>> id          1327
>> ju_cpu      1323
>> hostname    seashell
>> end time    2007-05-22 11:39:47.0
>> exit state  0
>> max vmem    1.26E+09
>> wallclock   10206
>>
>>
>> I figured that this data was too wrong to be the same job & sure enough
>> job 1339 matches the qacct for 1327! (apart from vmem!)
>>
>> mem         564.852292
>> start       2007/05/23 08:20
>> id          1339
>> ju_cpu      913
>> hostname    seashell
>> end time    2007/05/23 11:00
>> exit state  0
>> max vmem    1.13E+09
>> wallclock   9580
>>   
>>
>> What is going on? Has my entire SQL database been corrupted?
>> How can I verify this and, more importantly, fix it so that qacct and
>> ARCo match?
>>
>> Kind Regards
>>
>> John
>>
>>
>>
>>
>>  
>>
>> -----Original Message-----
>> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
>> Sent: 21 May 2007 18:50
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Arco tool results differ from qacct
>>
>> Hi Chansup,
>>
>> Hmm, it does not look that way. The sge_job_usage table has the fields
>> ju_failed and ju_exit_status, jobs with a nonzero exit status are
>> recorded, and view_accounting does not filter them out.
>>
>> Jana
>>
>> Chansup Byun wrote:
>>   
>>> Hi Jana,
>>>
>>> I could be wrong, but if I remember correctly the sge_job_usage table 
>>> in ARCo only stores jobs that completed successfully.
>>> However, qacct also reports jobs that failed with errors.
>>>
>>> Regards,
>>>
>>> - Chansup
>>>
>>> Jana Olivova wrote:
>>>     
>>>> Hi,
>>>>
>>>> I don't see anything wrong with the query. You can also use the 
>>>> predefined Accounting per Department query, which does the same.
>>>>
>>>> I checked my setup with MySQL database and I get the same results 
>>>> with both ARCo and qacct. I don't have any sensible data in my 
>>>> Postgres db, because I was using the same grid with 3 different 
>>>> databases. So the only month I can compare is this one:
>>>>
>>>> qacct -b 200705010000 -e 200705312359
>>>> Total System Usage
>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>> ================================================================================================================
>>>>        889909             2            36           415              0.275              0.000              0.000
>>>>
>>>> ARCo Accounting per Department
>>>>
>>>> 2007-05-01
>>>> cpu     mem     io
>>>> defaultdepartment     415.155821     0.275125999999997     0.0
>>>>
>>>>
>>>> The one explanation for this, of course, would be if the same 
>>>> database is used for more grids and/or (for February) that reporting 
>>>> was not enabled the whole time. Not sure if that is a likely scenario
>>>> for you.
>>>>
>>>> Regards,
>>>>
>>>> Jana
>>>>
>>>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>>>       
>>>>> Hi Jana/Daniel
>>>>>
>>>>> In this case I use database :sge_job_usage, but I have also used the
>>>>> accounting database.
>>>>> qacct groups jobs according to the jobs start time? I've done the 
>>>>> same for the SQL query.
>>>>> So this SQL SHOULD TOTAL UP THE MEMORY GBS for all the jobs started 
>>>>> within each month.
>>>>>
>>>>>
>>>>> SQL:
>>>>> SELECT date_trunc('month', ju_start_time) AS month, SUM (ju_mem) AS "mem "
>>>>> FROM sge_job_usage
>>>>> WHERE ju_start_time > (current_timestamp - interval '1 year')
>>>>> GROUP BY month ORDER BY month;
>>>>>
>>>>> resulting table
>>>>> month                  mem
>>>>> 2007-02-01 00:00:00.0  532138.750717
>>>>> 2007-03-01 00:00:00.0  5274933.144317
>>>>> 2007-04-01 00:00:00.0  6884688.555405
>>>>> 2007-05-01 00:00:00.0  2789895.540273
>>>>>
>>>>> Here are the results from the qacct command. Compare the MEMORY
>>>>> column to the table above.
>>>>> The results differ by a significant amount. A query on ju_cpu 
>>>>> results in a similar discrepancy.
>>>>> qacct
>>>>> johnick at seasub1[~]# qacct -b 200702010000 -e 200702312359
>>>>> Total System Usage
>>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>>> ================================================================================================================
>>>>>       2433584        289462        131581        854446         567582.583              0.000              0.000
>>>>> johnick at seasub1[~]# qacct -b 200703010000 -e 200703312359
>>>>> Total System Usage
>>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>>> ================================================================================================================
>>>>>       4753132       1041297         53389       2957120        3923641.991              0.000              0.000
>>>>> johnick at seasub1[~]# qacct -b 200704010000 -e 200704312359
>>>>> Total System Usage
>>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>>> ================================================================================================================
>>>>>       6118415       2063020        140069       4094226        5743492.079              0.000              0.000
>>>>> johnick at seasub1[~]# qacct -b 200705010000 -e 200705312359
>>>>> Total System Usage
>>>>>     WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
>>>>> ================================================================================================================
>>>>>       2746486        983188        156462       1761848        2388992.294              0.000              0.000
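
To put a number on how far apart the two sets of monthly totals are, here is a small Python sketch. It only does arithmetic on the figures quoted above (the ARCo SUM(ju_mem) per month and the qacct MEMORY column); the dictionary and function names are made up for the example.

```python
# Monthly memory totals copied from the ARCo query result and the qacct output above.
arco_mem = {
    "2007-02": 532138.750717,
    "2007-03": 5274933.144317,
    "2007-04": 6884688.555405,
    "2007-05": 2789895.540273,
}
qacct_mem = {
    "2007-02": 567582.583,
    "2007-03": 3923641.991,
    "2007-04": 5743492.079,
    "2007-05": 2388992.294,
}


def relative_diff_percent(arco_value, qacct_value):
    """Difference of the ARCo total from the qacct total, in percent of qacct."""
    return (arco_value - qacct_value) / qacct_value * 100.0


diffs = {
    month: relative_diff_percent(arco_mem[month], qacct_mem[month])
    for month in arco_mem
}

for month in sorted(diffs):
    print(f"{month}: {diffs[month]:+.1f}%")
```

The gap is not a constant factor: ARCo is below qacct for February and above it for the other three months, so a unit conversion alone would not explain it.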
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
>>>>> Sent: 18 May 2007 18:58
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] Arco tool results differ from qacct
>>>>>
>>>>> I am having trouble replicating the issue, though. I keep running jobs 
>>>>> (using Maintrunk GE) and the numbers keep matching.
>>>>>
>>>>> Jana
>>>>>
>>>>> Daniel Templeton wrote:
>>>>>> It may be worth noting that qacct and ARCo use different source 
>>>>>> data files.  qacct uses the accounting file, and ARCo uses the 
>>>>>> reporting file.  It is not inconceivable that there could be an 
>>>>>> issue such that the qmaster might write different data to the two 
>>>>>> files in some cases.
>>>>>>
>>>>>> Just a thought.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> Jana Olivova wrote:
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I could check on the ARCo side. I have checked my data and they 
>>>>>>> are both the same, except for the rounding that appears in qacct.
>>>>>>> I do have, however, a very small sample of data. Frankly, I am not
>>>>>>> sure what would cause this. ARCo only inserts the data that is
>>>>>>> given to it by the qmaster, in the reporting file.
>>>>>>>
>>>>>>> Can you tell me what SQL query you used to obtain the data in 
>>>>>>> ARCo and what database you are using?
>>>>>>>
>>>>>>> Jana Olivova
>>>>>>>
>>>>>>> John Mc-Nicholas XJ (GU/ETL) wrote:
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> I am basically having the same problem that Todd Heywood had 
>>>>>>>> earlier in the year.
>>>>>>>> He gave up on the ARCo tool in the end; I hope I don't have to do 
>>>>>>>> the same.
>>>>>>>>
>>>>>>>> Heywood, Todd wrote:
>>>>>>>> > How does ARCo report time and memory? I assumed it would be the
>>>>>>>> > same as for qacct, for which it is seconds and Gbytes (according
>>>>>>>> > to "man accounting"). But qacct and ARCo are reporting different
>>>>>>>> > numbers. Unit conversions don't account for the diffs
>>>>>>>>
>>>>>>>> The ARCo tool produces nice graphs and the SQL works fine, but 
>>>>>>>> when I compare it to the output of qacct, it is a completely
>>>>>>>> different set of results.
>>>>>>>>
>>>>>>>> There is some correlation between the data. For example, April's 
>>>>>>>> usage is the highest in both sets of results, and the users with
>>>>>>>> the most usage also correspond in both sets of data.
>>>>>>>> But the actual data seems to be randomly out by around 20-30%.
>>>>>>>> I'm specifically trying to extract grid jobs memory (Gigabyte
>>>>>>>> seconds) per month
>>>>>>>> For example the data for April
>>>>>>>> qacct -b 200704010000 -e 200704312359 MEMORY 5743492.079
>>>>>>>>
>>>>>>>> But the output in arco gives.........
>>>>>>>> 6324866.240448
>>>>>>>>
>>>>>>>> Is this a bug in ARCO/GRID ?
>>>>>>>> What would cause this behaviour?
>>>>>>>>
>>>>>>>> The only strange thing I've noticed is that I have 2 dbwriter 
>>>>>>>> process instead of 1 & 5 postmaster instead of 3.
>>>>>>>>
>>>>>>>>
>>>>>>>> sgeadm    1430  1422 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>>>> sgeadm    1422     1 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
>>>>>>>> postgres  1402  1401 0 May 10   ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>> postgres  1403  1402 0 May 10   ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>> postgres  1401     1 0 May 10   ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>> postgres 13303  1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>> postgres  9719  1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
>>>>>>>>
>>>>>>>> If you've any ideas please get back to me & I'll give you more 
>>>>>>>> detailed info.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> John
>>>>>>>> John Mc Nicholas
>>>>>>>>
>>>>>>>> STE/SEA Support Engineer
>>>>>>>> BETE Test Plants UK
>>>>>>>>
>>>>>>>> Phone: +44 (0) 1483 305458
>>>>>>>> Email: john.xj.mc-nicholas at ericsson.com
>>>>>>>> Address: Ericsson, Midleton Gate, Guildford Business Park, 
>>>>>>>> Guildford, Surrey, GU2 8SG , UK
>>>>>>>>
>




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


