[GE users] Arco tool results differ from qacct

John Mc-Nicholas XJ (GU/ETL) john.xj.mc-nicholas at ericsson.com
Thu Jun 7 17:41:56 BST 2007


Hi Jana
qacct and ARCo correspond now.
Thanks for your help on this.
The 

________________________________

From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
Sent: 31 May 2007 09:24
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Arco tool results differ from qacct


Hi John,

The ju_id is not the job_id. The ju_id is a foreign key that links the
sge_job_usage table to the parent table sge_job where it is the j_id
primary key. The sge_job table contains the j_job_number field which is
the actual job number. 

You'll need to join the two tables to get the correct information from
both.

The query you need would look as follows:

SELECT ju_mem AS "mem", ju_start_time AS "start", ju_cpu AS "ju_cpu",
       ju_hostname AS "hostname", ju_end_time AS "end time",
       ju_exit_status AS "exit state", ju_maxvmem AS "max vmem",
       ju_ru_wallclock AS "wallclock", j_job_number AS "id"
FROM sge_job_usage
INNER JOIN sge_job ON (ju_id = j_id)
WHERE j_job_number = 1327
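
To illustrate why filtering on ju_id can return a different job than filtering on j_job_number, here is a minimal sketch in Python using an in-memory SQLite database. It follows the key relationship as described above (ju_id joined to j_id). The tables are cut down to a few columns, the usage values are taken from the outputs quoted later in this thread, and job number 1315 is a made-up placeholder. The real ARCo schema has many more columns and may differ by version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sge_job (j_id INTEGER PRIMARY KEY, j_job_number INTEGER);
    CREATE TABLE sge_job_usage (ju_id INTEGER, ju_cpu REAL, ju_ru_wallclock INTEGER);
    -- Internal key 1339 belongs to job number 1327 (as observed in this thread).
    -- Internal key 1327 belongs to some other job; 1315 is a hypothetical number.
    INSERT INTO sge_job VALUES (1339, 1327), (1327, 1315);
    INSERT INTO sge_job_usage VALUES (1339, 913, 9580), (1327, 1323, 10206);
""")


def usage_by_ju_id(ju_id):
    """Filter on the internal key, as the original query did."""
    return conn.execute(
        "SELECT ju_cpu, ju_ru_wallclock FROM sge_job_usage WHERE ju_id = ?",
        (ju_id,),
    ).fetchone()


def usage_by_job_number(job_number):
    """Join to sge_job and filter on the real job number."""
    return conn.execute(
        "SELECT ju_cpu, ju_ru_wallclock FROM sge_job_usage "
        "INNER JOIN sge_job ON (ju_id = j_id) WHERE j_job_number = ?",
        (job_number,),
    ).fetchone()


print("ju_id = 1327:       ", usage_by_ju_id(1327))
print("j_job_number = 1327:", usage_by_job_number(1327))
```

In this toy data the first lookup returns the usage of an unrelated job, while the second returns the row that agrees with qacct for job 1327.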

Regards,

Jana Olivova
 

John Mc-Nicholas XJ (GU/ETL) wrote: 

	Hi Jana
	 
	 
	Here is the SQL statement:

	SELECT ju_mem AS "mem", ju_start_time AS "start", ju_id AS "id",
	       ju_cpu AS "ju_cpu", ju_hostname AS "hostname",
	       ju_end_time AS "end time", ju_exit_status AS "exit state",
	       ju_maxvmem AS "max vmem", ju_ru_wallclock AS "wallclock"
	FROM sge_job_usage
	WHERE ju_id = '1327'
	 
	 
	 
	Thanks
	 
	John
	 
	 

________________________________

	From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
	Sent: 25 May 2007 11:00
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] Arco tool results differ from qacct
	
	
	Hi John,
	
	Can you also send the exact SQL statement that you used to retrieve
	this information from the database?
	
	Thanks
	
	Jana
	
	John Mc-Nicholas XJ (GU/ETL) wrote: 

		Hi Jana/Chansup/Daniel
		
		Thanks for your help so far on this issue.
		There seems to be something very weird going on!
		I started comparing individual jobs in the ARCo tool and qacct.
		
		I found a job that returns two very different sets of data depending
		on where you look! Even the job's start and end dates are different!
		It turns out that job 1327 in qacct corresponds to job 1339 in ARCo!
		
		
		In qacct:
		johnick at seasub1[~]# qacct -j 1327
	
		==============================================================
		qname        seashell.q          
		hostname     seashell            
		group        staff               
		owner        etlelby             
		project      NONE                
		department   ELS                 
		jobname      startdelhir11aca5   
		jobnumber    1327                
		taskid       undefined
		account      sge                 
		priority     0                   
		qsub_time    Wed May 23 08:20:17 2007
		start_time   Wed May 23 08:20:49 2007
		end_time     Wed May 23 11:00:29 2007
		granted_pe   NONE                
		slots        1                   
		failed       0    
		exit_status  0                   
		ru_wallclock 9580         
		ru_utime     854          
		ru_stime     59           
		cpu          913          
		mem          564.852           
		io           0.000             
		iow          0.000             
		maxvmem      1.054G
		
		On Arco:
		 
		 
		
		mem         start                  id    ju_cpu  hostname  end time               exit state  max vmem  wallclock
		736.666737  2007-05-22 08:49:41.0  1327  1323    seashell  2007-05-22 11:39:47.0  0           1.26E+09  10206
		
		
		I figured that this data was too far off to belong to the same job,
		and sure enough, job 1339 matches the qacct output for 1327 (apart
		from vmem)!
		
		mem         start             id    ju_cpu  hostname  end time          exit state  max vmem  wallclock
		564.852292  2007/05/23 08:20  1339  913     seashell  2007/05/23 11:00  0           1.13E+09  9580
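
The manual comparison above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the qacct text is an abridged copy of the output quoted above, and the two ARCo rows are typed in by hand from the tables rather than fetched from a database.

```python
import math

QACCT_OUTPUT = """\
==============================================================
qname        seashell.q
hostname     seashell
jobnumber    1327
start_time   Wed May 23 08:20:49 2007
ru_wallclock 9580
cpu          913
mem          564.852
maxvmem      1.054G
"""

# Rows as shown by ARCo, keyed by the "id" column (which was ju_id).
ARCO_ROWS = [
    {"id": 1327, "ju_cpu": 1323, "mem": 736.666737, "wallclock": 10206},
    {"id": 1339, "ju_cpu": 913, "mem": 564.852292, "wallclock": 9580},
]


def parse_qacct_job(text):
    """Turn `qacct -j` key/value lines into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Separator lines have a single token and are skipped.
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def same_job(qacct, arco_row, rel_tol=1e-3):
    """Compare cpu, mem and wallclock, allowing for qacct's rounding."""
    return (
        math.isclose(float(qacct["cpu"]), arco_row["ju_cpu"], rel_tol=rel_tol)
        and math.isclose(float(qacct["mem"]), arco_row["mem"], rel_tol=rel_tol)
        and int(qacct["ru_wallclock"]) == arco_row["wallclock"]
    )


job = parse_qacct_job(QACCT_OUTPUT)
matching_ids = [row["id"] for row in ARCO_ROWS if same_job(job, row)]
print("qacct job", job["jobnumber"], "matches ARCo id(s):", matching_ids)
```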
		  
		
		What is going on? Has my entire SQL database been corrupted?
		How can I verify this and, more importantly, fix it so that qacct
		and ARCo match?
		
		Kind Regards
		
		John
		
		
		
		
		 
		
		-----Original Message-----
		From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
		Sent: 21 May 2007 18:50
		To: users at gridengine.sunsource.net
		Subject: Re: [GE users] Arco tool results differ from qacct
		
		Hi Chansup,
		
		Hmm, it does not look like that. The sge_job_usage table has the
		fields ju_failed and ju_exit_status, so jobs with a non-zero exit
		status are recorded, and view_accounting does not filter them out.
		
		Jana
		
		Chansup Byun wrote:
		  

			Hi Jana,
			
			I could be wrong, but if I remember correctly, the sge_job_usage
			table in ARCo only stores jobs that completed successfully.
			However, qacct also includes jobs that failed with errors.
			
			Regards,
			
			- Chansup
			
			Jana Olivova wrote:
			    

				Hi,
				
				I don't see anything wrong with the query. You can also use the
				predefined Accounting per Department query, which does the same.
				
				I checked my setup with a MySQL database, and I get the same
				results with both ARCo and qacct. I don't have any sensible data
				in my Postgres db, because I was using the same grid with 3
				different databases. So the only month I can compare is this one:
				
				qacct -b 200705010000 -e 200705312359
				Total System Usage
				    WALLCLOCK      UTIME     STIME        CPU         MEMORY       IO      IOW
				==============================================================================
				       889909          2        36        415          0.275    0.000    0.000
				
				ARCo Accounting per Department
				
				2007-05-01
				                      cpu           mem                  io
				defaultdepartment     415.155821    0.275125999999997    0.0
				
				
				The one explanation for this, of course, would be if the same
				database is used for more grids and/or (for February) that
				reporting was not enabled the whole time. Not sure if that is a
				likely scenario for you.
				
				Regards,
				
				Jana
				
				John Mc-Nicholas XJ (GU/ETL) wrote:
				      

				Hi Jana/Daniel
				
				In this case I use sge_job_usage, but I have also used the
				accounting database.
				qacct groups jobs according to the job's start time? I've done
				the same in the SQL query.
				So this SQL should total up the memory (GB-seconds) for all the
				jobs started within each month.
				
				
				SQL:
				SELECT date_trunc('month', ju_start_time) AS month,
				       SUM(ju_mem) AS "mem "
				FROM sge_job_usage
				WHERE ju_start_time > (current_timestamp - interval '1 year')
				GROUP BY month
				ORDER BY month;
				
				Resulting table:
				month                  mem
				2007-02-01 00:00:00.0  532138.750717
				2007-03-01 00:00:00.0  5274933.144317
				2007-04-01 00:00:00.0  6884688.555405
				2007-05-01 00:00:00.0  2789895.540273
				
				Here are the results from the qacct command. Compare the MEMORY
				column to the table above. The results differ by a significant
				amount. A query on ju_cpu results in a similar discrepancy.
				qacct:
				
				johnick at seasub1[~]# qacct -b 200702010000 -e 200702312359
				Total System Usage
				    WALLCLOCK      UTIME     STIME        CPU         MEMORY       IO      IOW
				==============================================================================
				      2433584     289462    131581     854446     567582.583    0.000    0.000
				
				johnick at seasub1[~]# qacct -b 200703010000 -e 200703312359
				Total System Usage
				    WALLCLOCK      UTIME     STIME        CPU         MEMORY       IO      IOW
				==============================================================================
				      4753132    1041297     53389    2957120    3923641.991    0.000    0.000
				
				johnick at seasub1[~]# qacct -b 200704010000 -e 200704312359
				Total System Usage
				    WALLCLOCK      UTIME     STIME        CPU         MEMORY       IO      IOW
				==============================================================================
				      6118415    2063020    140069    4094226    5743492.079    0.000    0.000
				
				johnick at seasub1[~]# qacct -b 200705010000 -e 200705312359
				Total System Usage
				    WALLCLOCK      UTIME     STIME        CPU         MEMORY       IO      IOW
				==============================================================================
				      2746486     983188    156462    1761848    2388992.294    0.000    0.000
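
To put a number on the discrepancy, the monthly MEMORY totals quoted above can be compared directly. A minimal sketch in Python: the figures are copied from the SQL result table and the four qacct runs in this message, and the function name is hypothetical.

```python
# Monthly memory totals (GB-seconds) as quoted in this message.
arco_mem = {
    "2007-02": 532138.750717,
    "2007-03": 5274933.144317,
    "2007-04": 6884688.555405,
    "2007-05": 2789895.540273,
}
qacct_mem = {
    "2007-02": 567582.583,
    "2007-03": 3923641.991,
    "2007-04": 5743492.079,
    "2007-05": 2388992.294,
}


def relative_diff_percent(arco, qacct):
    """Percentage by which the ARCo total deviates from the qacct total."""
    return {
        month: 100.0 * (arco[month] - qacct[month]) / qacct[month]
        for month in sorted(arco)
    }


diffs = relative_diff_percent(arco_mem, qacct_mem)
for month, pct in diffs.items():
    print(f"{month}: ARCo differs from qacct by {pct:+.1f}%")
```

With these figures the deviation is not constant: ARCo is roughly 6% below qacct for February and between roughly 17% and 34% above it for the other three months, which argues against a simple unit conversion.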
				
				
				
				-----Original Message-----
				From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
				Sent: 18 May 2007 18:58
				To: users at gridengine.sunsource.net
				Subject: Re: [GE users] Arco tool results differ from qacct
				
				I am having trouble replicating the issue, though. I keep
				running jobs (using Maintrunk GE) and the numbers keep matching.
				
				Jana
				
				Daniel Templeton wrote:
				 
				        

				It may be worth noting that qacct and ARCo use different source
				data files. qacct uses the accounting file, and ARCo uses the
				reporting file. It is not inconceivable that there could be an
				issue such that the qmaster might write different data to the
				two files in some cases.
				
				Just a thought.
				
				Daniel
				
				Jana Olivova wrote:
				   
				          

				Hi John,
				
				I could check on the ARCo side. I have checked my data and they
				are both the same, except for the rounding that appears in
				qacct. I do have, however, a very small sample of data. Frankly,
				I am not sure what would cause this. ARCo only inserts the data
				that is given to it by the qmaster, in the reporting file.
				
				Can you tell me what SQL query you used to obtain the data in
				ARCo, and what database you are using?
				
				Jana Olivova
				
				John Mc-Nicholas XJ (GU/ETL) wrote:
				     
				            

				Hi All
				
				I am basically having the same problem that Todd Heywood had
				earlier in the year.
				He gave up on the ARCo tool in the end; I hope I don't have to
				do the same.
				
				       
				              

				Heywood, Todd wrote:
				> How does ARCo report time and memory? I assumed it would be
				> the same as for qacct, for which it is seconds and Gbytes
				> (according to "man accounting"). But qacct and ARCo are
				> reporting different numbers. Unit conversions don't account
				> for the diffs.
				
				The ARCo tool produces nice graphs and the SQL works fine, but
				when I compare it to the output of qacct, it is a completely
				different set of results.
				
				There is some correlation between the data. For example, April's
				usage is the highest in both sets of results, and the users with
				the most usage also correspond in both sets of data.
				But the actual data seems to be randomly off by something on the
				order of 20-30%.
		  

				I'm specifically trying to extract the grid jobs' memory usage
				(gigabyte-seconds) per month.
				For example, the data for April:
				qacct -b 200704010000 -e 200704312359
				MEMORY 5743492.079
				
				But the output in ARCo gives:
				6324866.240448
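
One detail in the command above: the end date uses day 31 although April has 30 days. Whether qacct rejects, clamps, or rolls over such a date is not established in this thread, so it may be worth ruling out by passing the real last day of the month. A minimal sketch in Python, assuming the -b/-e arguments take the CCYYMMDDhhmm form used above; the function name is hypothetical.

```python
import calendar


def qacct_month_range(year, month):
    """Return (begin, end) strings in CCYYMMDDhhmm form for one calendar month."""
    # monthrange() returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    begin = f"{year:04d}{month:02d}010000"
    end = f"{year:04d}{month:02d}{last_day:02d}2359"
    return begin, end


begin, end = qacct_month_range(2007, 4)
print(f"qacct -b {begin} -e {end}")
```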
				
				Is this a bug in ARCo/Grid Engine?
				What would cause this behaviour?
				
				The only strange thing I've noticed is that I have 2 dbwriter
				processes instead of 1, and 5 postmaster processes instead of 3.
				
				
				sgeadm    1430  1422 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
				sgeadm    1422     1 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
				postgres  1402  1401 0 May 10   ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres  1403  1402 0 May 10   ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres  1401     1 0 May 10   ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres 13303  1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres  9719  1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				
				If you've any ideas please get back to
me & I'll give you more 
				detailed info.
				
				Best Regards
				
				John
				John Mc Nicholas
				STE/SEA Support Engineer
				BETE Test Plants UK
				
				Phone: +44 (0) 1483 305458
				Email: john.xj.mc-nicholas at ericsson.com
				Address: Ericsson, Midleton Gate, Guildford Business Park,
				Guildford, Surrey, GU2 8SG, UK
				
				
				
				
				        
				              

	
				---------------------------------------------------------------------
				To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
				For additional commands, e-mail: users-help at gridengine.sunsource.net
				        
				            

	
		
		  





