[GE users] Arco tool results differ from qacct

John Mc-Nicholas XJ (GU/ETL) john.xj.mc-nicholas at ericsson.com
Wed May 30 11:06:33 BST 2007


Hi Jana
 
 
Here is the SQL statement:

SELECT ju_mem AS "mem", ju_start_time AS "start", ju_id AS "id",
       ju_cpu AS "ju_cpu", ju_hostname AS "hostname",
       ju_end_time AS "end time", ju_exit_status AS "exit state",
       ju_maxvmem AS "max vmem", ju_ru_wallclock AS "wallclock"
FROM sge_job_usage
WHERE ju_id = '1327'
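
One thing that may be worth checking with a filter like this: as far as I understand the ARCo schema, ju_id is the database-generated key of the usage row, while the Grid Engine job number is stored in sge_job.j_job_number, with sge_job_usage.ju_parent pointing at sge_job.j_id. If that holds for this installation, `WHERE ju_id = '1327'` would select whichever usage row happens to have key 1327, not job 1327. The sketch below illustrates the difference with Python's sqlite3 and a cut-down toy copy of the two tables. The column set is an assumption, the cpu and wallclock values are the two ARCo rows reported in this thread, and the job number 1315 for the first row is made up purely for illustration.

```python
import sqlite3

# Toy, in-memory stand-in for the two ARCo tables (assumed, reduced columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sge_job (
        j_id         INTEGER PRIMARY KEY,  -- database row key
        j_job_number INTEGER               -- Grid Engine job number
    );
    CREATE TABLE sge_job_usage (
        ju_id           INTEGER PRIMARY KEY,               -- row key, NOT the job number
        ju_parent       INTEGER REFERENCES sge_job (j_id),
        ju_cpu          REAL,
        ju_ru_wallclock INTEGER
    );
    -- Job number 1315 is hypothetical; 1327 is the job discussed in the thread.
    INSERT INTO sge_job VALUES (1, 1315);
    INSERT INTO sge_job VALUES (2, 1327);
    INSERT INTO sge_job_usage VALUES (1327, 1, 1323, 10206);
    INSERT INTO sge_job_usage VALUES (1339, 2, 913, 9580);
    """
)


def usage_by_row_id(ju_id):
    """Filter on the usage row key, as the query in the mail does."""
    return conn.execute(
        "SELECT ju_id, ju_cpu, ju_ru_wallclock "
        "FROM sge_job_usage WHERE ju_id = ?",
        (ju_id,),
    ).fetchall()


def usage_by_job_number(job_number):
    """Filter on the job number by joining through sge_job."""
    return conn.execute(
        "SELECT u.ju_id, u.ju_cpu, u.ju_ru_wallclock "
        "FROM sge_job_usage u JOIN sge_job j ON u.ju_parent = j.j_id "
        "WHERE j.j_job_number = ?",
        (job_number,),
    ).fetchall()


print(usage_by_row_id(1327))      # the row whose key is 1327
print(usage_by_job_number(1327))  # the row belonging to job number 1327
```

If the real schema matches this assumption, the same join against the PostgreSQL database should return the row that agrees with `qacct -j 1327`.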
 
 
 
Thanks
 
John
 
 

________________________________

From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
Sent: 25 May 2007 11:00
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Arco tool results differ from qacct


Hi John,

Can you also send the exact SQL statement that you used to retrieve this
information from the database?

Thanks

Jana

John Mc-Nicholas XJ (GU/ETL) wrote: 

	Hi Jana/Chansup/Daniel
	
	Thanks for your help so far on this issue.
	There seems to be something very weird going on!
	I started comparing individual jobs in the ARCo tool and qacct.
	
	I found a job that gets two very different sets of data depending on
	where you look! Even the date of the job start/end is different! It
	turns out that job 1327 in qacct corresponds to job 1339 in ARCo!
	
	
	In qacct:
	johnick at seasub1[~]# qacct -j 1327
	==============================================================
	qname        seashell.q          
	hostname     seashell            
	group        staff               
	owner        etlelby             
	project      NONE                
	department   ELS                 
	jobname      startdelhir11aca5   
	jobnumber    1327                
	taskid       undefined
	account      sge                 
	priority     0                   
	qsub_time    Wed May 23 08:20:17 2007
	start_time   Wed May 23 08:20:49 2007
	end_time     Wed May 23 11:00:29 2007
	granted_pe   NONE                
	slots        1                   
	failed       0    
	exit_status  0                   
	ru_wallclock 9580         
	ru_utime     854          
	ru_stime     59           
	cpu          913          
	mem          564.852           
	io           0.000             
	iow          0.000             
	maxvmem      1.054G
	
	On Arco:
	
	mem         start                  id    ju_cpu  hostname  end time               exit state  max vmem  wallclock
	736.666737  2007-05-22 08:49:41.0  1327  1323    seashell  2007-05-22 11:39:47.0  0           1.26E+09  10206
	
	
	I figured that this data was too wrong to be the same job & sure
	enough job 1339 matches the qacct for 1327! (apart from vmem!)
	
	mem         start             id    ju_cpu  hostname  end time          exit state  max vmem  wallclock
	564.852292  2007/05/23 08:20  1339  913     seashell  2007/05/23 11:00  0           1.13E+09  9580
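
On the vmem mismatch: the two figures may in fact agree once units are taken into account. The sketch below assumes that qacct's "G" suffix means 2^30 bytes and that ARCo's "max vmem" column is in plain bytes. Both are assumptions about the tools, not something confirmed in this thread.

```python
GIB = 1024 ** 3  # bytes per GiB (assumed meaning of qacct's "G" suffix)


def bytes_to_gib(n_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


qacct_maxvmem_gib = 1.054    # "maxvmem 1.054G" from qacct -j 1327
arco_maxvmem_bytes = 1.13e9  # "max vmem 1.13E+09" from the ARCo row with id 1339

arco_maxvmem_gib = bytes_to_gib(arco_maxvmem_bytes)
print(f"ARCo:  {arco_maxvmem_gib:.3f} GiB")
print(f"qacct: {qacct_maxvmem_gib:.3f} GiB")
```

Under those assumptions the remaining gap is within what the three-significant-digit ARCo display (1.13E+09) would allow.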
	  
	
	What is going on? Has my entire SQL database been corrupted?
	How can I verify this and, more importantly, fix it so that qacct
	and ARCo match?
	
	Kind Regards
	
	John
	
	
	
	
	 
	
	-----Original Message-----
	From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM] 
	Sent: 21 May 2007 18:50
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] Arco tool results differ from qacct
	
	Hi Chansup,
	
	Hmm, it does not look like that. The table sge_job_usage has the
	fields ju_failed and ju_exit_status, jobs with an exit status other
	than 0 are recorded, and view_accounting does not filter those out.
	
	Jana
	
	Chansup Byun wrote:
	  

		Hi Jana,
		
		I could be wrong, but if I remember correctly the sge_job_usage
		table in ARCo only stores jobs that completed successfully.
		However, qacct also reports jobs that failed with errors.
		
		Regards,
		
		- Chansup
		
		Jana Olivova wrote:
		    

			Hi,
			
			I don't see anything wrong with the query. You can also use the
			predefined Accounting per Department query, which does the same.
			
			I checked my setup with a MySQL database and I get the same
			results with both ARCo and qacct. I don't have any sensible data
			in my Postgres db, because I was using the same grid with 3
			different databases. So the only month I can compare is this one:
			
			qacct -b 200705010000 -e 200705312359
			Total System Usage
			    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
			================================================================================================================
			       889909             2            36           415              0.275              0.000              0.000
			
			ARCo Accounting per Department
			
			2007-05-01
			                   cpu         mem                io
			defaultdepartment  415.155821  0.275125999999997  0.0
			
			
			The one explanation for this, of course, would be if the same
			database is used for more than one grid and/or (for February)
			that reporting was not enabled the whole time. Not sure if that
			is a likely scenario for you.
			
			Regards,
			
			Jana
			
			John Mc-Nicholas XJ (GU/ETL) wrote:
			      

				Hi Jana/Daniel
				
				In this case I use the sge_job_usage table, but I have also used
				the accounting database.
				qacct groups jobs according to the job's start time? I've done
				the same for the SQL query.
				So this SQL should total up the memory (GB-seconds) for all the
				jobs started within each month.
				
				
				SQL:
				SELECT date_trunc('month', ju_start_time) AS month,
				       SUM (ju_mem) AS "mem "
				FROM sge_job_usage
				WHERE ju_start_time > (current_timestamp - interval '1 year')
				GROUP BY month ORDER BY month;
				
				Resulting table:
				month                  mem
				2007-02-01 00:00:00.0  532138.750717
				2007-03-01 00:00:00.0  5274933.144317
				2007-04-01 00:00:00.0  6884688.555405
				2007-05-01 00:00:00.0  2789895.540273
				
				Here are the results from the qacct command. Compare the MEMORY
				column to the table above.
				The results differ by a significant amount. A query on ju_cpu
				results in a similar discrepancy.
				qacct
				
				johnick at seasub1[~]# qacct -b 200702010000 -e 200702312359
				Total System Usage
				    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
				================================================================================================================
				      2433584        289462        131581        854446         567582.583              0.000              0.000
				
				johnick at seasub1[~]# qacct -b 200703010000 -e 200703312359
				Total System Usage
				    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
				================================================================================================================
				      4753132       1041297         53389       2957120        3923641.991              0.000              0.000
				
				johnick at seasub1[~]# qacct -b 200704010000 -e 200704312359
				Total System Usage
				    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
				================================================================================================================
				      6118415       2063020        140069       4094226        5743492.079              0.000              0.000
				
				johnick at seasub1[~]# qacct -b 200705010000 -e 200705312359
				Total System Usage
				    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
				================================================================================================================
				      2746486        983188        156462       1761848        2388992.294              0.000              0.000
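
To put a number on "a significant amount", the monthly figures can be compared directly. The following is only a sketch: the values are copied from the SQL result and the qacct MEMORY column quoted in this message, and the helper name relative_diff_percent is made up for illustration.

```python
# Monthly memory totals (GB-seconds) as reported in this message.
arco_mem = {
    "2007-02": 532138.750717,
    "2007-03": 5274933.144317,
    "2007-04": 6884688.555405,
    "2007-05": 2789895.540273,
}
qacct_mem = {
    "2007-02": 567582.583,
    "2007-03": 3923641.991,
    "2007-04": 5743492.079,
    "2007-05": 2388992.294,
}


def relative_diff_percent(arco: float, qacct: float) -> float:
    """Difference of the ARCo value from the qacct value, in percent of qacct."""
    return (arco - qacct) / qacct * 100.0


diffs = {m: relative_diff_percent(arco_mem[m], qacct_mem[m]) for m in arco_mem}

for month in sorted(diffs):
    print(f"{month}: ARCo is {diffs[month]:+.1f}% relative to qacct")
```

The gap varies from month to month and even changes sign for February, so a single constant factor such as a unit conversion would not seem to explain it.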
				
				
				
				-----Original Message-----
				From: Jana.Olivova at Sun.COM [mailto:Jana.Olivova at Sun.COM]
				Sent: 18 May 2007 18:58
				To: users at gridengine.sunsource.net
				Subject: Re: [GE users] Arco tool results differ from qacct
				
				I am having trouble replicating the issue, though. I keep
				running jobs (using Maintrunk GE) and the numbers keep matching.
				
				Jana
				
				Daniel Templeton wrote:
				 
				        

				It may be worth noting that qacct and ARCo use different source
				data files.  qacct uses the accounting file, and ARCo uses the
				reporting file.  It is not inconceivable that there could be an
				issue such that the qmaster might write different data to the
				two files in some cases.
				
				Just a thought.
				
				Daniel
				
				Jana Olivova wrote:
				   
				          

				Hi John,
				
				I could check on the ARCo side. I have checked my data and they
				are both the same, except for the rounding that appears in
				qacct. I do have, however, a very small sample of data. Frankly,
				I am not sure what would cause this. ARCo only inserts the data
				that is given to it by the qmaster, in the reporting file.
				
				Can you tell me what SQL query you used to obtain the data in
				ARCo and what database you are using?
				
				Jana Olivova
				
				John Mc-Nicholas XJ (GU/ETL) wrote:
				     
				            

				Hi All
				
				I am basically having the same problem that Todd Heywood had
				earlier in the year.
				He gave up on the ARCo tool in the end; I hope I won't have to
				do the same.
				
				       
				              

				Heywood, Todd wrote:
				> How does ARCo report time and memory? I assumed it would be
				> the same as for qacct, for which it is seconds and Gbytes
				> (according to "man accounting"). But qacct and ARCo are
				> reporting different numbers. Unit conversions don't account
				> for the diffs
				
				The ARCo tool produces nice graphs and the SQL works fine, but
				when I compare it to the output of qacct, it is a completely
				different set of results.
				
				There is some correlation between the data. For example,
				April's usage is the highest in both sets of results, and the
				users with the most usage also correspond in both sets of data.
				But the actual data seems to be randomly out by something on
				the order of 20-30%.
	  

				I'm specifically trying to extract grid jobs' memory (gigabyte
				seconds) per month.
				For example, the data for April:
				qacct -b 200704010000 -e 200704312359
				MEMORY 5743492.079
				
				But the output in ARCo gives:
				6324866.240448
				
				Is this a bug in ARCo/Grid Engine?
				What would cause this behaviour?
				
				The only strange thing I've noticed is that I have 2 dbwriter
				processes instead of 1 & 5 postmaster processes instead of 3.
				
				
				sgeadm   1430  1422 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
				sgeadm   1422     1 0 May 10   ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
				postgres 1402  1401 0 May 10   ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres 1403  1402 0 May 10   ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres 1401     1 0 May 10   ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres 13303 1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
				postgres 9719  1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
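
Whether those process counts point to a problem depends on how the processes are related. Assuming the listing above is `ps -ef` style output with the columns UID, PID, PPID first (an assumption, since the header line is not shown), a small sketch can separate top-level processes from children of other listed processes:

```python
# Process listing from this message, one process per line.
PS_LINES = """\
sgeadm 1430 1422 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
sgeadm 1422 1 0 May 10 ? 0:00 /bin/sh /grid/dbwriter/util/dbwriter.sh
postgres 1402 1401 0 May 10 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
postgres 1403 1402 0 May 10 ? 0:01 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
postgres 1401 1 0 May 10 ? 0:04 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
postgres 13303 1401 0 16:29:34 ? 0:00 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
postgres 9719 1401 0 14:31:33 ? 0:20 /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/database -S
""".splitlines()


def parse(line):
    """Return (uid, pid, ppid), assuming UID PID PPID are the first columns."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


procs = [parse(line) for line in PS_LINES]
listed_pids = {pid for _, pid, _ in procs}

# A "root" here is a process whose parent is not itself in the listing.
roots = [(uid, pid) for uid, pid, ppid in procs if ppid not in listed_pids]
children = [(uid, pid, ppid) for uid, pid, ppid in procs if ppid in listed_pids]

print("top-level:", roots)
print("children: ", children)
```

Read this way, only PID 1422 (dbwriter.sh) and PID 1401 (postmaster) are top-level; the remaining entries are descendants of those two, which may simply be normal forking rather than duplicate daemons.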
				
				If you've any ideas please get back to me & I'll give you more
				detailed info.
				
				Best Regards
				
				John
				John Mc Nicholas
				
				STE/SEA Support Engineer
				BETE Test Plants UK
				
				Phone: +44 (0) 1483 305458
				Email: john.xj.mc-nicholas at ericsson.com
				Address: Ericsson, Midleton Gate, Guildford Business Park,
				Guildford, Surrey, GU2 8SG, UK
				
				
				
				
				        
				              

	
				---------------------------------------------------------------------
				To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
				For additional commands, e-mail: users-help at gridengine.sunsource.net
				        
				            

	




