[GE users] User Time + System Time != Wall Clock Time

Mulley, Nikhil Nikhil.Mulley at deshaw.com
Sun Apr 13 19:45:09 BST 2008


I believe this again has to do with the implementation of getrusage on Linux?
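
As a rough illustration (not necessarily the exact SGE code path): getrusage()-style
child accounting on Linux only charges CPU time of local children that the parent
has wait()ed for, so work done behind a plain rsh/ssh on another host never shows
up on the calling side. A quick way to see this ("otherhost" is just a placeholder):

  # GNU time reports only the local rsh client's CPU usage here,
  # even though the real work runs on the remote machine:
  /usr/bin/time -v rsh otherhost 'dd if=/dev/zero of=/dev/null bs=1M count=10000'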


________________________________

	From: Azhar Ali Shah [mailto:aas_lakyari at yahoo.com] 
	Sent: Sunday, April 13, 2008 10:07 PM
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] User Time + System Time != Wall Clock Time
	
	
	Hi, 
	I am using rsh with the daemon-based smpd (mpich2-1.0.7rc2) startup
method. "ps -e f" gives:
	
	 5769     1  5768 /usr/SGE6/bin/lx24-x86/sge_qmaster
	 5789     1  5789 /usr/SGE6/bin/lx24-x86/sge_schedd
	 6337     1  6337 /usr/SGE6/bin/lx24-x86/sge_execd
	25736  6337 25736  \_ sge_shepherd-18 -bg
	25837 25736 25837  |   \_ -sh /usr/SGE6/default/spool/taramel/job_scripts/18
	25915 25837 25837  |       \_ mpiexec -n 4 -machinefile /tmp/18.1.all.q/machines
	25806  6337 25806  \_ sge_shepherd-18 -bg
	25807 25806 25807      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
	25813 25807 25813          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/
	25815 25813 25815              \_ /home/aas/local/mpich2_smpd/bin/smpd -port 200
	25916 25815 25815                  \_ /home/aas/local/mpich2_smpd/bin/smpd -port
	25917 25916 25815                      \_ /home/aas/par_procksi_Alone
	26641 25917 25815                      |   \_ ./fast /home/aas/workspace/AzharPe
	25918 25916 25815                      \_ /home/aas/par_procksi_Alone
	26640 25918 25815                          \_ ./fast /home/aas/workspace/AzharPe
	...
	25772     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit taramel /home/aas/local/m
	25808 25772 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 57419 taramel.cs.nott.ac
	25814 25808 25737      \_ [rsh] <defunct>
	25774     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit smeg /home/aas/local/mpic
	25817 25774 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33059 smeg.cs.nott.ac.uk
	25818 25817 25737      \_ [rsh] <defunct>
	25777     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit eomer /home/aas/local/mpi
	25819 25777 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33207 eomer.cs.nott.ac.u
	25820 25819 25737      \_ [rsh] <defunct>
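	
	One sanity check, as a sketch (the PIDs below are taken from the listing
	above; the exact gid is site-specific): every process belonging to the
	job should carry the additional group id that SGE assigns from its
	gid_range, which the execd uses to attribute usage to the job:
	
	  grep ^Groups /proc/25736/status   # sge_shepherd-18
	  grep ^Groups /proc/26641/status   # one of the ./fast workers
	  # both should show the same extra gid; a worker missing it will not
	  # be charged to job 18's User/System time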
	
	but I still don't get any values for the User and System time
parameters:
	
	Job 19 (mpich2.sh) Complete
	 User             = aas
	 Queue            = all.q at taramel.cs.nott.ac.uk
	 Host             = taramel.cs.nott.ac.uk
	 Start Time       = 04/13/2008 16:19:00
	 End Time         = 04/13/2008 17:22:06
	 User Time        = 00:00:00
	 System Time      = 00:00:00
	 Wallclock Time   = 01:03:06
	 CPU              = 00:00:00
	 Max vmem         = 10.074M
	 Exit Status      = 0

	Any ideas on how to change this behavior?
	
	thanks
	Azhar
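	
	(One place to look, as a sketch: the parallel environment the job
	requests has to let SGE control the slave tasks. The PE name below is a
	placeholder; the attributes are standard SGE PE fields.)
	
	  qconf -sp mpich2_smpd
	  # for a tight integration one would expect, among others:
	  #   control_slaves     TRUE     # slaves are started via qrsh -inherit
	  #   job_is_first_task  FALSE    # the job script itself runs no MPI task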
	
	
	
	Reuti <reuti at staff.uni-marburg.de> wrote: 

		Hi,
		
		On 03.04.2008, at 12:24, Azhar Ali Shah wrote:
		> Running a parallel job with MPICH2-1.0.7 + SGE requesting 4
		> processors on my cluster gives the following statistics:
		>
		> Job 152 (DS1001-4P) Complete
		> User = aas
		> Queue = all.q at xxxx
		> Host = smeg.cs.nott.ac.uk
		> Start Time = 04/02/2008 20:07:37
		> End Time = 04/03/2008 00:09:55
		> User Time = 00:00:18
		> System Time = 00:00:04
		> Wallclock Time = 04:02:18
		> CPU = 00:00:22
		> Max vmem = 8.551M
		> Exit Status = 0
		>
		> I wonder why the user time and system time are so small
		> compared to the wall clock time. Before this, I ran the same
		> task with the same data as a sequential job on a single
		> machine, which gave the following statistics:
		>
		> Job 35 (batchjob.sh) Complete
		> User = aas
		> Queue = all.q at xxxx
		> Host = smeg.cs.nott.ac.uk
		> Start Time = 03/06/2008 17:01:34
		> End Time = 03/08/2008 04:50:20
		> User Time = 1:01:18:28
		> System Time = 06:07:43
		> Wallclock Time = 1:11:48:46
		> CPU = 1:07:26:11
		> Max vmem = 398.684M
		> Exit Status = 0
		>
		> With the number of processors being 4 in the parallel job, I
		> can assume the wall clock time to be correct, but I can't
		> understand the values of User and System time in the parallel
		> version above. Any thoughts?
		
		these are the typical symptoms when your application is not
		tightly integrated into SGE. Can you check with "ps -e f" that
		you are a) using SGE's rsh command, and b) all child processes
		are bound to the sge_execd? Using the plain system /usr/bin/rsh
		or ssh will otherwise lead to such behavior. If you need ssh,
		you have to recompile SGE on your own to get a custom-built ssh
		that includes the tight integration facility.
		
		(BTW: the wallclock time looks more like you used 8 cores IMO)
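		(The arithmetic behind that guess: the sequential run used
		1:07:26:11 of CPU, i.e. about 31.4 hours, while the parallel
		run's wallclock was 4:02:18, about 4.0 hours; 31.4 / 4.0 is
		roughly 7.8, hence closer to 8 cores than to 4.)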
		
		-- Reuti
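		
		A quick way to double-check both points, as a sketch (the values
		shown in the comments are examples, not necessarily this
		cluster's settings; <job_id> is a placeholder):
		
		  # a) the job should go through SGE's own rsh wrapper:
		  qconf -sconf | egrep 'rsh_command|rsh_daemon'
		  #   rsh_command   /usr/SGE6/utilbin/lx24-x86/rsh
		  #   rsh_daemon    /usr/SGE6/utilbin/lx24-x86/rshd -l
		  # b) the usage actually recorded per task for the job:
		  qacct -j <job_id>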
		
	
		---------------------------------------------------------------------
		To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
		For additional commands, e-mail: users-help at gridengine.sunsource.net
		
		





