[GE users] User Time + System Time != Wall Clock Time

Azhar Ali Shah aas_lakyari at yahoo.com
Sun Apr 13 19:57:58 BST 2008



But it gives correct timings when a batch job is run without a PE?


"Mulley, Nikhil" <Nikhil.Mulley at deshaw.com> wrote:

I believe this again has to do with the implementation of getrusage on Linux?

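As background for the getrusage remark, here is a minimal Python sketch of the general POSIX behaviour: RUSAGE_CHILDREN only accumulates CPU time of child processes that the caller has waited for. It does not show how SGE itself gathers accounting data; the helper names (children_cpu_seconds, cpu_charged_for) and the CPU-burning command are illustrative assumptions.

```python
import resource
import subprocess
import sys


def children_cpu_seconds():
    """User + system CPU seconds of all children this process has waited for."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime


def cpu_charged_for(cmd):
    """Run cmd, wait for it, and return the CPU seconds it added to our children total."""
    before = children_cpu_seconds()
    subprocess.run(cmd, check=True)  # run() waits, so the child's usage is collected
    return children_cpu_seconds() - before


# Hypothetical workload: a short CPU-bound Python one-liner.
burn = [sys.executable, "-c", "sum(i * i for i in range(3_000_000))"]
print(f"CPU seconds charged to the parent: {cpu_charged_for(burn):.2f}")
```

A process that is started outside the waiting parent's process tree never shows up in this total, which is one way a job report can end up with near-zero User and System time.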
       
---------------------------------
From: Azhar Ali Shah [mailto:aas_lakyari at yahoo.com]
Sent: Sunday, April 13, 2008 10:07 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] User Time + System Time != Wall Clock Time

Hi, 
I am using rsh with the daemon-based smpd (mpich2-1.0.7rc2) startup method. The output of "ps -e f" is:

 5769     1  5768 /usr/SGE6/bin/lx24-x86/sge_qmaster
 5789     1  5789 /usr/SGE6/bin/lx24-x86/sge_schedd
 6337     1  6337 /usr/SGE6/bin/lx24-x86/sge_execd
25736  6337 25736  \_ sge_shepherd-18 -bg
25837 25736 25837  |   \_ -sh /usr/SGE6/default/spool/taramel/job_scripts/18
25915 25837 25837  |       \_ mpiexec -n 4 -machinefile /tmp/18.1.all.q/machines
25806  6337 25806  \_ sge_shepherd-18 -bg
25807 25806 25807      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
25813 25807 25813          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/
25815 25813 25815              \_ /home/aas/local/mpich2_smpd/bin/smpd -port 200
25916 25815 25815                  \_ /home/aas/local/mpich2_smpd/bin/smpd -port
25917 25916 25815                      \_ /home/aas/par_procksi_Alone
26641 25917 25815                      |   \_ ./fast /home/aas/workspace/AzharPe
25918 25916 25815                      \_ /home/aas/par_procksi_Alone
26640 25918 25815                          \_ ./fast /home/aas/workspace/AzharPe
...
25772     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit taramel /home/aas/local/m
25808 25772 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 57419 taramel.cs.nott.ac
25814 25808 25737      \_ [rsh] <defunct>
25774     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit smeg /home/aas/local/mpic
25817 25774 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33059 smeg.cs.nott.ac.uk
25818 25817 25737      \_ [rsh] <defunct>
25777     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit eomer /home/aas/local/mpi
25819 25777 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33207 eomer.cs.nott.ac.u
25820 25819 25737      \_ [rsh] <defunct>

But I still don't get any values for the User Time and System Time parameters:

Job 19 (mpich2.sh) Complete
 User             = aas
 Queue            = all.q at xxx
 Host             = taramel.cs.nott.ac.uk
 Start Time       = 04/13/2008 16:19:00
 End Time         = 04/13/2008 17:22:06
 User Time        = 00:00:00
 System Time      = 00:00:00
 Wallclock Time   = 01:03:06
 CPU              = 00:00:00
 Max vmem         = 10.074M
 Exit Status      = 0

Any ideas on how to change this behavior?

Thanks,
Azhar



Reuti <reuti at staff.uni-marburg.de> wrote:

Hi,

On 03.04.2008 at 12:24, Azhar Ali Shah wrote:
> Running a parallel job with MPICH2-1.0.7 + SGE demanding 4
> processors on my cluster gives the following statistics:
>
> Job 152 (DS1001-4P) Complete
>  User             = aas
>  Queue            = all.q at xxxx
>  Host             = smeg.cs.nott.ac.uk
>  Start Time       = 04/02/2008 20:07:37
>  End Time         = 04/03/2008 00:09:55
>  User Time        = 00:00:18
>  System Time      = 00:00:04
>  Wallclock Time   = 04:02:18
>  CPU              = 00:00:22
>  Max vmem         = 8.551M
>  Exit Status      = 0
>
> I wonder why the user time and system time are so small compared
> to the wall clock time. Before this, I ran the same task with the same
> data as a sequential job on a single machine, which gave the following
> statistics:
>
> Job 35 (batchjob.sh) Complete
>  User             = aas
>  Queue            = all.q at xxxx
>  Host             = smeg.cs.nott.ac.uk
>  Start Time       = 03/06/2008 17:01:34
>  End Time         = 03/08/2008 04:50:20
>  User Time        = 1:01:18:28
>  System Time      = 06:07:43
>  Wallclock Time   = 1:11:48:46
>  CPU              = 1:07:26:11
>  Max vmem         = 398.684M
>  Exit Status      = 0
>
> With the number of processors being 4 in the parallel job, I can
> accept the wall clock time as true, but I can't understand the values
> of User and System time in the parallel version above. Any thoughts?

These are the typical symptoms when your application is not tightly
integrated into SGE. Can you check with "ps -e f" that a) you are
using SGE's rsh command and b) all child processes are bound to the
sge_execd? Using the plain system /usr/bin/rsh or ssh will otherwise
lead to such behavior. If you need ssh, you have to recompile SGE on
your own to get a custom-built ssh including the tight integration
facility.
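Check (b) can also be done mechanically. The following is a minimal Python sketch, assuming the process listing has the "PID PPID PGID COMMAND" layout shown elsewhere in this thread; the helper names (parse_ps, descends_from) are hypothetical and the sample lines are copied from that listing, truncated paths included.

```python
def parse_ps(text):
    """Map PID -> (PPID, command) from lines of the form 'PID PPID PGID COMMAND'."""
    table = {}
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # skip blank lines, '...' markers and anything malformed
        # Drop the forest-drawing characters ('|', '\_', spaces) in front of the command.
        command = fields[3].lstrip("|\\_ ")
        table[int(fields[0])] = (int(fields[1]), command)
    return table


def descends_from(table, pid, ancestor_name):
    """Walk up the PPID chain and report whether an ancestor's executable matches."""
    seen = set()
    while pid in table and pid not in seen:
        seen.add(pid)
        ppid, command = table[pid]
        words = command.split()
        if words and ancestor_name in words[0]:
            return True
        pid = ppid
    return False


SAMPLE = r"""
 6337     1  6337 /usr/SGE6/bin/lx24-x86/sge_execd
25806  6337 25806  \_ sge_shepherd-18 -bg
25807 25806 25807      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
25813 25807 25813          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/
25815 25813 25815              \_ /home/aas/local/mpich2_smpd/bin/smpd -port 200
25916 25815 25815                  \_ /home/aas/local/mpich2_smpd/bin/smpd -port
25917 25916 25815                      \_ /home/aas/par_procksi_Alone
25772     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit taramel /home/aas/local/m
25808 25772 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 57419 taramel.cs.nott.ac
"""

table = parse_ps(SAMPLE)
print(descends_from(table, 25917, "sge_execd"))  # compute process
print(descends_from(table, 25808, "sge_execd"))  # rsh client started by qrsh
```

This only answers whether a given PID sits below sge_execd in the process tree; it says nothing about whether the usage of that process is actually reported in the job's accounting record.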

(BTW: the wallclock time looks more like you used 8 cores IMO)
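The arithmetic behind these figures can be checked with a short Python sketch. It assumes the report durations use a "[D:]HH:MM:SS" layout, and dividing the sequential CPU time by the parallel wallclock time is only one plausible reading of the "8 cores" remark; parse_duration is a hypothetical helper.

```python
def parse_duration(text):
    """Convert '[D:]HH:MM:SS' (as printed in the job reports) into seconds."""
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 4:
        parts.insert(0, 0)  # pad missing leading fields (e.g. days) with zero
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Sequential run (job 35)
seq_user = parse_duration("1:01:18:28")
seq_sys = parse_duration("06:07:43")
seq_cpu = parse_duration("1:07:26:11")

# Parallel run (job 152)
par_wall = parse_duration("04:02:18")
par_cpu = parse_duration("00:00:22")

print(seq_user + seq_sys == seq_cpu)   # CPU is User Time + System Time: True
print(round(seq_cpu / par_wall, 2))    # apparent speedup: 7.78
print(round(par_cpu / par_wall, 4))    # CPU fraction reported for the parallel job: 0.0015
```

So the sequential report is internally consistent (User + System = CPU), while the parallel report accounts for well under 1% of its wallclock time as CPU, which fits the picture of child processes whose usage is not being collected.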

-- Reuti



