[GE users] User Time + System Time != Wall Clock Time

Reuti reuti at staff.uni-marburg.de
Tue Apr 15 22:19:16 BST 2008


Hey Azhar,

On 15.04.2008 at 22:12, Azhar Ali Shah wrote:
> Many thanks for all this help. Is there any way I can use the
> accounting script to overwrite the time values in the notification email?

Hmm, in principle: yes! You will need a custom mail wrapper which
first waits a few minutes, to be sure that the accounting record of
the last slave qrsh has already been written to the accounting file.
With mpich2 jobs, you can then run the script and send its output
instead of the supplied body (as the master task isn't doing much
work with mpich2). For other parallel job types you will need to add
the usage of all qrsh tasks to the values in the body that would have
been sent originally.
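
A minimal sketch of such a wrapper, under stated assumptions: SGE calls the configured mailer as "mailer -s <subject> <recipient>" with the body on stdin, and the subject of a job-end mail starts with "Job <id>" and contains "Complete". The function names and the ACCOUNTING_CMD and REAL_MAILER variables are made up for this illustration and are not part of SGE.

```shell
#!/bin/bash
# Hypothetical mail wrapper helpers (illustration only, not part of SGE).

# Extract the job number from a subject such as "Job 19 (mpich2.sh) Complete".
# Prints nothing if the subject does not start with "Job <number>".
job_id_from_subject() {
    printf '%s\n' "$1" | sed -n 's/^Job \([0-9][0-9]*\).*/\1/p'
}

# Read the original mail body from stdin. For job-end mails, wait so that
# the accounting records of all slave tasks can be written, then append
# the output of the accounting script to the body and pass it on.
# Arguments: subject, recipient, delay in seconds (default 180).
send_with_accounting() {
    local subject=$1 recipient=$2 delay=${3:-180}
    local body job
    body=$(cat)
    job=$(job_id_from_subject "$subject")
    if [ -n "$job" ] && [[ $subject == *Complete* ]]; then
        sleep "$delay"
        body="$body"$'\n\n'"$("${ACCOUNTING_CMD:-accounting}" -j "$job" 2>&1)"
    fi
    # REAL_MAILER is an assumed name for the mail program actually in use.
    printf '%s\n' "$body" | "${REAL_MAILER:-/bin/mail}" -s "$subject" "$recipient"
}

# A wrapper script installed as the mailer would end with something like:
#   [ "$1" = "-s" ] && send_with_accounting "$2" "$3"
```

This appends the summed values to the body rather than replacing it, which keeps the original information (exit status, host) in the mail. The delay of 180 seconds is a guess and has to be tuned to the cluster.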

-- Reuti

PS: Maybe there should be a "hook" for a custom mailer, which is  
triggered after the accounting record for the master task was written  
to the accounting file. This way the mails would also come (again)  
from the head node.
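
For the other parallel job types mentioned above, the per-task values have to be summed and converted back to the time format used in the mail. A rough sketch, assuming that "qacct -j <job>" prints ru_utime, ru_stime and cpu in seconds (as the accounting script below also assumes); the function names are made up for this illustration:

```shell
#!/bin/bash
# Sum ru_utime, ru_stime and cpu over all accounting records read from
# stdin (the output of "qacct -j <job>"). Prints "utime stime cpu" in
# whole seconds.
sum_job_usage() {
    awk '
        /^ru_utime/ { utime += $2 }
        /^ru_stime/ { stime += $2 }
        /^cpu/      { cpu   += $2 }
        END { printf "%d %d %d\n", utime, stime, cpu }
    '
}

# Format a number of seconds like the notification mail: [D:]HH:MM:SS
format_duration() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    if [ "$days" -gt 0 ]; then
        printf '%d:%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
    else
        printf '%02d:%02d:%02d\n' "$hours" "$minutes" "$seconds"
    fi
}

# Possible use:
#   read -r utime stime cpu < <(qacct -j 19 | sum_job_usage)
#   echo "User Time = $(format_duration "$utime")"
```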


> thanks
> Azhar
>
>
> Reuti <reuti at staff.uni-marburg.de> wrote:
>
> Hi,
>
> On 13.04.2008 at 18:36, Azhar Ali Shah wrote:
> > I am using rsh with daemon-based smpd (mpich2-1.0.7rc2) startup
> > method. The ps -e f gives:
> >
> > 5769 1 5768 /usr/SGE6/bin/lx24-x86/sge_qmaster
> > 5789 1 5789 /usr/SGE6/bin/lx24-x86/sge_schedd
> > 6337 1 6337 /usr/SGE6/bin/lx24-x86/sge_execd
> > 25736 6337 25736 \_ sge_shepherd-18 -bg
> > 25837 25736 25837 | \_ -sh /usr/SGE6/default/spool/taramel/
> > job_scripts/18
> > 25915 25837 25837 | \_ mpiexec -n 4 -machinefile /tmp/
> > 18.1.all.q/machines
> > 25806 6337 25806 \_ sge_shepherd-18 -bg
> > 25807 25806 25807 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > 25813 25807 25813 \_ /usr/SGE6/utilbin/lx24-x86/
> > qrsh_starter /usr/SGE6/
> > 25815 25813 25815 \_ /home/aas/local/mpich2_smpd/bin/
> > smpd -port 200
> > 25916 25815 25815 \_ /home/aas/local/mpich2_smpd/
> > bin/smpd -port
> > 25917 25916 25815 \_ /home/aas/par_procksi_Alone
> > 26641 25917 25815 | \_ ./fast /home/aas/
> > workspace/AzharPe
> > 25918 25916 25815 \_ /home/aas/par_procksi_Alone
> > 26640 25918 25815 \_ ./fast /home/aas/
> > workspace/AzharPe
> > ...
> > 25772 1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit taramel /
> > home/aas/local/m
> > 25808 25772 25737 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 57419
> > taramel.cs.nott.ac
> > 25814 25808 25737 \_ [rsh]
> > 25774 1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit smeg /home/
> > aas/local/mpic
> > 25817 25774 25737 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33059
> > smeg.cs.nott.ac.uk
> > 25818 25817 25737 \_ [rsh]
> > 25777 1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit eomer /home/
> > aas/local/mpi
> > 25819 25777 25737 \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33207
> > eomer.cs.nott.ac.u
> > 25820 25819 25737 \_ [rsh]
> >
> > But I still don't get any values for the User and System time
> > parameters:
> > Job 19 (mpich2.sh) Complete
> > User = aas
> > Queue = all.q at xxx
> > Host = taramel.cs.nott.ac.uk
> > Start Time = 04/13/2008 16:19:00
> > End Time = 04/13/2008 17:22:06
> > User Time = 00:00:00
> > System Time = 00:00:00
> > Wallclock Time = 01:03:06
> > CPU = 00:00:00
> > Max vmem = 10.074M
> > Exit Status = 0
> > Any ideas on how to change this behavior?
>
> This is the output of the master task of the parallel job, which is
> not doing much work. You will need all entries in the accounting
> file: run "qacct -j 19" and add them up.
>
> -- Reuti
>
> PS: I'm not sure whether I already posted this script to the list;
> it does the adding up for you. Just try: accounting -j 19
>
> #!/bin/bash
> #
> # accounting will reformat the output of the qacct command by SGE.
> #
> # Version 1.0 - 2005-03-16 Initial release
> #
> # Version 1.1 - 2006-09-15 Added -j to select a specific job
> #
> # Copyright (C) 2006 Reuti, email: reuti at staff.uni-marburg.de
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> # the Free Software Foundation; either version 2 of the License, or
> # (at your option) any later version.
> #
> # This program is distributed in the hope that it will be useful,
> # but WITHOUT ANY WARRANTY; without even the implied warranty of
> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> # GNU General Public License for more details.
> #
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write to the Free Software
> # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> #
>
> #
> # If there is an unrecoverable error: display a message and exit.
> #
>
> function printExit
> {
> case $1 in
> [iI]) echo INFO: "$2" >&2 ;;
> [wW]) echo WARNING: "$2" >&2 ;;
> [eE]) echo ERROR: "$2" >&2 ; exit 1 ;;
> *) echo "$1" >&2 ;;
> esac
> }
>
> #
> # System-dependent setup.
> #
>
> platform=`uname -s`
> case $platform in
>
> SunOS) awk_command=/usr/xpg4/bin/awk ;;
>
> *) awk_command=awk ;;
>
> esac
> awk_path=`which $awk_command`
> if [ ! -r "$awk_path" -o ! -x "$awk_path" ] ; then
> printExit W "No executable awk program found."
> printExit E "Please update the path to awk in \"status\" according to your installation."
> fi
>
> #
> # First define some functions.
> #
>
> function usage
> {
> cat <<-EOF
>
> NAME
> accounting - Display detailed accounting
>
> SYNTAX
> accounting [ options ] [ <user> ]
>
> DESCRIPTION
> Displays your own accounting by default. When a <user>
> is specified, only the accounting belonging to this user will
> be displayed.
>
> OPTIONS
>
> -a
> Display the accounting of all users. May not be used when
> a dedicated user is specified.
>
> -j <job>
> Calculate the total accounting information for one (parallel)
> job. May not be used when a dedicated user or all users are
> specified.
>
> -h
> Get this help page.
>
> Version 1.1 15 September 2006
> accounting(1)
> EOF
> exit 0
> }
>
> #
> # Analyze the given parameters to the command.
> #
>
> while getopts :ahj: options ; do
> case $options in
>
> a) CMDOPT_A="1" ;;
>
> h) usage ;;
>
> j) CMDOPT_J="1"
> job=$OPTARG ;;
>
> \?) printExit E "Invalid option: -$OPTARG." ;;
>
> esac
> done
>
> #
> # Shift the arguments up to the input without the option prefix "-".
> # This should be the user name.
> #
>
> shift $((OPTIND-1))
>
> #
> # Test whether a user was specified.
> #
>
> if [ -n "$CMDOPT_A" ] ; then
> if [ -n "$1" ] ; then
> printExit E "You may only specify -a *or* a user, but not both."
> else
> myuser="*"
> fi
> else
> if [ -n "$1" ] ; then
> myuser="$1"
> else
> myuser="$USER"
> fi
> selection="-o $myuser"
> fi
>
> #
> # Limit to one job if requested.
> #
>
> if [ -n "$CMDOPT_J" ]; then
> if [ -n "$1" -o -n "$CMDOPT_A" ] ; then
> printExit E "You may only specify -j *or* an/all user(s), but not both."
> else
> selection="$job"
> fi
> fi
>
> #
> # Now do it.
> #
>
> #
> # Do the accounting.
> #
>
> qacct -j $selection | $awk_command '
>
> BEGIN { firstrun=1 }
>
> /^account/ { if (firstrun)
> {
> firstrun=0
> }
>
> account=$2
>
> }
>
> /^ru_wallclock/ { wallclock[account]+=$2 }
>
> /^ru_utime/ { utime[account]+=$2 }
>
> /^ru_stime/ { stime[account]+=$2 }
>
> /^cpu/ { cpu[account]+=$2 }
>
> /^mem/ { mem[account]+=$2 }
>
> /^iow/ { iow[account]+=$2
> next }
>
> /^io/ { io[account]+=$2 }
>
> END { if (! firstrun)
> {
> if (jobonly)
> { printf ("Accounting for job: %s\n", jobnumber) }
> else
> { printf ("Accounting for user: %s\n", user) }
> printf ("\n")
> printf ("%-15s %9s %13s %13s %13s %18s %18s %18s\n",
> "ACCOUNT", "WALLCLOCK", "UTIME", "STIME", "CPU", "MEMORY", "IO", "IOW")
> printf ("============================================================================================================================\n")
> sort_command="sort"
>
> for (account in utime)
> {
> total_wallclock+=wallclock[account]
> total_utime+=utime[account]
> total_stime+=stime[account]
> total_cpu+=cpu[account]
> total_mem+=mem[account]
> total_io+=io[account]
> total_iow+=iow[account]
>
> printf("%-15s %9d %13d %13d %13d %18.3f %18.3f %18.3f\n",
> account, wallclock[account], utime[account], stime[account],
> cpu[account], mem[account], io[account], iow[account]) | sort_command
> }
>
> close(sort_command)
>
> printf ("============================================================================================================================\n")
> printf("%-15s %9d %13d %13d %13d %18.3f %18.3f %18.3f\n",
> "Total", total_wallclock, total_utime, total_stime,
> total_cpu, total_mem, total_io, total_iow)
>
> }
> } ' jobonly="$CMDOPT_J" jobnumber="$job" user="$myuser"
>
> #
> # So, that's all
> #
>
> exit 0
>
>
> > thanks
> > Azhar
> >
> >
> >
> > Reuti wrote:
> >
> > Hi,
> >
> > On 03.04.2008 at 12:24, Azhar Ali Shah wrote:
> > > Running a parallel job with MPICH2-1.0.7 + SGE requesting 4
> > > processors on my cluster gives the following statistics:
> > >
> > > Job 152 (DS1001-4P) Complete
> > > User = aas
> > > Queue = all.q at xxxx
> > > Host = smeg.cs.nott.ac.uk
> > > Start Time = 04/02/2008 20:07:37
> > > End Time = 04/03/2008 00:09:55
> > > User Time = 00:00:18
> > > System Time = 00:00:04
> > > Wallclock Time = 04:02:18
> > > CPU = 00:00:22
> > > Max vmem = 8.551M
> > > Exit Status = 0
> > >
> > > I wonder why user time and system time are so small compared
> > > to wall clock time. Before this, I ran the same task with the same
> > > data as a sequential job on a single machine, which gave the
> > > following statistics:
> > >
> > > Job 35 (batchjob.sh) Complete
> > > User = aas
> > > Queue = all.q at xxxx
> > > Host = smeg.cs.nott.ac.uk
> > > Start Time = 03/06/2008 17:01:34
> > > End Time = 03/08/2008 04:50:20
> > > User Time = 1:01:18:28
> > > System Time = 06:07:43
> > > Wallclock Time = 1:11:48:46
> > > CPU = 1:07:26:11
> > > Max vmem = 398.684M
> > > Exit Status = 0
> > >
> > > With the number of processors being 4 in the parallel job I can
> > > assume the wall clock time to be correct, but I can't understand
> > > the values of user and system time in the parallel version above.
> > > Any thoughts?
> >
> > These are the typical symptoms when your application is not tightly
> > integrated into SGE. Can you check with "ps -e f" that you are a)
> > using SGE's rsh command and b) all child processes are bound to
> > the sge_execd? Using the plain system /usr/bin/rsh or ssh will
> > otherwise lead to such behavior. If you need ssh, you have to
> > recompile SGE on your own to get a custom-built ssh including the
> > tight integration facility.
> >
> > (BTW: the wallclock time looks more like you used 8 cores IMO)
> >
> > -- Reuti
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net