[GE users] User Time + System Time != Wall Clock Time

Reuti reuti at staff.uni-marburg.de
Mon Apr 14 17:09:02 BST 2008


Hi,

On 13.04.2008 at 18:36, Azhar Ali Shah wrote:
> I am using rsh with the daemon-based smpd (mpich2-1.0.7rc2) startup
> method. The output of ps -e f is:
>
>  5769     1  5768 /usr/SGE6/bin/lx24-x86/sge_qmaster
>  5789     1  5789 /usr/SGE6/bin/lx24-x86/sge_schedd
>  6337     1  6337 /usr/SGE6/bin/lx24-x86/sge_execd
> 25736  6337 25736  \_ sge_shepherd-18 -bg
> 25837 25736 25837  |   \_ -sh /usr/SGE6/default/spool/taramel/job_scripts/18
> 25915 25837 25837  |       \_ mpiexec -n 4 -machinefile /tmp/18.1.all.q/machines
> 25806  6337 25806  \_ sge_shepherd-18 -bg
> 25807 25806 25807      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> 25813 25807 25813          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/
> 25815 25813 25815              \_ /home/aas/local/mpich2_smpd/bin/smpd -port 200
> 25916 25815 25815                  \_ /home/aas/local/mpich2_smpd/bin/smpd -port
> 25917 25916 25815                      \_ /home/aas/par_procksi_Alone
> 26641 25917 25815                      |   \_ ./fast /home/aas/workspace/AzharPe
> 25918 25916 25815                      \_ /home/aas/par_procksi_Alone
> 26640 25918 25815                          \_ ./fast /home/aas/workspace/AzharPe
> ...
> 25772     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit taramel /home/aas/local/m
> 25808 25772 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 57419 taramel.cs.nott.ac
> 25814 25808 25737      \_ [rsh] <defunct>
> 25774     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit smeg /home/aas/local/mpic
> 25817 25774 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33059 smeg.cs.nott.ac.uk
> 25818 25817 25737      \_ [rsh] <defunct>
> 25777     1 25737 /usr/SGE6/bin/lx24-x86/qrsh -inherit eomer /home/aas/local/mpi
> 25819 25777 25737  \_ /usr/SGE6/utilbin/lx24-x86/rsh -p 33207 eomer.cs.nott.ac.u
> 25820 25819 25737      \_ [rsh] <defunct>
>
> but I still don't get any values for the User Time and System Time
> fields:
> Job 19 (mpich2.sh) Complete
> User = aas
> Queue = all.q at xxx
> Host = taramel.cs.nott.ac.uk
> Start Time = 04/13/2008 16:19:00
> End Time = 04/13/2008 17:22:06
> User Time = 00:00:00
> System Time = 00:00:00
> Wallclock Time = 01:03:06
> CPU = 00:00:00
> Max vmem = 10.074M
> Exit Status = 0
> Any ideas on how to change this behavior?

This is the accounting of the master task of the parallel job only, and
the master task is not doing much work itself. You need all entries of the
job in the accounting file, as listed by qacct -j 19, and have to add them up.
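
A minimal sketch of the adding-up step, assuming the "name value" line
layout of qacct output. The function name sum_qacct_field and the sample
numbers are made up for illustration only:

```shell
#!/bin/bash
# Sum one numeric field (e.g. ru_utime) over all records that
# "qacct -j <jobnumber>" prints. The qacct output is read from stdin.
sum_qacct_field() {
    awk -v field="$1" '$1 == field { total += $2 } END { printf "%d\n", total }'
}

# Illustrative input only: two made-up records (a master task and one
# slave task) in the "name value" layout of qacct output.
sample='ru_utime     0
ru_stime     0
ru_utime     3550
ru_stime     120'

printf '%s\n' "$sample" | sum_qacct_field ru_utime   # prints 3550
```

On a cluster the function would be fed from the real command instead, for
example: qacct -j 19 | sum_qacct_field ru_utime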

-- Reuti

PS: I'm not sure whether I already posted this script to the list; it does
the adding up for you. Just try: accounting -j 19

#!/bin/bash
#
# accounting will reformat the output of the qacct command by SGE.
#
# Version 1.0 - 2005-03-16 Initial release
#
# Version 1.1 - 2006-09-15 Added -j to select a specific job
#
# Copyright (C) 2006 Reuti, email: reuti at staff.uni-marburg.de
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA
#

#
# If there is an unrecoverable error: display a message and exit.
#

function printExit
{
     case $1 in
         [iI]) echo INFO: "$2" >&2 ;;
         [wW]) echo WARNING: "$2" >&2 ;;
         [eE]) echo ERROR: "$2" >&2 ; exit 1 ;;
            *) echo "$1" >&2 ;;
     esac
}

#
# System-dependent setup.
#

platform=`uname -s`
case $platform in

     SunOS) awk_command=/usr/xpg4/bin/awk ;;

         *) awk_command=awk ;;

esac
awk_path=`which $awk_command`
if [ ! -r "$awk_path" -o ! -x "$awk_path" ] ; then
     printExit W "No executable awk program found."
     printExit E "Please update the path to awk in \"accounting\" according to your installation."
fi

#
# First define some functions.
#

function usage
{
     cat <<-EOF

	NAME
	    accounting - Display detailed accounting

	SYNTAX
	    accounting [ options ] [ <user> ]

	DESCRIPTION
	    Displays your own accounting by default. When a <user>
	    is specified, only the accounting belonging to this user will
	    be displayed.

	OPTIONS

	    -a
	        Display the accounting of all users. May not be used when there
	        is a dedicated user specified.

	    -j <jobnumber>
	        Calculate the total account information for one (parallel) job.
	        May not be used when there is a dedicated or all user(s) specified.

	    -h
	        Get this help page.

	    Version 1.1            15 September 2006            accounting(1)
	EOF
     exit 0
}

#
# Analyze the given parameters to the command.
#

while getopts :ahj: options ; do
     case $options in

         a) CMDOPT_A="1" ;;

         h) usage ;;

         j) CMDOPT_J="1"
            job=$OPTARG ;;

        \?) printExit E "Invalid option: -$OPTARG." ;;

     esac
done

#
# Shift the arguments up to the input without the option prefix "-".
# This should be the user name.
#

shift $((OPTIND-1))

#
# Test whether a user was specified.
#

if [ -n "$CMDOPT_A" ] ; then
     if [ -n "$1" ] ; then
         printExit E "You may only specify -a *or* a user, but not both."
     else
         myuser="*"
     fi
else
     if [ -n "$1" ] ; then
         myuser="$1"
     else
         myuser="$USER"
     fi
     selection="-o $myuser"
fi

#
# Limit to one job if requested.
#

if [ -n "$CMDOPT_J" ]; then
     if [ -n "$1" -o -n "$CMDOPT_A" ] ; then
         printExit E "You may only specify -j <jobnumber> *or* a user/all users, but not both."
     else
         selection="$job"
     fi
fi

#
# Do the accounting.
#

qacct -j $selection | $awk_command '

         BEGIN { firstrun=1 }

         /^account/           { if (firstrun)
                                {
                                    firstrun=0
                                }

                                account=$2

                              }

         /^ru_wallclock/      { wallclock[account]+=$2 }

         /^ru_utime/          { utime[account]+=$2 }

         /^ru_stime/          { stime[account]+=$2 }

         /^cpu/               { cpu[account]+=$2 }

         /^mem/               { mem[account]+=$2 }

         /^iow/               { iow[account]+=$2
                                next }

         /^io/                { io[account]+=$2 }

         END { if (! firstrun)
               {
                   if (jobonly)
                       { printf ("Accounting for job: %s\n", jobnumber) }
                   else
                       { printf ("Accounting for user: %s\n", user) }
                   printf ("\n")
                   printf ("ACCOUNT         WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW\n")
                   printf ("============================================================================================================================\n")
                   sort_command="sort"

                   for (account in utime)
                   {
                       total_wallclock+=wallclock[account]
                       total_utime+=utime[account]
                       total_stime+=stime[account]
                       total_cpu+=cpu[account]
                       total_mem+=mem[account]
                       total_io+=io[account]
                       total_iow+=iow[account]

                       printf("%-15s %9d %13d %13d %13d %18.3f %18.3f %18.3f\n",
                              account, wallclock[account], utime[account], stime[account], cpu[account], mem[account], io[account], iow[account]) | sort_command
                   }

                   close(sort_command)

                   printf ("============================================================================================================================\n")
                   printf("Total           %9d %13d %13d %13d %18.3f %18.3f %18.3f\n",
                          total_wallclock, total_utime, total_stime, total_cpu, total_mem, total_io, total_iow)

               }
             } ' jobonly="$CMDOPT_J" jobnumber="$job" user="$myuser"

#
# So, that's all
#

exit 0


> thanks
> Azhar
>
>
>
> Reuti <reuti at staff.uni-marburg.de> wrote: Hi,
>
> On 03.04.2008 at 12:24, Azhar Ali Shah wrote:
> > Running a parallel job with MPICH2-1.0.7 + SGE, requesting 4
> > processors on my cluster, gives the following statistics:
> >
> > Job 152 (DS1001-4P) Complete
> > User = aas
> > Queue = all.q at xxxx
> > Host = smeg.cs.nott.ac.uk
> > Start Time = 04/02/2008 20:07:37
> > End Time = 04/03/2008 00:09:55
> > User Time = 00:00:18
> > System Time = 00:00:04
> > Wallclock Time = 04:02:18
> > CPU = 00:00:22
> > Max vmem = 8.551M
> > Exit Status = 0
> >
> > I wonder why user time and system time are so small compared
> > to wall clock time. Before this, I ran the same task with the same data
> > as a sequential job on a single machine, which gave the following
> > statistics:
> >
> > Job 35 (batchjob.sh) Complete
> > User = aas
> > Queue = all.q at xxxx
> > Host = smeg.cs.nott.ac.uk
> > Start Time = 03/06/2008 17:01:34
> > End Time = 03/08/2008 04:50:20
> > User Time = 1:01:18:28
> > System Time = 06:07:43
> > Wallclock Time = 1:11:48:46
> > CPU = 1:07:26:11
> > Max vmem = 398.684M
> > Exit Status = 0
> >
> > With the number of processors being 4 in the parallel job, I can accept
> > the wall clock time as correct, but I can't understand the values of user
> > and system time in the parallel version above. Any thoughts?
>
> These are the typical symptoms when your application is not tightly
> integrated into SGE. Can you check with "ps -e f" that a) you are
> using SGE's rsh command and b) all child processes are bound to
> the sge_execd? Using the plain system /usr/bin/rsh or ssh will
> otherwise lead to such behavior. If you need ssh, you have to
> recompile SGE on your own to get a custom-built ssh that includes the
> tight integration facility.
>
> (BTW: the wallclock time looks more like you used 8 cores IMO)
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

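The check suggested in the quoted reply above (are all child processes
bound to the sge_execd?) can be sketched as a walk up the parent chain. This
is only an illustration: has_ancestor is a hypothetical helper, and the
sample table is transcribed by hand from the ps listing in this thread, with
command names shortened.

```shell
#!/bin/bash
# Report "yes" if process <pid> has an ancestor whose command name is
# <name>, else "no". The process table is read from stdin as
# "pid ppid comm" lines, e.g. from: ps -e -o pid= -o ppid= -o comm=
has_ancestor() {
    awk -v pid="$1" -v want="$2" '
        { ppid[$1] = $2; comm[$1] = $3 }
        END {
            # Walk up the parent chain until init (pid 1) or an unknown pid.
            while ((pid in ppid) && pid > 1) {
                pid = ppid[pid]
                if (comm[pid] == want) { print "yes"; exit }
            }
            print "no"
        }'
}

# Hand-transcribed excerpt of the ps listing from this thread.
table='6337 1 sge_execd
25806 6337 sge_shepherd
25807 25806 rshd
25813 25807 qrsh_starter
25815 25813 smpd
25916 25815 smpd
25917 25916 par_procksi
25772 1 qrsh'

printf '%s\n' "$table" | has_ancestor 25917 sge_execd   # prints yes
printf '%s\n' "$table" | has_ancestor 25772 sge_execd   # prints no
```

On a live node the table would come from ps itself, for example:
ps -e -o pid= -o ppid= -o comm= | has_ancestor <pid> sge_execd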

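The "8 cores" remark in the quoted reply can be checked with a little
arithmetic on the two wallclock times reported in the thread (1:11:48:46 for
the sequential run, 04:02:18 for the parallel run). A sketch, in which
to_seconds is a hypothetical helper and the ratio is only a rough indicator,
since the two runs are not otherwise controlled:

```shell
#!/bin/bash
# Convert a [D:]HH:MM:SS duration, as printed in the job mails, to seconds.
to_seconds() {
    local IFS=:
    local -a f
    read -r -a f <<< "$1"
    # Pad on the left to four fields: days, hours, minutes, seconds.
    while [ "${#f[@]}" -lt 4 ]; do f=(0 "${f[@]}"); done
    # 10# forces base 10 so that fields like "08" are not read as octal.
    echo $(( 10#${f[0]} * 86400 + 10#${f[1]} * 3600 + 10#${f[2]} * 60 + 10#${f[3]} ))
}

seq_wall=$(to_seconds 1:11:48:46)   # sequential job 35: 128926 s
par_wall=$(to_seconds 04:02:18)     # parallel job 152:   14538 s

# Ratio of sequential to parallel wallclock time.
awk -v s="$seq_wall" -v p="$par_wall" 'BEGIN { printf "%.2f\n", s / p }'   # prints 8.87
```

A ratio near 9 is more than 4 processes alone would explain, which is
consistent with the remark that the run looks like it used about 8 cores.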