[GE users] Some Hopefully Useful Scripts
guy.mareels at gmail.com
Thu May 28 14:06:43 BST 2009
The error indeed stems from the script line with USER_TOTAL_CPUTIME that you mentioned. It seems that qacct outputs the CPU time as a floating-point number, in contrast to the wall-clock time, which is an integer.
OWNER     WALLCLOCK        UTIME      STIME          CPU       MEMORY     IO    IOW
XXXxxxxx     711462  3501497.502  22604.638  3548712.481  1240228.791  0.000  0.000
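I solved it using sum=`echo "scale=0; $sum+$num" | bc`, that is, by doing the addition in 'bc' instead of with the shell's integer arithmetic. Note that scale=0 only affects division in bc, not addition; it is the cut -d'.' -f1 in the script that actually gets rid of the fractional part.
This leads to the following adaptation of the script: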
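sum=0    # assumed starting value; the initialisation was not shown
qacct -o "$i" -d $DAYS -q short.q | grep "$i" | tr -s " " "\t" | cut -f 7 | cut -d'.' -f1 >tempdata
echo $sum >sumdata
cat tempdata | \
while read num
do
    # add the integer part of this node's CPU time to the running total
    sum=`echo "scale=0; $sum+$num" | bc`
    echo $sum >sumdata
done
export USER_TOTAL_CPUTIME=`cat sumdata`
rm -f sumdata
rm -f tempdata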
I write to the temporary files 'sumdata' and 'tempdata' because the $sum variable is somehow cleared when exiting the 'while' loop. The while loop is necessary because I adapted the scripts to calculate the usage of the various queues we have defined on our HPC. With '-q short.q' added, the output of qacct is a matrix of CPU and wall-clock times, one row for each node in the queue. So I need the while loop (and the temp files) to add up the times for all the nodes.
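The $sum variable is most likely lost because bash runs each part of a pipeline in a subshell, so the assignments made inside `cat tempdata | while read num` never reach the parent shell. A minimal sketch of one way around this without temp files: put the loop in a function and capture what it prints. The function name `sum_integer_parts` and the sample numbers are made up for illustration; in the real script the input would come from the qacct pipeline above.

```shell
# Sum the integer part of each number read from stdin (one per line).
# Hypothetical helper, not part of the original scripts.
sum_integer_parts() {
    total=0
    while read -r num; do
        num=${num%%.*}              # strip the fractional part, like cut -d'.' -f1
        if [ -z "$num" ]; then
            continue                # skip blank lines
        fi
        total=$(( $total + $num ))
    done
    echo "$total"
}

# The loop still runs in a subshell, but its result is printed and
# captured by the command substitution, so nothing is lost.
USER_TOTAL_CPUTIME=$(printf '%s\n' 3548712.481 120.9 7 | sum_integer_parts)
echo "$USER_TOTAL_CPUTIME"
```

Because the total travels back as output rather than as a variable, the 'sumdata' and 'tempdata' files are not needed in this variant.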
Just a remark, outside the scope of my former problem: I tend to calculate the total wall-clock time by multiplying, for each node, the wall-clock time by the number of slots used. I found that the reported wall-clock time is the real wall-clock time the node was occupied, but it does not take into account how many CPU cores of the node were used. As our jobs tend to show quite a big difference between "CPU time" on the one hand and "wall-clock time multiplied by number of CPU cores used" on the other (for example due to interactive jobs), this is necessary to calculate the real 'occupation' of our cluster. Because of the interactive jobs, CPU time is no real indication of the usage of our cluster.
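The slot-weighted figure described above can be sketched as follows. This assumes the wall-clock time and slot count have already been extracted into lines of the form "wallclock slots"; that input format, the function name `slot_seconds` and the sample values are assumptions for illustration, not something qacct prints directly.

```shell
# Add up wallclock * slots over all input lines ("wallclock slots" per line).
# Hypothetical helper; the input format is an assumption.
slot_seconds() {
    total=0
    while read -r wallclock slots; do
        # Skip blank or incomplete lines.
        if [ -z "$wallclock" ] || [ -z "$slots" ]; then
            continue
        fi
        wallclock=${wallclock%%.*}  # keep the integer part only
        total=$(( $total + $wallclock * $slots ))
    done
    echo "$total"
}

# Three made-up records: 3600 s on 4 slots, 1800 s on 1 slot, 600 s on 8 slots.
OCCUPIED_SLOT_SECONDS=$(printf '%s\n' '3600 4' '1800 1' '600 8' | slot_seconds)
echo "$OCCUPIED_SLOT_SECONDS"
```

Comparing this number with the summed CPU time gives a rough idea of how much reserved capacity sat idle, for example in interactive jobs.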
Thanks for your scripts, and for your help.
Let me know if you have additional remarks,
On Tue, May 26, 2009 at 11:10 PM, butters <chris at pearit.co.uk> wrote:
Sorry it took me a while to get back to you, I was away last week.
I got very similar errors on my system a couple of times. I traced them back to these qacct calls:
USER_TOTAL_WALLCLOCK=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 2`
USER_TOTAL_CPUTIME=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 5`
I found that, for one user, SGE hadn't calculated the totals that should be displayed by the 'qacct -o UserName -d Integer' command. This left the script with null values for these two variables, causing all the 'integer expression expected' and 'non-numeric argument' errors. These errors are also why some of the figures aren't calculated at the end.
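A small guard can keep empty or non-numeric values from reaching the integer tests. The sketch below is only one possible approach; `int_or_zero` is a hypothetical helper, and falling back to 0 is an assumption that may hide missing accounting data, so printing a warning instead might be preferable.

```shell
# Return the integer part of $1, or 0 if it is empty or not a number.
# Hypothetical helper, not part of the original scripts.
int_or_zero() {
    value=${1%%.*}                  # drop any fractional part
    case "$value" in
        ''|*[!0-9]*) echo 0 ;;      # empty or non-numeric: fall back to 0
        *) echo "$value" ;;
    esac
}

# Simulate the two cases from the thread: a missing total and a float.
USER_TOTAL_WALLCLOCK=$(int_or_zero "")
USER_TOTAL_CPUTIME=$(int_or_zero "3548712.481")
echo "$USER_TOTAL_WALLCLOCK $USER_TOTAL_CPUTIME"
```

In the script, the result of each qacct pipeline would be passed through the helper before it is used in a test such as [ ... -gt ... ].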
I never did find out why these figures weren't coming up in qacct - it suddenly fixed itself the next morning and hasn't happened again since!
In your case, the best way to figure it out would be to run the script like this:
bash -x user_job_stats.sh 1000 total
This'll produce tons of output, so make sure your terminal window stores lots of history (I set my PuTTY sessions to '99999' lines when I'm doing this). The first step is to find out which users this is occurring for, or whether it's all of them. I then usually pick a random affected iteration of the for loop and step through the debug output to work out what's going on.
If you like, post the debug output for one iteration of the for loop here and I'll see what I can make of it for you.
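As an alternative to a large scrollback buffer, the trace can be written to a file, since bash -x sends it to stderr. The sketch below uses a tiny stand-in script so that it runs on its own; the stand-in's contents and the user names are made up, and in practice the command would be run against user_job_stats.sh with its usual arguments.

```shell
# Stand-in for user_job_stats.sh so the example is self-contained.
demo_script=$(mktemp)
trace_log=$(mktemp)
cat > "$demo_script" <<'EOF'
for i in alice bob; do
    USER_TOTAL_CPUTIME=""
    [ "$USER_TOTAL_CPUTIME" -gt 0 ] && echo "$i used CPU"
done
echo finished
EOF

# The xtrace output and the error messages both go to stderr,
# so they end up interleaved in the log file.
LC_ALL=C bash -x "$demo_script" 2> "$trace_log"

# Show where the errors occur; the trace lines just above each hit
# show the variable values for that iteration.
grep -n 'integer expression expected' "$trace_log"
```

The log can then be searched or paged through at leisure, and a single iteration is easy to cut out and post.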
Hope this helps,