[GE users] reason for job abort in email [was: job summary info if mail is suppressed more]

Reuti reuti at staff.uni-marburg.de
Wed Jul 2 17:32:29 BST 2008


Hi (subject changed, as it might interest all users who want the  
reason for a job abort mentioned in their email in general),

On 02.07.2008 at 17:05, Patterson, Ron (NIH/NLM/NCBI) [C] wrote:

> We are in the process of evaluating SGE (6.2beta2) as a complement to
> our current LSF cluster. Our evaluation has been going well enough that
> we have asked some of our more savvy LSF users to send a few jobs
> through the SGE cluster. One of the first bits of feedback I got was the
> following question:
>
> If an SGE user requests an email to be sent when the job ends (qsub -m e
> ...), a nice summary of the job is sent when the job exits which
> includes a few useful stats and the job's exit status. This info is very
> similar to the info that LSF supplies as part of stdout when "bsub -o"

-o (besides -e) is also available in SGE to reroute the output.

> is used to define a stdout file. Is there a way to get this summary info
> printed to stdout/stderr when an SGE job exits? Most of our users submit
> many thousands of jobs so they will suppress any emails. It looks like
> they have become accustomed to parsing the stdout files produced by LSF to
> collect info about their jobs as they exit.
>
> BTW, we have already directed them to start looking at qacct, but I'm
> wondering if there is a solution which will allow as little change as
> possible in our users' workflow.

It's not possible out of the box: by the time the mail is sent, the job
is already over and $SGE_STDOUT_PATH is no longer known to SGE. But it
can be done with minimal scripting (it looks like more than it is):

=========================================

a) Define an epilog in the queue which writes the content of
$SGE_STDOUT_PATH to /tmp/sge/$JOB_ID, so that it is still known later on:

#!/bin/sh
echo "$SGE_STDOUT_PATH" > "/tmp/sge/$JOB_ID"

(To avoid cluttering /tmp, I suggest an intermediate directory sge
here. Maybe you already have a cron job running there to remove files
after a week or so. You will have to create /tmp/sge on all nodes
first.)
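
The housekeeping mentioned above could look roughly like the following sketch. The function name prune_sge_tmp and the seven-day default are assumptions for illustration; only /tmp/sge comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper: create the pointer directory if it is missing and
# remove pointer files older than a given number of days.
prune_sge_tmp() {
    dir=${1:-/tmp/sge}
    days=${2:-7}
    mkdir -p "$dir" || return 1
    # -mtime +N matches files last modified more than N days ago
    find "$dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

From cron one would call it as, for example, prune_sge_tmp /tmp/sge 7. Note that mkdir -p uses the caller's umask; whether the directory has to be writable by the job owners depends on which user runs the epilog at your site.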

=========================================

b) Write a small mail wrapper. For convenience I start from ours,
which also lists the reason for a kill (like time or memory exceeded),
and then add your request. As this might also interest others on the
list, here is the original one first:


#!/bin/sh
# $2 is the mail subject and $3 the recipient; both are passed on to mail.
JOB_ID=`echo "$2" | cut -d " " -f 2`
CONDITION=`echo "$2" | cut -d " " -f 4`
# First entry for this job in the execd messages file of this host
appendix=`grep "|job $JOB_ID\." /var/spool/sge/$HOSTNAME/messages | head -n 1`
if [ -z "$appendix" ]; then
     appendix="Unknown, no entry found in messages file on the master node of the job."
fi
if [ "$CONDITION" = "Aborted" ]; then
     (cat; echo; echo "Reason for job abort:"; echo "$appendix") | mail -s "$2" "$3"
else
     mail -s "$2" "$3"
fi
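
To see what the two cut calls extract, here is a small sketch. It assumes a subject of the form "Job <id> (<name>) <condition>", which is what the field numbers in the wrapper imply; the sample subject and the function name parse_subject are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the wrapper's parsing: field 2 of the
# space-separated subject is taken as the job id, field 4 as the condition.
parse_subject() {
    JOB_ID=`echo "$1" | cut -d " " -f 2`
    CONDITION=`echo "$1" | cut -d " " -f 4`
}

# Sample subject (an assumption, not copied from a real SGE mail)
parse_subject "Job 12345 (myjob) Aborted"
echo "id=$JOB_ID condition=$CONDITION"
```

A job name containing a space would shift field 4, so neither condition test would match and the wrapper would fall through to the plain mail branch.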


And you would need a small change to tee the email to the stdout file
of the job as well. The idea is to change the email only for
"completed" and "aborted" jobs, but not for all other emails sent by
SGE.


#!/bin/sh
# $2 is the mail subject and $3 the recipient; both are passed on to mail.
JOB_ID=`echo "$2" | cut -d " " -f 2`
CONDITION=`echo "$2" | cut -d " " -f 4`
appendix=`grep "|job $JOB_ID\." /var/spool/sge/$HOSTNAME/messages | head -n 1`
if [ -z "$appendix" ]; then
     appendix="Unknown, no entry found in messages file on the master node of the job."
fi
if [ "$CONDITION" = "Aborted" ]; then
     # Append the mail body plus the abort reason to the job's stdout file, then mail it
     (cat; echo; echo "Reason for job abort:"; echo "$appendix") | tee -a `cat /tmp/sge/$JOB_ID` | mail -s "$2" "$3"
elif [ "$CONDITION" = "Complete" ]; then
     tee -a `cat /tmp/sge/$JOB_ID` | mail -s "$2" "$3"
else
     mail -s "$2" "$3"
fi
# Remove the pointer file written by the epilog
if [ -n "$JOB_ID" ]; then
     rm -f /tmp/sge/$JOB_ID
fi
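
The tee step can be tried without SGE or a mail system. The following sketch uses temporary files in place of the job's stdout file, the /tmp/sge/$JOB_ID pointer and the mail command; all file names in it are made up.

```shell
#!/bin/sh
# Stand-ins (assumptions): a scratch directory instead of /tmp/sge and the job's cwd
work=`mktemp -d`
echo "output written by the job" > "$work/job.o42"   # pretend stdout file of the job
echo "$work/job.o42" > "$work/42"                    # pointer as written by the epilog

# Same shape as in the wrapper: append the mail body to the stdout file
# named in the pointer and pass it on unchanged (here to a file, not to mail).
echo "summary from the mail body" | tee -a `cat "$work/42"` > "$work/mail.txt"

cat "$work/job.o42"
```

The stdout file ends up with the job's own output followed by the summary, while mail.txt holds only the summary, which is what the recipient would get.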

You will have to define this mail wrapper in "qconf -mconf" like:

mailer                       /usr/sge/cluster/mailer.sh

(Take care if you have local configurations for all the nodes. If
they are the same anyway, I would suggest removing them all and using
only the global one.)

HTH - Reuti

