[GE users] Email notification for script errors

reuti reuti at staff.uni-marburg.de
Tue Apr 6 10:15:32 BST 2010


Hi,

On 02.04.2010 at 20:41, arvindpetaru wrote:

> I would like to customize the email body of the notifications sent when a job is aborted, e.g. by adding the script's error message to the email body. Sorry if this was already addressed before, but I couldn't find much information on it except for adding a new "mailer" wrapper, which isn't clear to me at this point. Can anyone tell me how to append an error message to the email body (abort case) for the submitted job?
> 
> Basically, I'm trying to kill the submitted job using "qdel $JOB_ID" when it sees any errors while running. I would like to append this script error to the abort email notification.

There is no straightforward way to do this. The email is sent after the job has already left the system (i.e. the execnode). The job's scratch directory is also removed by the time the email is sent (depending on timing it might still be there, but that's not guaranteed), as is the case for a global epilog.

The only place to store information that survives is the queue epilog. What you need to get this working looks convoluted at first glance, but we use it to send some other information about the job (how it was started); whatever is found about the job in the node's messages file is sent as well. That part could be cut out if you don't need it, but I'll just paste our scripts:

=> On every exechost, create a directory /var/spool/sge/context if the spool directory is local, or just one directory if the spool directory is shared


=> This should be owned by sgeadmin (or your admin user)
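
The two steps above can be sketched as a small helper. This is only a sketch: make_context_dir is a made-up name, and the 755 mode is my assumption (writable by the admin user, readable by whoever runs the mailer), so adjust it to your site.

```shell
#!/bin/sh
# Sketch: prepare the directory the epilog writes to and the mailer reads.
# make_context_dir is a hypothetical helper; path and owner are arguments,
# so nothing about a particular installation is hard-coded.
make_context_dir() {
    dir=$1      # e.g. /var/spool/sge/context
    owner=$2    # e.g. sgeadmin (your SGE admin user)
    mkdir -p "$dir" || return 1
    chown "$owner" "$dir" || return 1
    # Assumption: writable by the owner only, readable by everyone else.
    chmod 755 "$dir"
}

# Example call (run as root on each exechost):
# make_context_dir /var/spool/sge/context sgeadmin
```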


=> Before the `qdel` you have to write some error information to the context of the job. The job context is meta-information and just a comment for SGE. Like:

qalter -ac ERROR="Seek error" $JOB_ID

(please check the -ac, -sc and -dc options for qsub, and also try them on the command line without any scripting at all, just to see the effect)
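
Inside the job script this could be wrapped in a small function, roughly like the sketch below. abort_with_error and my_program are made-up names. Commas are replaced because context entries are comma-separated, and the mailer wrapper cuts the value at the first comma.

```shell
#!/bin/sh
# Sketch of a helper for the job script: store an error message in the
# job context, then delete the job. abort_with_error is a hypothetical name.
abort_with_error() {
    # Context entries are separated by commas, so keep commas out of the value.
    msg=$(printf '%s' "$1" | tr ',' ';')
    # Write the message to the job context ...
    qalter -ac "ERROR=$msg" "$JOB_ID"
    # ... and remove the job, which triggers the abort mail.
    qdel "$JOB_ID"
}

# Example use in the job script (my_program is a placeholder):
# my_program input.dat || abort_with_error "my_program failed, exit code $?"
```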


=> Create a queue epilog, which is run as sgeadmin, not the user:

$ qconf -sq all.q
...
epilog sgeadmin@/usr/sge/cluster/all.q.epilog


=> This epilog will then transfer the meta-information to the created "context" directory:

#!/bin/sh
. /usr/sge/default/common/settings.sh
if [ "$SGE_TASK_ID" != "undefined" ]; then
    JOB_ID=$JOB_ID.$SGE_TASK_ID
fi

COMMAND=`qstat -j $JOB_ID | grep -e "^context:"`
if [ -n "$COMMAND" ]; then
    if [ -d /var/spool/sge/context -a -w /var/spool/sge/context ]; then
        echo "$COMMAND" > /var/spool/sge/context/$JOB_ID
    fi
fi

#
# Be sure to exit with 0, even when the grep wasn't successful.
#

exit 0


=> Then the email wrapper has to append this information to the email:

$ cat mailer.sh 
#!/bin/sh

#
# Distinguish between normal jobs and an array job.
#

case `echo "$2" | cut -d " " -f 1` in

      Job) JOB_ID=`echo "$2" | cut -d " " -f 2`
           CONDITION=`echo "$2" | cut -d " " -f 4` ;;

Job-array) JOB_ID=`echo "$2" | cut -d " " -f 3`
           CONDITION=`echo "$2" | cut -d " " -f 5` ;;

        *) ;;

esac

#
# Get the entries for the context of the job and the
# reason in case of an abortion of the job.
#

if [ -f /var/spool/sge/context/$JOB_ID -a -r /var/spool/sge/context/$JOB_ID ]; then
    COMMAND=`cat /var/spool/sge/context/$JOB_ID`
    COMMAND=${COMMAND#*ERROR=}
    COMMAND=${COMMAND%%,*}
fi

if [ "$CONDITION" = "Aborted" ]; then
    if [ -f /var/spool/sge/$HOSTNAME/messages -a -r /var/spool/sge/$HOSTNAME/messages ]; then
        APPENDIX=`egrep "[|]job $JOB_ID([.][[:digit:]]+)? exceed" /var/spool/sge/$HOSTNAME/messages | head -n 1`
    fi

    if [ -z "$APPENDIX" ]; then
        APPENDIX="Unknown, no entry found in messages file on the master node of the job."
    fi
fi

#
# Now construct and send the email.
#
 
if [ -n "$COMMAND" ]; then
    if [ -n "$APPENDIX" ]; then
        (cat; echo; echo "Reason for job abort:"; echo "$APPENDIX"; echo; echo "Job error was: $COMMAND") | mail -s "$2" "$3"
    else
        (cat; echo; echo "Job error was: $COMMAND") | mail -s "$2" "$3"
    fi
else
    if [ -n "$APPENDIX" ]; then
        (cat; echo; echo "Reason for job abort:"; echo "$APPENDIX") | mail -s "$2" "$3"
    else
        mail -s "$2" "$3"
    fi
fi
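
To see what the two parsing steps in the wrapper do, they can be tried in isolation. The sample subject lines and the sample "context:" line below are assumptions about what SGE passes in and what qstat -j prints; parse_subject and extract_error are made-up names wrapping the same cut and parameter-expansion logic as above.

```shell
#!/bin/sh
# parse_subject: print "<job id> <condition>" for a mail subject line.
parse_subject() {
    case $(echo "$1" | cut -d " " -f 1) in
        Job)       echo "$1" | cut -d " " -f 2,4 ;;
        Job-array) echo "$1" | cut -d " " -f 3,5 ;;
    esac
}

# extract_error: print the value of ERROR from a "context:" line.
extract_error() {
    value=${1#*ERROR=}    # drop everything up to and including "ERROR="
    value=${value%%,*}    # drop further context entries after the first comma
    echo "$value"
}

# Sample inputs (assumed formats):
parse_subject "Job 4711 (myjob) Aborted"               # prints: 4711 Aborted
parse_subject "Job-array task 4711.3 (myjob) Aborted"  # prints: 4711.3 Aborted
extract_error "context:                    ERROR=Seek error,STARTED_BY=cron"  # prints: Seek error
```

Note that when the context line contains no ERROR= entry, the first expansion matches nothing and the whole line is kept, so the wrapper would then mail the complete context line.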


=> To get rid of the (small) files in the "context" directory you need a cron job, or you can extend SGE's logchecker.sh script.
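
For the cron job variant, something along these lines might do. clean_context_dir is a made-up name and the 7-day retention in the example call is an arbitrary assumption.

```shell
#!/bin/sh
# Sketch: remove context files that are older than a given number of days.
# clean_context_dir is a hypothetical helper.
clean_context_dir() {
    dir=$1      # e.g. /var/spool/sge/context
    days=$2     # retention in days
    # -type f: regular files only; -mtime +N: last modified more than N days ago
    find "$dir" -type f -mtime +"$days" -exec rm -f {} \;
}

# Example call, e.g. from a daily cron job run as the admin user:
# clean_context_dir /var/spool/sge/context 7
```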


-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252452
