[GE users] failed job emails configuration

wagoodman wgoodman at jcvi.org
Wed Apr 21 23:33:32 BST 2010


THANK YOU THANK YOU...

You have helped me so much.

Bill

-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Wednesday, April 21, 2010 4:45 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] failed job emails configuration

On 21.04.2010 at 20:26, wagoodman wrote:

> Thanks for the input... Where would I find the RFE? below is a snippet
> of thousands,

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1010

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1902

but as I now realize, your problem looks different: the emails you
get are sent before the prolog. Nevertheless, you can handle it in an
email wrapper, whose name you define in `qconf -mconf`:

#!/bin/sh

#
# Distinguish between normal jobs and an array job.
#

case `echo "$2" | cut -d " " -f 1` in

       Job) JOB_ID=`echo "$2" | cut -d " " -f 2`
            CONDITION=`echo "$2" | cut -d " " -f 4` ;;

Job-array) ARRAY_JOB="1"
            JOB_ID=`echo "$2" | cut -d " " -f 3`
            CONDITION=`echo "$2" | cut -d " " -f 5` ;;

         *) ;;

esac

#
# Check for ERROR state emails, which are sent before the job starts.
# These are sent to the cluster admin.
# Emails for non-array jobs are sent by the normal send at the end of the
# script.
#

if [ "$(echo "$2" | cut -d " " -f 6)" = "failed" ]; then

     TASK_ID=`echo "$2" | cut -d " " -f 5`
     if [ "${TASK_ID#*.}" -eq 1 ]; then
         mail -s "$2" "$3"
     fi

     exit 0
fi

#
# Check reason for email to the user for -m a.
#

if [ "$CONDITION" = "Set" ]; then
     if [ -z "$ARRAY_JOB" ] || [ "${JOB_ID##*.}" -eq 1 ]; then
         mail -s "$2" "$3"
     fi
     exit 0
fi

#
# Now send the normal emails.
#

mail -s "$2" "$3"
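
For reference, the entry to change in `qconf -mconf` is the `mailer`
parameter of the cluster configuration. It is called as
`mailer -s <subject> <recipient>`, which is why the script above reads the
subject from $2 and the recipient from $3. A sketch of the relevant line,
assuming the wrapper was saved under the placeholder path shown (the path
is an assumption, not something from this thread):

```
mailer                       /path/to/mail-wrapper.sh
```

The wrapper needs to be executable and available under that path on the
hosts that send the job mails.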

Pitfall: Don't use spaces in job names, as the field numbers would
shift.
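
To make the pitfall concrete, here is a small sketch of how the `cut`
field positions move. The two subject lines are made up for illustration;
they only follow the layout that the field numbers in the wrapper imply
(job name as the third field for a non-array job), and the real subject
text may differ between versions.

```shell
#!/bin/sh
# Hypothetical subject lines, not copied from real SGE output.
subject_ok="Job 4888233 (myjob) Set in error state"
subject_bad="Job 4888233 (my job) Set in error state"

# Same extraction as the wrapper uses for CONDITION on a non-array job.
condition_of() {
    echo "$1" | cut -d " " -f 4
}

ok=$(condition_of "$subject_ok")
bad=$(condition_of "$subject_bad")

echo "without space in name: $ok"
echo "with space in name:    $bad"
```

With the first subject the fourth field is "Set"; with the second it
is "job)", so the `-m a` check in the wrapper would no longer match.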


> of emails I receive when a job fails. This is an issue that happens
> maybe once a month:
>
> ----------------------------------- snippet -----------------------------------
> Job 4888233 caused action: Job 4888233 set to ERROR

This is not an array job, just an example for demo purposes?

-- Reuti


> User        = amoustaf
> Queue       = fast.q at dell-3-3-1.jcvi.org
> Start Time  = <unknown>
> End Time    = <unknown>
> failed opening input/output file:04/20/2010 16:07:53 [2846:17364]:
> error: can't open output file "/local/ifs_projects/GOSII/ahmed/phy
> Shepherd trace:
> 04/20/2010 16:07:53 [1132:17363]: shepherd called with uid = 0, euid = 1132
> 04/20/2010 16:07:53 [1132:17363]: csp = 0
> 04/20/2010 16:07:53 [1132:17363]: starting up 6.2u3
> 04/20/2010 16:07:53 [1132:17363]: setpgid(17363, 17363) returned 0
> 04/20/2010 16:07:53 [1132:17364]: Child: Starting son(prolog, sgeworker@/usr/local/sge_current/jcvi-scripts/prolog dell-3-3-1.jcvi.org amoustaf 4888233 targetp fast.q, 0);
>
> I have created a folder in MS Outlook, but that is not even a
> band-aid... Above is one of approximately 35,000.
> The mail header always reads "SGE 6.2u3: Job 4888233 failed": the
> version of SGE, the job ID and "failed". How could I write a wrapper?
> BTW, we use MS Exchange. Any ideas?
>
> Thanks
>
> Bill
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, April 21, 2010 1:09 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] failed job emails configuration
>
> On 21.04.2010 at 18:52, wagoodman wrote:
>
>> This is the problem I'm having. I have my SGE set up to send email
>> alerts to sgealerts (a mailing list that another person and I belong
>> to), so when jobs fail I get notified. However, this can be a
>> double-edged sword: when users submit array jobs (30 to 50,000), this
>> brings MS Outlook to its knees, sometimes rendering my PC helpless.
>> Is there a configuration setting to send just one email if a batch or
>> array job fails? Please help, the spam is killing me.
>
> There is already an RFE for it. Do all tasks fail if any fails? It
> could be put into a mail-wrapper, but that needs some persistent
> information about the job context to be stored, as the mail-wrapper no
> longer has access to the job's entries (or send an email by hand
> inside the job script if the error could be trapped, but then only if
> $SGE_TASK_LAST equals $SGE_TASK_ID for the actual job).
>
> Would this be feasible?
>
> -- Reuti
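
The parenthetical above (sending one mail by hand from inside the job
script) could look roughly like the sketch below. It is only an
illustration: `run_payload`, `is_last_task`, `notify`, `NOTIFY_ADDR` and
`DRY_RUN` are hypothetical names, and it assumes the failure shows up as a
non-zero exit status. Since only the last task reports, it fits the case
where all tasks fail for the same reason.

```shell
#!/bin/sh
# Sketch of a job script that mails at most once per array job.
# SGE_TASK_ID and SGE_TASK_LAST are set by Grid Engine for array tasks;
# the defaults below only exist so the sketch runs outside the cluster.
SGE_TASK_ID=${SGE_TASK_ID:-1}
SGE_TASK_LAST=${SGE_TASK_LAST:-1}
NOTIFY_ADDR=${NOTIFY_ADDR:-sgealerts}   # hypothetical recipient variable

# Hypothetical stand-in for the real work of the task; always fails here.
run_payload() {
    false
}

# True only for the task whose ID equals the last task ID.
is_last_task() {
    [ "$SGE_TASK_ID" = "$SGE_TASK_LAST" ]
}

# Prints the mail command instead of running it unless DRY_RUN is set to
# something other than 1, so the sketch can be tried without sending mail.
notify() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: mail -s \"$1\" $NOTIFY_ADDR"
    else
        echo "$1" | mail -s "$1" "$NOTIFY_ADDR"
    fi
}

if ! run_payload; then
    if is_last_task; then
        notify "Job ${JOB_ID:-unknown} failed (task $SGE_TASK_ID)"
    fi
fi
```

Note that the last task is not necessarily the last one to finish, so
this only limits the number of mails; it does not summarize the whole
array job.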
>
>
>> Bill

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254376

To unsubscribe from this discussion, e-mail:
[users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254389

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


