[GE users] Rescheduling emails

Olesen, Mark Mark.Olesen at emcontechnologies.com
Fri Sep 14 07:43:45 BST 2007


Hi Chris,

> memory/licensing/etc issues, the job is sent back to the queue
> (pe_start
> exit code 99).

Aside:
At the workshop this week, someone pointed out that a short sleep before
exiting 99 in the prolog might be a good idea, to give the condition a
chance to improve before the rescheduled job tries again.
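
A minimal sketch of that idea for a shell prolog or pe_start script. The
function name requeue_after_delay, the REQUEUE_DELAY variable and its
60-second default are assumptions made for illustration, not SGE settings;
only the exit code 99 comes from the thread above.

```shell
#!/bin/sh
# Hypothetical helper for a prolog or pe_start script.
# REQUEUE_DELAY is an assumed, site-chosen variable (seconds), not an SGE setting.
: "${REQUEUE_DELAY:=60}"

requeue_after_delay() {
    # $1: short reason, written to stderr
    echo "resource check failed ($1); sleeping ${REQUEUE_DELAY}s before requeue" >&2
    sleep "$REQUEUE_DELAY"
    # exit status 99 sends the job back to the queue
    exit 99
}

# Example use; check_licenses is a placeholder for the site's own test:
# check_licenses || requeue_after_delay "no free license"
```

The delay keeps the slot occupied while it runs, so a short value is
probably the safer choice.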

> Is anyone aware of an option to SGE that will mail only on execution
> of the job itself, not the prolog/epilog?

As a workaround, it looks like there is enough information in the emitted
text to determine if the mail message is meaningful:

> Job 1234 (bleh.csh) Rescheduled
>   Exit Status      = -1
...
> failed rescheduling because:
> 09/10/2007 15:20:25 [60055:6044]: exit_status of pe_start = 99


You can configure a mail wrapper with qconf -mconf:
     mailer  /opt/n1ge6/default/site/mailwrapper

I've attached an example below. Our simple wrapper adds some extra
information about a failed job and sends notices to the owner and the admin,
whether or not they were requested. To adjust it for your needs, you could
simply slurp in the entire STDIN and check for 'Job ... Rescheduled' and/or
'exit_status = ' expressions before deciding whether to forward the message
or suppress it.
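
As a rough illustration of that check, here is a sketch in plain sh. The
function name is_requeue_notice is made up for this example, and the two
patterns are taken only from the sample message quoted above, so they may
need adjusting for other SGE versions.

```shell
#!/bin/sh
# Hypothetical filter: succeeds (status 0) when the mail body on stdin
# looks like a reschedule notice caused by pe_start exiting with 99.
# The patterns are taken from the sample message quoted in this thread.
is_requeue_notice() {
    body=$(cat)    # slurp the whole message body from stdin
    printf '%s\n' "$body" | grep -Eq 'Job [0-9]+ .*Rescheduled' &&
    printf '%s\n' "$body" | grep -Eq 'exit_status of pe_start = 99'
}

# Possible use inside a wrapper, with /bin/mail as the real mailer:
# tmp=$(mktemp) || exit 1
# cat > "$tmp"
# is_requeue_notice < "$tmp" || /bin/mail "$@" < "$tmp"
# rm -f "$tmp"
```

Requiring both patterns means that a reschedule for any other reason is
still forwarded.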

The usual disclaimer about no responsibility for faulty code etc applies.
Let us know what you get working.
/mark

#!/bin/sh
# -*- perl -*-
# $Id: mailwrapper,v 1.2 2005/09/01 07:57:18 eva Exp $
# <settings>
# ---------------------------------------------------------------------
: ${SGE_ROOT:=/opt/n1ge6}
: ${SGE_CELL:=default}
for i in $SGE_ROOT/$SGE_CELL/site/environ; do [ -f $i ] && . $i; done
# ---------------------------------------------------------------------
# </settings>

exec perl -wx $0 "$@"
exit $?

## --------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
my $mailer = "/bin/mail";

my $job_id;

# catch things like:
#   SGE 6.0u1: Job-array task 11694.1 failed
#   SGE 6.0u1: Job 11819 failed

if ( @ARGV >= 2 and $ARGV[0] eq "-s" ) {
    $ARGV[1] = "[cluster] " . $ARGV[1];
    ($job_id) =
      $ARGV[1] =~ /\s+Job(?i:-array\s+task)?\s+(\d+)(?:\.\d+|\s+)/;
}

# use qstat to obtain extra info
my %info;
if ($job_id) {
    local @ARGV = "qstat -j $job_id |";
    %info = map { /^(owner|cwd|job_args):\s*(\S.*?)\s*$/ } <>;
}

#
# this is a hack, but avoids sending the same message multiple times
#
my $admin;
my $config = "$ENV{SGE_ROOT}/$ENV{SGE_CELL}/common/configuration";
if ( -f $config ) {
    local @ARGV = $config;
    # take the first address; the trailing ",..." list is optional
    ($admin) = map { /^\s*administrator_mail\s+(\S+?)(?:,.*)?\s*$/ } <>;
}

if ($admin) {
    grep { /(?:^|,|\s)\Q$admin\E(?:,|\s|$)/i } @ARGV or delete $info{owner};
}

#
# send the owner a copy too
#
unshift @ARGV, ( -c => "$info{owner}" ) if $info{owner};

open STDOUT, '|-', "$mailer", @ARGV or die "cannot open mailer '$mailer'\n";
select STDOUT;

if (%info) {
    print "$_: $info{$_}\n" for sort keys %info;
    print "------------------------------\n";
}

print while <STDIN>;

exit 0;

__END__
# ---------------------------------------------------------------------
SGE calls the "mailer" program of the global/local cluster config to send
mail. The mailer is called (more or less) like this:

    <cat> mail_body | <mailer> -s "subjectline" recipient_list

where "<cat>" means that the body of the mail is sent to stdin of the mailer
by the execd.