[GE users] debugging mailer oddness on SGE 6

Chris Dagdigian dag at sonsorol.org
Wed Nov 30 20:29:55 GMT 2005

I've got a vanilla 6.0u3 SGE system with the standard mailer  
configured as "/usr/bin/mail". Under the hood on each compute node is  
a postfix MTA with a relayhost parameter that relays SMTP traffic  
along to the qmaster node for transit to the public network.

Running /usr/bin/mail from the command line works as expected on  
each compute node. Local logs show the connection and successful  
relay.

But currently when SGE email notification is requested, the jobs run  
and there is no indication that any sort of mail delivery attempt was  
made at all. The smtp logs on each compute node show no connection/ 
delivery attempts whatsoever.

The only interesting thing in the logs is an older message from a few  
days ago:

> 11/23/2005 13:56:28|execd|xxx|E|mailer had timeout - killing
> 11/23/2005 13:56:28|execd|xxx|E|mailer exited with exit status = 1

Long shot but ...

I'm wondering if a mailer error in SGE is similar to a queue that  
drops into state "E" in that it persists until "something" is done.  
If SGE encountered a fatal mailer error in the past, will it stop  
trying to send email for future jobs? Do I need to restart the execd  
daemons on a compute node or something?

If that is not the case I think my next step is going to involve writing  
a custom mailer script that can do some verbose logging. Does anyone  
have a simple mailer script or wrapper that I can use for this  
purpose? All I want to do is drop a custom mailer script in place  
that is capable of logging the fact that it has actually been  
invoked...then it can just pass along its data to /usr/bin/mail as  
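
A minimal sketch of the kind of logging wrapper described above, written in Python. It assumes the mailer is called in the form "mailer -s <subject> <recipient>" with the message body on stdin, which is my reading of the sge_conf man page and worth verifying on your version. The log path /tmp/sge_mailer_wrapper.log and the function name run_mailer are hypothetical choices for illustration; only /usr/bin/mail comes from the setup described here.

```python
#!/usr/bin/env python3
"""Hypothetical logging wrapper that sits in front of the real mailer."""
import subprocess
import sys
import time

REAL_MAILER = ("/usr/bin/mail",)            # the mailer named in this post
LOG_FILE = "/tmp/sge_mailer_wrapper.log"    # assumed location, pick your own


def run_mailer(argv, body, real_mailer=REAL_MAILER, log_file=LOG_FILE):
    """Log that we were invoked, then pass argv and body to the real mailer."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a") as log:
        log.write(f"{stamp} invoked args={list(argv)!r} body_bytes={len(body)}\n")
    try:
        # Forward the arguments unchanged and feed the body on stdin.
        result = subprocess.run(list(real_mailer) + list(argv), input=body)
        status = result.returncode
    except OSError as exc:
        # The real mailer is missing or not executable; record that too.
        status = 127
        with open(log_file, "a") as log:
            log.write(f"{stamp} could not start real mailer: {exc}\n")
    with open(log_file, "a") as log:
        log.write(f"{stamp} real mailer exit status={status}\n")
    return status


# Act only when called with arguments, as a mailer would be.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_mailer(sys.argv[1:], sys.stdin.buffer.read()))
```

To try it, make the script executable, point the "mailer" setting at it, submit a job with mail notification requested, and watch the log file. If no "invoked" line ever appears, the mailer is not being called at all, which would point back at execd rather than at postfix.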

