[GE users] debugging mailer oddness on SGE 6

Reuti reuti at staff.uni-marburg.de
Wed Nov 30 20:59:26 GMT 2005


Hi Chris,

On 30.11.2005 at 21:29, Chris Dagdigian wrote:

>
> I've got a vanilla 6.0u3 SGE system with the standard mailer  
> configured as "/usr/bin/mail". Under the hood on each compute node  
> is a postfix MTA with a relayhost parameter that relays SMTP  
> traffic along to the qmaster node for transit to the public network.

I have the same configuration, but I already rewrite the sender
address on the nodes to that of the cluster's head node. Otherwise
the mails would not make it through the faculty's relays, because the
internal names of the cluster nodes are unknown to any DNS that the
relays along the way use.

> Running  /usr/bin/mail works as expected via the command line just  
> fine on each compute node. Local logs show the connection and  
> successful relay.
>
> But currently when SGE email notification is requested, the jobs  
> run and there is no indication that any sort of mail delivery  
> attempt was made at all. The smtp logs on each compute node show no  
> connection/delivery attempts whatsoever.
>
> The only interesting thing in the logs is an older message from a  
> few days ago:
>
>> 11/23/2005 13:56:28|execd|xxx|E|mailer had timeout - killing
>> 11/23/2005 13:56:28|execd|xxx|E|mailer exited with exit status = 1
>
> Long shot but ...

Nothing in /var/log/mail besides your command-line tests? When I
configured postfix, I had to stop and start it after the changes, as
a reload wasn't enough for postfix to pick up all of them. But I
never had the issue that SGE gave up using the mailer (u4 & u6).
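
The execd messages quoted above are pipe-separated (timestamp, daemon, host, severity, text), so it is easy to check whether a node has logged further mailer errors since then. A minimal Python sketch, under assumptions: the function names are made up for illustration, and the path of the execd messages file is site-specific, so it is passed in rather than hard-coded.

```python
def parse_message_line(line):
    """Split 'timestamp|daemon|host|level|text' into a dict, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected five-field layout
    timestamp, daemon, host, level, text = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text,
    }


def mailer_events(lines):
    """Yield the parsed entries whose message text mentions the mailer."""
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and "mailer" in entry["text"]:
            yield entry


def print_mailer_events(path):
    """Print all mailer-related entries from the messages file at 'path'."""
    with open(path) as handle:
        for entry in mailer_events(handle):
            print(entry["timestamp"], entry["level"], entry["text"])
```

Calling print_mailer_events() on a node's execd messages file before and after submitting a test job with mail notification would show whether execd tried the mailer at all.
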

>
> I'm wondering if a mailer error in SGE is similar to a queue that  
> drops into state "E" in that it persists until "something" is done.  
> If SGE encounters a fatal mailer error in the past, will it stop  
> trying to send email for future jobs? Do I need to restart the  
> execd daemons on a compute node or something?

I would

stop & start postfix
stop & start execd

on one node and check what's happening. - Reuti

> If that is not the case I think my next step is going to involve  
> writing a custom mailer script that can do some verbose logging.  
> Does anyone have a simple mailer script or wrapper that I can use  
> for this purpose? All I want to do is drop a custom mailer script  
> in place that is capable of logging the fact that it has actually  
> been invoked...then it can just pass along its data to
> /usr/bin/mail as expected.
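
A wrapper of the kind asked for here could look roughly like the following. This is only a sketch, not something shipped with SGE: the log file path is an arbitrary choice, and it assumes the message body arrives on standard input and the arguments can be handed to /usr/bin/mail unchanged.

```python
#!/usr/bin/env python3
"""Logging wrapper around the real mailer (illustrative sketch only)."""
import datetime
import os
import subprocess
import sys

REAL_MAILER = "/usr/bin/mail"               # the mailer named in this thread
LOG_FILE = "/tmp/sge_mailer_wrapper.log"    # assumption: any writable path


def log(message, log_file=LOG_FILE):
    """Append one timestamped line; a logging failure must not block mail."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        with open(log_file, "a") as handle:
            handle.write(f"{stamp} pid={os.getpid()} {message}\n")
    except OSError:
        pass


def run_mailer(args, body, mailer=REAL_MAILER, log_file=LOG_FILE):
    """Log the call, forward args and body to the mailer, log its status."""
    log(f"invoked uid={os.getuid()} args={args!r} body_bytes={len(body)}",
        log_file)
    try:
        result = subprocess.run([mailer, *args], input=body)
    except OSError as exc:
        log(f"could not start {mailer}: {exc}", log_file)
        return 127
    log(f"{mailer} exited with status {result.returncode}", log_file)
    return result.returncode


# Act only when called with arguments, as a mailer normally is; without
# any, /usr/bin/mail would try to read a mailbox interactively.
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(run_mailer(sys.argv[1:], sys.stdin.buffer.read()))
```

To try it, make the script executable on one node and set the mailer entry in the SGE configuration to its path instead of /usr/bin/mail. If the log file stays empty after a job that requested mail, execd never invoked the mailer. Note that the log file must be writable by whichever user the mailer runs as.
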
>
> Regards,
> Chris
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

