[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Tue Nov 23 19:39:39 GMT 2004



> On Tue, 23 Nov 2004, David Mathog wrote:
> 
> > > Ah ... actually I meant the administrator abort mail.
> > > That one is more detailed than user mail.
> >
> > Near as I can tell, that WAS the administrator abort mail.

Only one message appears to be sent. Two observations:

1. With the configuration file set to

loglevel               log_info

the following appears in mendel's messages file:

Tue Nov 23 11:23:02 2004|execd|mendel|I|sending admin mail mail to user "mathog at mendel.bio.caltech.edu"|mailer "/bin/mailx"|"SGE 5.3p6: Job 4362 failed"
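
The messages-file entry above is pipe-delimited, so the mailer that execd
reports using can be pulled out mechanically rather than read by eye. A
minimal Python sketch, assuming only the field layout visible in that one
line (timestamp|daemon|host|level|message...); the helper name parse_mailer
is hypothetical and not part of SGE:

```python
import re

# Sample entry copied from mendel's messages file, re-joined onto one line.
LINE = ('Tue Nov 23 11:23:02 2004|execd|mendel|I|sending admin mail mail to user '
        '"mathog at mendel.bio.caltech.edu"|mailer "/bin/mailx"|'
        '"SGE 5.3p6: Job 4362 failed"')


def parse_mailer(line):
    """Return (daemon, host, mailer_path) for one messages-file line.

    Returns None when the line has too few fields or names no mailer.
    The field positions are an assumption based on the sample line only.
    """
    fields = line.split("|")
    if len(fields) < 5:
        return None
    match = re.search(r'mailer "([^"]+)"', line)
    if match is None:
        return None
    return fields[1], fields[2], match.group(1)


if __name__ == "__main__":
    print(parse_mailer(LINE))
```

Running the same function over the messages file on each execution host
would show which mailer path each execd reports.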

2. "/bin/mailx" does not exist on the Linux system. It does exist on the
Solaris system, though, and as far as I can tell the one email message
that goes out is sent from mendel. Why is it using mailx, though? The
mailer is defined below as /bin/mail.
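
One way to narrow this down is to compare the mailer named in the
configuration with the one reported in the log, and to check which of the
two actually exists on a given host. The following is a rough Python sketch,
assuming the whitespace-separated "key value" layout of the configuration
file quoted further down; parse_config and check_mailer are hypothetical
helpers, not SGE tools:

```python
import os

# A few entries copied from the configuration file quoted in this thread.
SAMPLE_CONFIG = """\
mailer                 /bin/mail
loglevel               log_warning
administrator_mail     root at saf.bio.caltech.edu
"""


def parse_config(text):
    """Parse 'key   value' lines into a dict; blank and '#' lines are skipped."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def check_mailer(conf, logged_mailer):
    """Compare the configured mailer with the one seen in the messages file."""
    configured = conf.get("mailer")
    return {
        "configured": configured,
        "logged": logged_mailer,
        "match": configured == logged_mailer,
        # Existence is checked on the host where this script runs.
        "configured_exists": bool(configured) and os.path.exists(configured),
        "logged_exists": os.path.exists(logged_mailer),
    }


if __name__ == "__main__":
    report = check_mailer(parse_config(SAMPLE_CONFIG), "/bin/mailx")
    for key, value in report.items():
        print(f"{key}: {value}")
```

A mismatch would suggest that the mailer value is coming from somewhere
other than this file, for example a host-specific configuration, though
that is an assumption and not something confirmed here.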

The message that is sent has none of the extra information you
cited.

Regards,


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


> >
> > For whatever it's worth, here is the config file:
> >
> > %cat /usr/SGE/default/common/configuration
> > conf_version           0
> > qmaster_spool_dir      /usr/SGE/default/spool/qmaster
> > execd_spool_dir        /usr/SGE/default/spool
> > binary_path            /usr/SGE/bin
> > mailer                 /bin/mail
> > xterm                  /usr/bin/X11/xterm
> > load_sensor            none
> > prolog                 none
> > epilog                 none
> > shell_start_mode       posix_compliant
> > login_shells           sh,ksh,csh,tcsh
> > min_uid                0
> > min_gid                0
> > user_lists             none
> > xuser_lists            none
> > load_report_time       00:00:40
> > stat_log_time          48:00:00
> > max_unheard            00:05:00
> > loglevel               log_warning
> > administrator_mail     root at saf.bio.caltech.edu
> > set_token_cmd          none
> > pag_cmd                none
> > token_extend_time      none
> > shepherd_cmd           none
> > qmaster_params         none
> > schedd_params          none
> > execd_params           none
> > finished_jobs          100
> > gid_range              20000-20100
> > admin_user             sgeadm
> > qlogin_command         telnet
> > qlogin_daemon          /usr/sbin/in.telnetd
> > rlogin_daemon          /usr/sbin/in.rlogind
> > default_domain         none
> > ignore_fqdn            true
> >
> > I tried upping the loglevel to log_info but it didn't reveal anything
> > extra and the administrator email was the same.
> >
> 
> As admin mail you've configured
> 
>    administrator_mail     root at saf.bio.caltech.edu
> 
> Are you sure you've seen the admin mail? When I receive admin mail it
> looks like this:
> 
>    Subject: SGE 6.0u2: Job 298 failed
> 
>    Job 298 caused action: All Queues on host "es-ergb01-01" set to ERROR
>     User        = ah114088
>     Queue       = test at es-ergb01-01
>     Host        = es-ergb01-01
>     Start Time  = <unknown>
>     End Time    = <unknown>
>    failed before prolog:11/23/2004 19:08:00 [115088:14820]: unable to find prolog file "/false_path"
>    Shepherd trace:
>    11/23/2004 19:08:00 [115088:14818]: shepherd called with uid = 0, euid = 115088
>    11/23/2004 19:08:00 [115088:14818]: starting up 6.0u2
>    11/23/2004 19:08:00 [115088:14818]: setpgid(14818, 14818) returned 0
>    11/23/2004 19:08:00 [115088:14818]: forked "prolog" with pid 14820
>       :
> 
> Though I have not tried admin mail with a 5.3p? system, I'm also not
> aware of any admin mail problems in 5.3. Note that the sge_shepherd
> trace file you see here is the information we need.
> 
> >
> > Job 4352 caused action: All Queues on host "mendel" set to ERROR
> >  User        = safrun
> >  Queue       = testm
> >  Host        = mendel
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> > failed before prolog:shepherd exited with exit status 7
> > Shepherd pe_hostfile:
> > mendel 1 testm UNDEFINED
> >
> > What does the line "Shepherd pe_hostfile" indicate?
> 
> It tells the user that your job got one slot in the testm queue on host
> mendel.
> 
> Andreas
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
> 





