[GE users] one node keeps going into error state

Andreas Haas Andreas.Haas at Sun.COM
Wed Nov 24 09:22:27 GMT 2004


You seem to have no local configuration for host mendel

   # qconf -sconf mendel

Architecture-specific settings such as mailer are set if the
execution daemons are installed via $SGE_ROOT/install_execd
and the proposed defaults are used. Try specifying the following for mendel

   mailer                       /bin/mail
   xterm                        /usr/bin/X11/xterm
   qlogin_daemon                /usr/sbin/in.telnetd
   rlogin_daemon                /usr/sbin/in.rlogind

using

   # qconf -mconf mendel
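
To see where the two mailer values diverge, the configured mailer can be compared with the one the execd actually logged. The following is only a minimal sketch: the function names are made up for illustration, it assumes the two-column "name value" layout of the configuration shown further down in this thread, and it assumes log entries of the form quoted from mendel's message file.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Print the value of the "mailer" line from configuration text on stdin
# (two-column "name value" layout, as in the configuration file quoted
# in this thread).
configured_mailer() {
    awk '$1 == "mailer" { print $2; exit }'
}

# Print the mailer path from the last log entry on stdin that contains
#   mailer "<path>"
# as in the "sending admin mail" line from the execd message file.
logged_mailer() {
    sed -n 's/.*mailer "\([^"]*\)".*/\1/p' | tail -n 1
}

# Report whether the given path is an executable file on this host.
check_mailer() {
    if [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}
```

On the execution host one might then run something like `qconf -sconf mendel | configured_mailer` and `logged_mailer < <execd message file>`, and pass each result to `check_mailer`. The location of the message file depends on execd_spool_dir and is not assumed here.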

Andreas

On Tue, 23 Nov 2004, David Mathog wrote:

>
>
> > On Tue, 23 Nov 2004, David Mathog wrote:
> >
> > > > Ah ... actually I meant the administrator abort mail.
> > > > That one is more detailed than user mail.
> > >
> > > Near as I can tell, that WAS the administrator abort mail.
>
> There seems to be only one message being sent:
>
> 1.  configuration file set up to:
>
> loglevel               log_info
>
> When it is, this appears in mendel's message file:
>
> Tue Nov 23 11:23:02 2004|execd|mendel|I|sending admin mail mail to user
> "mathog at mendel.bio.caltech.edu"|mailer "/bin/mailx"|"SGE 5.3p6: Job 4362
> failed"
>
> 2. "/bin/mailx" does not exist on the Linux
> system.  It does on the Solaris system though, and as near as I can
> tell the one email message that's going out is sent from mendel.
> Why is it using mailx though?  The mailer is defined below as
> /bin/mail.
>
> The message that is sent has none of the extra information you
> cited.
>
> Regards,
>
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
>
> > >
> > > For whatever it's worth, here is the config file:
> > >
> > > %cat /usr/SGE/default/common/configuration
> > > conf_version           0
> > > qmaster_spool_dir      /usr/SGE/default/spool/qmaster
> > > execd_spool_dir        /usr/SGE/default/spool
> > > binary_path            /usr/SGE/bin
> > > mailer                 /bin/mail
> > > xterm                  /usr/bin/X11/xterm
> > > load_sensor            none
> > > prolog                 none
> > > epilog                 none
> > > shell_start_mode       posix_compliant
> > > login_shells           sh,ksh,csh,tcsh
> > > min_uid                0
> > > min_gid                0
> > > user_lists             none
> > > xuser_lists            none
> > > load_report_time       00:00:40
> > > stat_log_time          48:00:00
> > > max_unheard            00:05:00
> > > loglevel               log_warning
> > > administrator_mail     root at saf.bio.caltech.edu
> > > set_token_cmd          none
> > > pag_cmd                none
> > > token_extend_time      none
> > > shepherd_cmd           none
> > > qmaster_params         none
> > > schedd_params          none
> > > execd_params           none
> > > finished_jobs          100
> > > gid_range              20000-20100
> > > admin_user             sgeadm
> > > qlogin_command         telnet
> > > qlogin_daemon          /usr/sbin/in.telnetd
> > > rlogin_daemon          /usr/sbin/in.rlogind
> > > default_domain         none
> > > ignore_fqdn            true
> > >
> > > I tried upping the loglevel to log_info but it didn't reveal anything
> > > extra and the administrator email was the same.
> > >
> >
> > As admin mail you've configured
> >
> >    administrator_mail     root at saf.bio.caltech.edu
> >
> > Are you sure you've seen the admin mail? When I receive admin mail it
> > looks like this:
> >
> >    Subject: SGE 6.0u2: Job 298 failed
> >
> >    Job 298 caused action: All Queues on host "es-ergb01-01" set to ERROR
> >     User        = ah114088
> >     Queue       = test at es-ergb01-01
> >     Host        = es-ergb01-01
> >     Start Time  = <unknown>
> >     End Time    = <unknown>
> >    failed before prolog:11/23/2004 19:08:00 [115088:14820]: unable to find prolog file "/false_path"
> >    Shepherd trace:
> >    11/23/2004 19:08:00 [115088:14818]: shepherd called with uid = 0, euid = 115088
> >    11/23/2004 19:08:00 [115088:14818]: starting up 6.0u2
> >    11/23/2004 19:08:00 [115088:14818]: setpgid(14818, 14818) returned 0
> >    11/23/2004 19:08:00 [115088:14818]: forked "prolog" with pid 14820
> >       :
> >
> > I have not tried admin mail with a 5.3p? system, but I'm also not
> > aware of any admin mail problems in 5.3. Note that the trace file from
> > sge_shepherd that you see here is the information we need.
> >
> > >
> > > Job 4352 caused action: All Queues on host "mendel" set to ERROR
> > >  User        = safrun
> > >  Queue       = testm
> > >  Host        = mendel
> > >  Start Time  = <unknown>
> > >  End Time    = <unknown>
> > > failed before prolog:shepherd exited with exit status 7
> > > Shepherd pe_hostfile:
> > > mendel 1 testm UNDEFINED
> > >
> > > What does the line "Shepherd pe_hostfile" indicate?
> >
> > It tells the user that the job got one slot in queue testm on host
> > mendel.
> >
> > Andreas
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
> >
>