[GE users] one node keeps going into error state

Andreas Haas Andreas.Haas at Sun.COM
Tue Nov 23 18:16:44 GMT 2004


On Tue, 23 Nov 2004, David Mathog wrote:

> > Ah ... actually I meant the administrator abort mail.
> > That one is more detailed than user mail.
>
> Near as I can tell, that WAS the administrator abort mail.
>
> For whatever it's worth, here is the config file:
>
> %cat /usr/SGE/default/common/configuration
> conf_version           0
> qmaster_spool_dir      /usr/SGE/default/spool/qmaster
> execd_spool_dir        /usr/SGE/default/spool
> binary_path            /usr/SGE/bin
> mailer                 /bin/mail
> xterm                  /usr/bin/X11/xterm
> load_sensor            none
> prolog                 none
> epilog                 none
> shell_start_mode       posix_compliant
> login_shells           sh,ksh,csh,tcsh
> min_uid                0
> min_gid                0
> user_lists             none
> xuser_lists            none
> load_report_time       00:00:40
> stat_log_time          48:00:00
> max_unheard            00:05:00
> loglevel               log_warning
> administrator_mail     root at saf.bio.caltech.edu
> set_token_cmd          none
> pag_cmd                none
> token_extend_time      none
> shepherd_cmd           none
> qmaster_params         none
> schedd_params          none
> execd_params           none
> finished_jobs          100
> gid_range              20000-20100
> admin_user             sgeadm
> qlogin_command         telnet
> qlogin_daemon          /usr/sbin/in.telnetd
> rlogin_daemon          /usr/sbin/in.rlogind
> default_domain         none
> ignore_fqdn            true
>
> I tried upping the loglevel to log_info but it didn't reveal anything
> extra and the administrator email was the same.
>

You've configured the admin mail address as

   administrator_mail     root at saf.bio.caltech.edu

Are you sure you've seen the admin mail? When I receive admin mail, it
looks like this:

   Subject: SGE 6.0u2: Job 298 failed

   Job 298 caused action: All Queues on host "es-ergb01-01" set to ERROR
    User        = ah114088
    Queue       = test at es-ergb01-01
    Host        = es-ergb01-01
    Start Time  = <unknown>
    End Time    = <unknown>
   failed before prolog:11/23/2004 19:08:00 [115088:14820]: unable to find prolog file "/false_path"
   Shepherd trace:
   11/23/2004 19:08:00 [115088:14818]: shepherd called with uid = 0, euid = 115088
   11/23/2004 19:08:00 [115088:14818]: starting up 6.0u2
   11/23/2004 19:08:00 [115088:14818]: setpgid(14818, 14818) returned 0
   11/23/2004 19:08:00 [115088:14818]: forked "prolog" with pid 14820
      :

I have not tried admin mail with a 5.3p? system, but I'm also not aware
of any admin mail problems in 5.3. Note that the sge_shepherd trace file
shown here is the information we need.
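
If you want to pick such a trace apart programmatically, here is a minimal
sketch in Python. It assumes every trace line has the shape seen in the sample
above, that is "MM/DD/YYYY HH:MM:SS [uid:pid]: message". Reading the bracketed
pair as uid and pid is inferred from the sample only, and the name
parse_trace_line is made up for illustration; it is not part of Grid Engine.

```python
import re
from datetime import datetime

# Line shape inferred from the sample trace: date, time, [uid:pid]: message
TRACE_LINE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<uid>\d+):(?P<pid>\d+)\]: (?P<message>.*)$"
)


def parse_trace_line(line):
    """Return a dict for one trace line, or None if the line does not match."""
    match = TRACE_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match.group("stamp"), "%m/%d/%Y %H:%M:%S"),
        "uid": int(match.group("uid")),  # assumed meaning of the first number
        "pid": int(match.group("pid")),  # assumed meaning of the second number
        "message": match.group("message"),
    }


sample = '11/23/2004 19:08:00 [115088:14818]: forked "prolog" with pid 14820'
entry = parse_trace_line(sample)
print(entry["pid"], entry["message"])
```

Lines that do not fit the pattern, such as the elision marker in the sample,
come back as None so they can be skipped.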

>
> Job 4352 caused action: All Queues on host "mendel" set to ERROR
>  User        = safrun
>  Queue       = testm
>  Host        = mendel
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before prolog:shepherd exited with exit status 7
> Shepherd pe_hostfile:
> mendel 1 testm UNDEFINED
>
> what does the line "Shepherd pe_hostfile" indicate???

It tells the user that the job got one slot in the testm queue on host
mendel.
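
To make the column layout explicit, here is a small Python sketch that splits
a pe_hostfile line into its fields. The first three columns (host, slots,
queue) follow the explanation above. Treating the fourth column as a processor
range is my understanding of the format and should be taken as an assumption;
in your mail it is simply UNDEFINED. The names HostfileEntry and
parse_pe_hostfile are made up for illustration.

```python
from typing import NamedTuple


class HostfileEntry(NamedTuple):
    host: str
    slots: int
    queue: str
    processor_range: str  # assumed meaning of the fourth column


def parse_pe_hostfile(text):
    """Parse pe_hostfile content into a list of HostfileEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        host, slots, queue, processor_range = fields[:4]
        entries.append(HostfileEntry(host, int(slots), queue, processor_range))
    return entries


entries = parse_pe_hostfile("mendel 1 testm UNDEFINED\n")
for e in entries:
    print(f"{e.slots} slot(s) in queue {e.queue} on host {e.host}")
```

For a parallel job spanning several hosts, the file would have one such line
per host.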

Andreas
