[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Tue Nov 23 17:43:45 GMT 2004




> Which userid are the daemons running under? If you log in as that user, can
> you access the execd's spool directory?

On the compute nodes, including mendel (the node whose queues
keep going into error state), the daemons run as:

user    daemon
sgeadm  sge_execd
root    sge_commd

The script in question runs OK when started by hand as sgeadm:

% su - sgeadm
% ./showinfo.sh

Another detail I forgot to post previously: the sge_commd
invocation and the host name setup on mendel.

% # on mendel
% /usr/SGE/bin/solaris64/sge_commd \
  -a /usr/SGE/default/common/host_aliases
% cat /usr/SGE/default/common/host_aliases
safserver safserver.cluster
mendel    mendel.cluster 
% grep mendel.cluster /etc/hosts
192.168.1.230   mendel.cluster
% grep safserver.cluster /etc/hosts
192.168.1.220   safserver.cluster       safserver
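
For anyone repeating this check on another node, here is a minimal sketch that compares every name in a host_aliases file against a hosts file, instead of grepping one name at a time. The function name check_aliases is made up for illustration, and the default paths are simply the ones quoted above. It only looks at the flat hosts file, so a name it reports as missing may still resolve through NIS or DNS.

```shell
#!/bin/sh
# Sketch: report which names in an SGE host_aliases file have no entry
# in a hosts file. check_aliases is a hypothetical helper name; the
# default paths are the ones quoted in this message.
check_aliases() {
    aliases=${1:-/usr/SGE/default/common/host_aliases}
    hosts=${2:-/etc/hosts}
    # Split each alias line into individual names, one per line.
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$aliases" |
    while read -r name; do
        # Drop comments, then look for the name as a whole field
        # after the address column of the hosts file.
        if awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts"
        then
            echo "ok      $name"
        else
            echo "missing $name"
        fi
    done
}
```

Run as `check_aliases` on the node in question, or pass both file names explicitly to test copies of the files.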

sge_commd seems to be OK, since qsub etc. run fine
on mendel.  The problem appears to be with sge_shepherd.
What does exit status 7 from sge_shepherd mean?
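
To pick the relevant entries out of a long messages file, a small filter along these lines may help. It is only a sketch: it assumes the pipe-delimited layout seen in the log lines quoted below (timestamp|daemon|host|level|message, with level "E" for errors), and the function name sge_errors is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the timestamp and text of every error-level entry in an
# SGE messages file. Assumes the field order seen in the quoted logs:
# timestamp|daemon|host|level|message. sge_errors is a hypothetical name.
sge_errors() {
    awk -F'|' '$4 == "E" {
        # Rejoin the message in case it contains "|" itself.
        msg = $5
        for (i = 6; i <= NF; i++) msg = msg "|" $i
        print $1 ": " msg
    }' "$1"
}
```

For example, `sge_errors /usr/SGE/default/spool/mendel/messages | grep 4343` would narrow the execd log down to the errors for the failing job.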


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> Rayson
> 
> 
> >> Hi David,
> >> 
> >> Anything informative in the spool log files?
> >
> >Yes, right after the client does this:
> >
> > qsub -q testm showinfo.sh
> >
> >these lines appear:
> >
> >> 
> >> /usr/SGE/default/spool/qmaster/messages
> >Tue Nov 23 07:57:52 2004|qmaster|safserver|W|job 4343.1 failed on host
> >mendel general  before prolog because: shepherd exited with exit status 7
> >Tue Nov 23 07:57:52 2004|qmaster|safserver|W|rescheduling job 4343.1
> >Tue Nov 23 07:57:52 2004|qmaster|safserver|E|queue testm marked QERROR
> >as result of job 4343's failure
> >Tue Nov 23 07:57:52 2004|qmaster|safserver|E|queue testm marked QERROR
> >as result of job 4343's failure at host mendel
> >
> >>/usr/SGE/default/spool/qmaster/schedd/messages
> >
> >nothing useful here, just start up and shut down messages
> >
> >> 
> >> And especially:
> >> 
> >> /usr/SGE/default/spool/mendel/messages
> >
> >
> >Tue Nov 23 07:57:52 2004|execd|mendel|E|shepherd of job 4343.1 exited
> >with exit status = 7
> >Tue Nov 23 07:57:52 2004|execd|mendel|E|reaping job "4343" ptf
> >complains: Job does not exist
> >Tue Nov 23 07:57:52 2004|execd|mendel|E|"abnormal termination of
> >shepherd for job 4343.1: no "exit_status" file"
> >Tue Nov 23 07:57:52 2004|execd|mendel|E|cant open file
> >active_jobs/4343.1/error: No such file or directory
> >Tue Nov 23 07:57:52 2004|execd|mendel|E|can't open pid file
> >"active_jobs/4343.1/pid" for job 4343.1
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



