[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Tue Nov 23 16:01:21 GMT 2004



> Hi David,
> 
> Anything informative in the spool log files?

Yes, right after the client does this:

 qsub -q testm showinfo.sh

these lines appear:

> 
> /usr/SGE/default/spool/qmaster/messages
Tue Nov 23 07:57:52 2004|qmaster|safserver|W|job 4343.1 failed on host
mendel general  before prolog because: shepherd exited with exit status 7
Tue Nov 23 07:57:52 2004|qmaster|safserver|W|rescheduling job 4343.1
Tue Nov 23 07:57:52 2004|qmaster|safserver|E|queue testm marked QERROR
as result of job 4343's failure
Tue Nov 23 07:57:52 2004|qmaster|safserver|E|queue testm marked QERROR
as result of job 4343's failure at host mendel

>/usr/SGE/default/spool/qmaster/schedd/messages

Nothing useful here, just startup and shutdown messages.

> 
> And especially:
> 
> /usr/SGE/default/spool/mendel/messages


Tue Nov 23 07:57:52 2004|execd|mendel|E|shepherd of job 4343.1 exited
with exit status = 7
Tue Nov 23 07:57:52 2004|execd|mendel|E|reaping job "4343" ptf
complains: Job does not exist
Tue Nov 23 07:57:52 2004|execd|mendel|E|"abnormal termination of
shepherd for job 4343.1: no "exit_status" file"
Tue Nov 23 07:57:52 2004|execd|mendel|E|cant open file
active_jobs/4343.1/error: No such file or directory
Tue Nov 23 07:57:52 2004|execd|mendel|E|can't open pid file
"active_jobs/4343.1/pid" for job 4343.1

Which is Greek to me.  WHY does the shepherd terminate abnormally?
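
For anyone digging through the same files, here is a minimal sketch of a
filter that pulls only the error-level lines for one job out of the
messages files. It assumes each entry sits on a single line in the file
(the excerpts above were wrapped by the mailer) and has five
'|'-separated fields: time, daemon, host, severity, message. The
function name job_errors is made up for illustration; it is not part
of SGE.

```shell
#!/bin/sh
# job_errors JOBID [FILE...]   (hypothetical helper, not an SGE command)
# Print time, daemon, host and message for every error-level ("E")
# entry whose message text mentions JOBID. Reads stdin if no FILE is given.
# Assumption: five '|'-separated fields per line, as in the excerpts above;
# a message that itself contains '|' would be cut at that point.
job_errors() {
    jobid=$1
    shift
    awk -F'|' -v job="$jobid" '
        # field 4 is the severity letter, field 5 the message text
        $4 == "E" && index($5, job) {
            print $1 " [" $2 "@" $3 "] " $5
        }
    ' "$@"
}
```

It could be called with the job number followed by the qmaster and execd
messages files quoted above. The match on the job number is a plain
substring test, so 4343 would also match 14343; that seemed acceptable
for a quick look at a small spool directory.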

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> -Chris
> 
> 
> 
> David Mathog wrote:
> 
> > SGE 5.3.
> > 
> > Master node of beowulf cluster is "safserver", has front end
> > name "safserver.bio.caltech.edu" and back end network of
> > "safserver.cluster".  There is also "mendel" with a similar name
> > arrangement.  Finally there are nodes only in the private network
> > (monkey01.cluster, etc.)
> > 
> > A few months ago I upgraded the master from RH 7.3 to Mandrake 10.0.
> > At that time SGE had big problems getting node mendel to work.
> > It was apparently confused about the path.  Starting safserver
> > and mendel with:
> > 
> >   sge_commd -a /usr/SGE/default/common/host_aliases
> > 
> > fixed things.
> > 
> > But only for a while.  Today there was an unscheduled shutdown
> > (fan failure) and now SGE keeps throwing mendel into the
> > "alarm" state.  It sends this email out:
> > 
> > Job 4338 caused action: All Queues on host "mendel" set to ERROR
> > User        = safrun
> > Queue       = testm
> > Host        = mendel
> > Start Time  = <unknown>
> > End Time    = <unknown>
> > failed before prolog:shepherd exited with exit status 7
> > Shepherd pe_hostfile:
> > mendel 1 testm UNDEFINED
> > 
> > Tried upgrading to 5.3p6 and it still did this.
> > Deleted all queues on mendel, deleted that node. Shut it all
> > down (safserver and mendel), started it back up, added
> > the node back, added a queue, qsub.
> > 
> > Now a job sticks at "running" forever in the one queue on mendel.
> > The user submitting this job:
> > 
> > qsub -q testm showinfo.sh
> > 
> > Sees no files created.
> > 
> > % cat showinfo.sh
> > #!/bin/sh
> > echo node `hostname` at `date`
> > 
> > I'm at a loss.   How do I coerce SGE into telling me WHAT
> > the alarm is?  Failing that, what size nuke is required to
> > evaporate all the state information everywhere so that this
> > stuck bit/error/whatever it is will go away?
> > 
> > Note, jobs can be submitted from mendel and run on other nodes
> > without any problem.
> > 
> > Thanks,
> > 
> > David Mathog
> > mathog at caltech.edu
> > Manager, Sequence Analysis Facility, Biology Division, Caltech
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> -- 
> Chris Dagdigian, <dag at sonsorol.org>
> BioTeam  - Independent life science IT & informatics consulting
> Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
> PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net
> 





