[GE users] one node keeps going into error state

Chris Dagdigian dag at sonsorol.org
Mon Nov 22 23:21:41 GMT 2004


Hi David,

Anything informative in the spool log files?

/usr/SGE/default/spool/qmaster/messages
/usr/SGE/default/spool/qmaster/schedd/messages

And especially:

/usr/SGE/default/spool/mendel/messages
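
A quick way to look at all three at once is sketched below. It is a minimal example, not an SGE tool: the function name show_sge_logs and the SGE_SPOOL variable are made up for illustration, and the spool root is assumed to be the one in the paths above. The execd log for mendel may only be readable on mendel itself if its spool directory is local.

```shell
#!/bin/sh
# show_sge_logs N FILE...
# Hypothetical helper: print the last N lines of each messages file,
# and say so when a file is missing or unreadable on this host.
show_sge_logs() {
    n=$1; shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "==> $f <=="
            tail -n "$n" "$f"
        else
            echo "==> $f (not readable) <=="
        fi
    done
}

# Assumed spool root, taken from the paths listed above; override if yours differs.
SGE_SPOOL=${SGE_SPOOL:-/usr/SGE/default/spool}

show_sge_logs 20 \
    "$SGE_SPOOL/qmaster/messages" \
    "$SGE_SPOOL/qmaster/schedd/messages" \
    "$SGE_SPOOL/mendel/messages"
```

Running it right after a failed job should put the relevant entries near the end of the output.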

-Chris



David Mathog wrote:

> SGE 5.3.
> 
> Master node of beowulf cluster is "safserver", has front end
> name "safserver.bio.caltech.edu" and back end network of
> "safserver.cluster".  There is also "mendel" with a similar name
> arrangement.  Finally there are nodes only in the private network
> (monkey01.cluster, etc.)
> 
> A few months ago I upgraded the master from RH 7.3 to Mandrake 10.0.
> At that time SGE had big problems getting node mendel to work.
> It was apparently confused about the path.  Starting safserver
> and mendel with:
> 
>   sge_commd -a /usr/SGE/default/common/host_aliases
> 
> fixed things.
> 
> But only for a while.  Today there was an unscheduled shutdown
> (fan failure) and now SGE keeps throwing mendel into the
> "alarm" state.  It sends this email out:
> 
> Job 4338 caused action: All Queues on host "mendel" set to ERROR
> User        = safrun
> Queue       = testm
> Host        = mendel
> Start Time  = <unknown>
> End Time    = <unknown>
> failed before prolog:shepherd exited with exit status 7
> Shepherd pe_hostfile:
> mendel 1 testm UNDEFINED
> 
> Tried upgrading to 5.3p6 and it still did this.
> Deleted all queues on mendel, then deleted that node.  Shut everything
> down (safserver and mendel), started it back up, added
> the node back, added a queue, and ran qsub.
> 
> Now a job sticks at "running" forever in the one queue on mendel.
> The user submitting this job:
> 
> qsub -q testm showinfo.sh
> 
> Sees no files created.
> 
> % cat showinfo.sh
> #!/bin/sh
> echo node `hostname` at `date`
> 
> I'm at a loss.  How do I coerce SGE into telling me WHAT
> the alarm is?  Failing that, what size nuke is required to
> evaporate all the state information everywhere so that this
> stuck bit/error/whatever it is will go away?
> 
> Note, jobs can be submitted from mendel and run on other nodes
> without any problem.
> 
> Thanks,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Chris Dagdigian, <dag at sonsorol.org>
BioTeam  - Independent life science IT & informatics consulting
Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net
