[GE users] one node keeps going into error state

David Mathog mathog at mendel.bio.caltech.edu
Mon Nov 22 23:07:15 GMT 2004

SGE 5.3.

The master node of the Beowulf cluster is "safserver". Its front-end
name is "safserver.bio.caltech.edu" and its name on the back-end
network is "safserver.cluster".  There is also "mendel", with a similar
name arrangement.  Finally, there are nodes that exist only in the
private network (monkey01.cluster, etc.).

A few months ago I upgraded the master from RH 7.3 to Mandrake 10.0.
At that time SGE had big problems getting node mendel to work.
It was apparently confused about the path.  Starting safserver
and mendel with:

  sge_commd -a /usr/SGE/default/common/host_aliases

fixed things.
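
For readers unfamiliar with that file: as far as I understand the host_aliases format, each line names one host, with the primary name first and its aliases after it, separated by whitespace. The sketch below is only an illustration of that mapping, not anything SGE ships. The function name primary_for and the sample file contents are hypothetical, and "mendel.cluster" is assumed from the "similar name arrangement" described above.

```shell
#!/bin/sh
# Hypothetical helper: print the primary name that a host_aliases-style
# file maps a given host name to. Assumed format: one host per line,
# primary name first, aliases after it, whitespace separated.
primary_for() {
    awk -v h="$1" '
        $1 ~ /^#/ { next }                     # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == h) { print $1; found = 1; exit }
        }
        END { if (!found) print h }            # unknown names map to themselves
    ' "$2"
}

# Sample file built from the names in this post (contents are an assumption).
aliases=$(mktemp)
cat > "$aliases" <<'EOF'
safserver safserver.bio.caltech.edu safserver.cluster
mendel mendel.bio.caltech.edu mendel.cluster
EOF

primary_for mendel.bio.caltech.edu "$aliases"   # prints: mendel
primary_for safserver.cluster "$aliases"        # prints: safserver
primary_for monkey01.cluster "$aliases"         # prints: monkey01.cluster

rm -f "$aliases"
```

If both the front-end and back-end names of a host resolve to the same primary name here, the file at least says what it was meant to say; whether every daemon actually reads it is a separate question.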

But only for a while.  Today there was an unscheduled shutdown
(fan failure) and now SGE keeps throwing mendel into the
"alarm" state.  It sends this email out:

Job 4338 caused action: All Queues on host "mendel" set to ERROR
User        = safrun
Queue       = testm
Host        = mendel
Start Time  = <unknown>
End Time    = <unknown>
failed before prolog:shepherd exited with exit status 7
Shepherd pe_hostfile:
mendel 1 testm UNDEFINED

I tried upgrading to 5.3p6 and it still did this.
I then deleted all queues on mendel and deleted that node, shut
everything down (safserver and mendel), started it back up, added
the node back, added a queue, and ran qsub.

Now a job sticks at "running" forever in the one queue on mendel.
The user who submits this job:

qsub -q testm showinfo.sh

sees no files created.

% cat showinfo.sh
echo node `hostname` at `date`

I'm at a loss.   How do I coerce SGE into telling me WHAT
the alarm is?  Failing that, what size nuke is required to
evaporate all the state information everywhere so that this
stuck bit/error/whatever it is will go away?

Note, jobs can be submitted from mendel and run on other nodes
without any problem.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
