[GE users] Diagnosis on a queue
landman at scalableinformatics.com
Fri May 15 14:09:35 BST 2009
On Fri, May 15, 2009 at 8:51 AM, mad <margaret_Doll at brown.edu> wrote:
> > Why would queues not distribute the jobs evenly? Or how should I go
> > about finding the problem?
> > Why does one compute node in particular have problems with the queues?
For a particular job with ID job_id,
qstat -j job_id
should give you some details on the job's state. So if you have
jobs that put the queue into an error state, look at those jobs using
> > I have various groups of users on a ROCKS system where we are using
> > Grid Engine 6.1u4 to manage the queues. I have different queues for
> > each of the groups. I am having problems only with one of the compute
> > nodes on the chemistry and het queues.
If you are using Rocks, chances are you are using automount rather
than hard mounts. We typically move our customers to hard mounts, as
we have had many problems with automount delays triggering queue error
states. It's generally easy to do in Rocks, as you have one of a few
different sources of home directories.
FWIW: this one aspect has been our largest headache with the
Rocks/GridEngine interaction. A slow automount, for any reason, tosses
the queue into an error state.
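One rough way to see whether automount delay is a factor on the problem node is to time the first access to a user's home directory there. This is only a sketch: the function name first_access_seconds is made up for illustration, it measures whole seconds only, and it assumes the directory is not already mounted when you run it.

```shell
# Hypothetical helper: print how many whole seconds the first access
# to a directory takes. On an automounted path, a large value suggests
# the mount itself is slow.
first_access_seconds() {
    dir=$1
    start=$(date +%s)
    ls "$dir" >/dev/null 2>&1    # triggers the automount if one is configured
    end=$(date +%s)
    echo $((end - start))
}

# Example (run on the compute node; the path is a placeholder):
#   first_access_seconds /home/someuser
```

A result of 0 or 1 on the healthy nodes and several seconds on the node in question would point towards the mount rather than the queue configuration.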
You can manually clear the error using
qmod -cq het@compute-0-33.local
qmod -cq chemistry@compute-0-33.local
Try a hard mount first, and see if this fixes these two nodes.
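To check which queue instances are still in the error state after a change, something like the following may help. It is a sketch under assumptions: the function name error_queues is invented here, and it assumes the usual "qstat -f" layout where the queue instance (queue@host) is the first column and the states column is the sixth. Running "qstat -f -explain E" should also show the reason recorded for each error state.

```shell
# Hypothetical helper: read "qstat -f" output on stdin and print the
# queue instances whose states column contains E (error).
# Assumes: column 1 is queue@host, column 6 is the states field.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Example usage on the head node:
#   qstat -f | error_queues
# and, once the cause is fixed, clear each one that is listed:
#   qstat -f | error_queues | while read -r qi; do qmod -cq "$qi"; done
```

Clearing the state without fixing the underlying mount problem will likely just see the queue drop back into error on the next affected job.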
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615