[GE users] qstat Eqw seems to be related to NFS slowness...

mbay2002 jeff at haferman.com
Fri Aug 21 21:28:01 BST 2009


We've got a cluster where /home is a ZFS filesystem, and the rest of our
filesystems are Lustre.

What I've been noticing is that when users submit array jobs through
qsub (it seems they have to have 200 or more instances for this to
occur), some of the jobs error out (qstat shows "Eqw").

When I inspect the error, it shows that /home does not exist on some of
the nodes.  /home is automounted, so it doesn't appear until a user
connects to a node.  What I've found is that if I clear the error
(qmod -cj <job-number>), the jobs usually start and complete.
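
The manual clearing step described above could be scripted roughly as
follows. This is only a sketch: the helper name eqw_job_ids is made up for
illustration, and it assumes the default qstat listing, where the job ID is
the first column and the state is the fifth (so job names must not contain
spaces).

```shell
#!/bin/sh
# Hypothetical helper: print the unique job IDs whose state column is "Eqw".
# Reads default-format qstat output on stdin; header and separator lines are
# skipped because their fifth field is never exactly "Eqw".
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }' | sort -un
}

# Possible usage on a submit host (not run here):
#   qstat -u '*' | eqw_job_ids | while read -r id; do qmod -cj "$id"; done
```

Clearing the error only retries the jobs; it does not address why /home was
missing on the node in the first place.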

So, what I'm thinking is that when an array job with roughly 200 (or more)
tasks is submitted, NFS / automount can't simultaneously mount /home/<user>
across all the nodes.  MOST of the tasks work, but a few error out.  Perhaps
it is more appropriate to post this to a ZFS forum, but has anyone else
seen this behavior, and if so, is there a fix?

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213530
