[GE users] qstat Eqw seems to be related to NFS slowness...

jpierce jonathan.pierce at loni.ucla.edu
Fri Aug 21 22:38:27 BST 2009


I can confirm this exact behavior. We tried a number of automount options, 
but ultimately had to switch back to persistent NFS mounts.
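
By "persistent" I mean a static /etc/fstab entry instead of an autofs map. A
minimal sketch of such an entry (the server name and export path are
placeholders, and the mount options are only illustrative, not a recommendation):

```
# /etc/fstab -- hypothetical server/export; options are illustrative only
zfshost:/export/home   /home   nfs   hard,intr,bg,rsize=32768,wsize=32768   0 0
```

With a static mount there is no on-demand mount storm when a few hundred array
tasks land at once, which is why it sidesteps the problem described below.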

On 8/21/09 1:28 PM, mbay2002 wrote:
> We've got a cluster where /home is a ZFS filesystem, and the rest of our
> filesystems are Lustre.
>
> What I've been noticing is that when users submit array jobs through
> qsub (it seems they have to have 200 or more instances for this to
> occur), some of the jobs error out (qstat shows "Eqw").
>
> When I inspect the error, it shows that /home does not exist on some of
> the nodes.  /home is automounted, so it doesn't appear until a user
> connects to a node.  What I've found is that if I clear the error
> (qmod -cj <job-number>) the jobs usually take off and complete.
>
> So, what I'm thinking is that when roughly 200 (or more) array jobs are
> submitted, NFS / automount can't simultaneously mount /home/<user>
> across all the nodes.  MOST of them work, but a few error out.  Perhaps
> it is more appropriate to post this to a ZFS forum, but, has anyone else
> seen this behavior, and if so, is there a fix?
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213530
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
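
The clearing step described above can be scripted rather than done one job at a
time. A minimal sketch, assuming the default qstat column layout (job-ID in the
first field, state in the fifth); adjust the field numbers if your cell prints
extra columns:

```shell
# Print the IDs of jobs stuck in the Eqw (error) state, so they can be
# cleared with `qmod -cj`.  The first-field check skips qstat's header
# and separator lines, which have no numeric job ID.
list_eqw() {
  awk '$1 ~ /^[0-9]+$/ && $5 == "Eqw" {print $1}'
}

# Typical use against a live Grid Engine cell (not run here):
#   qstat -u '*' | list_eqw | xargs -r -n 1 qmod -cj
```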

-- 
Jonathan Pierce
Systems Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South,
Suite 225 Los Angeles, CA 90095-7334
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonathan.pierce at loni.ucla.edu

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213542



