[GE users] qstat Eqw seems to be related to NFS slowness...

mbay2002 jeff at haferman.com
Mon Aug 24 23:16:46 BST 2009



This is CentOS 5.2, though according to the other messages in the thread
this seems to be a known issue.  A rough sketch of the clear-and-retry
workaround is below the quoted message.


tmacmd wrote:
> What operating system and version is being used?
> 
> --tmac
> 
> RedHat Certified Engineer #804006984323821 (RHEL4)
> RedHat Certified Engineer #805007643429572 (RHEL5)
> 
> Principal Consultant
> 
> 
> 
> 
> On Fri, Aug 21, 2009 at 4:28 PM, mbay2002<jeff at haferman.com> wrote:
>> We've got a cluster where /home is a ZFS filesystem, and the rest of our
>> filesystems are Lustre.
>>
>> What I've been noticing is that when users submit array jobs through
>> qsub (it seems they have to have 200 or more instances for this to
>> occur), some of the jobs error out (qstat shows "Eqw").
>>
>> When I inspect the error, it shows that /home does not exist on some of
>> the nodes.  /home is automounted, so it doesn't appear until a user
>> connects to a node.  What I've found is that if I clear the error
>> (qmod -cj <job-number>) the jobs usually take off and complete.
>>
>> So, what I'm thinking is that when roughly 200 (or more) array jobs are
>> submitted, NFS / automount can't simultaneously mount /home/<user>
>> across all the nodes.  MOST of them work, but a few error out.  Perhaps
>> it is more appropriate to post this to a ZFS forum, but has anyone else
>> seen this behavior, and if so, is there a fix?
>>
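
In case it helps anyone else hitting this, here is roughly what the
clear-and-retry loop looks like when scripted.  This is only a sketch, not
something from the thread: it assumes the stock qstat column layout (job
state in the 5th column) and greps for the "error reason" line that
"qstat -j" prints for jobs in the error state.

    #!/bin/sh
    # Find jobs with tasks stuck in "Eqw" and clear the error flag so the
    # scheduler retries them (the same as running "qmod -cj <job-number>"
    # by hand for each job).
    for job in $(qstat -u '*' | awk '$5 == "Eqw" {print $1}' | sort -u); do
        qstat -j "$job" | grep -i 'error reason'   # confirm it is the missing /home
        qmod -cj "$job"                            # tasks go back to "qw" and reschedule
    done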
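
And, again just a sketch of the automount idea above: since /home/<user>
only appears once something touches it, a queue prolog that accesses the
directory should force the mount before the job script runs.  Using $USER
here is my assumption; the prolog normally runs as the job owner, so it
should resolve to the right directory.

    #!/bin/sh
    # Hypothetical prolog: accessing the path triggers autofs, so the job
    # owner's /home is mounted by the time the job script starts.
    ls "/home/$USER" > /dev/null 2>&1
    exit 0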

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214049



