[GE users] NFS write errors on nodes not accepting jobs

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Mon Dec 17 06:12:34 GMT 2007


qstat -explain E should give you some hints about the problem (you can 
also check the messages file on the exec host).
It looks like an NFS issue; the queues were most likely put into this 
state by a job failure.
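
If there are many queue instances to check, the "E" entries can be picked out of the full listing with a small script. The following is only a minimal sketch in Python, not an official Grid Engine tool: the function names and the sample text are hypothetical, the sample is modelled on the listing quoted further down (with queue instances in qstat's queue@host form), and the regular expression assumes the column layout shown there. It only builds the qmod -c commands; it does not run them, so nothing is cleared before the underlying NFS problem has been looked at.

```python
import re

# Hypothetical sample modelled on the `qstat -f` listing quoted in this thread.
SAMPLE = """\
all.q@compute-1-10             BIP   0/4       0.00     sol-amd64     E
----------------------------------------------------------------------------
all.q@compute-1-11             BIP   4/4       3.00     sol-amd64
 337201 0.50442 run_QA1    alex         r     11/25/2007 00:03:47     1
----------------------------------------------------------------------------
all.q@compute-1-12             BIP   0/4       0.00     sol-amd64     E
"""

# Assumed layout of a queue-instance line:
# name, queue type, slot counts, load average, architecture, optional states.
QUEUE_LINE = re.compile(
    r"^(?P<name>\S+@\S+)\s+(?P<qtype>\S+)\s+(?P<slots>\d+(?:/\d+)+)\s+"
    r"(?P<load>\S+)\s+(?P<arch>\S+)(?:\s+(?P<states>\S+))?\s*$"
)


def queues_in_error(qstat_output):
    """Return the queue instance names whose state column contains 'E'."""
    errored = []
    for line in qstat_output.splitlines():
        match = QUEUE_LINE.match(line)
        if match and "E" in (match.group("states") or ""):
            errored.append(match.group("name"))
    return errored


def clear_commands(queue_instances):
    """Build one `qmod -c` argument list per queue instance (not executed)."""
    return [["qmod", "-c", name] for name in queue_instances]


if __name__ == "__main__":
    # Print the commands so an administrator can review them before running.
    for cmd in clear_commands(queues_in_error(SAMPLE)):
        print(" ".join(cmd))
```

On a real cluster the text would come from qstat -f rather than from the sample string, and the printed commands would be run by a Grid Engine administrator once the cause of the errors is fixed.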
regards,
~Ravi

Chris Dagdigian wrote:
>
> Your 2 nodes that are not accepting jobs are showing state code "E" 
> which is explained in the docs and in the manpage for "qstat".
>
> State code E usually means something nasty has happened on the node, 
> something bad enough to take out the shepherd process in many cases. 
> Usually this is a system, OS, filesystem or authentication problem 
> that will cause all jobs landing on the system to exit with errors.
>
> In these situations, to prevent a "black hole" effect where these 
> nodes continually accept jobs, only to have them exit with error -- 
> the system configures a persistent state ("E").
>
> The state will persist until manually cleared by a Grid Engine 
> administrator; reboots will have no effect.
>
> If you are satisfied that the error that caused the E state is fixed 
> (probably the NFS issues mentioned) then you can use the 'qmod' 
> command to clear the error states and your nodes will start accepting 
> jobs again.
>
> The command "qmod -c '*'" should do the trick.
>
>
> Regards,
> Chris
>
>
>
>
> On Dec 16, 2007, at 5:52 PM, FL wrote:
>
>> Here is a problem with a Sun Solaris cluster, on which two nodes stopped
>> accepting jobs.
>>
>>> There is a problem with now 2 cluster nodes. They seem to be up and
>>> running but they won't accept any jobs from the queue.
>>>
>>> ...
>>> ----------------------------------------------------------------------------
>>> all.q@compute-1-10             BIP   0/4       0.00     sol-amd64     E
>>> ----------------------------------------------------------------------------
>>> all.q@compute-1-11             BIP   4/4       3.00     sol-amd64
>>> 337201 0.50442 run_QA1    alex         r     11/25/2007 00:03:47     1
>>> 337204 0.50442 run_QA4    alex         r     11/25/2007 00:06:02     1
>>> 347212 0.50070 reduce.m23 wbackes      r     12/14/2007 08:12:32     1
>>> 349855 0.51000 gridMathem jeff         r     11/30/2007 13:53:02     1
>>> ----------------------------------------------------------------------------
>>> all.q@compute-1-12             BIP   0/4       0.00     sol-amd64     E
>>> ----------------------------------------------------------------------------
>>> ...
>>>
>>> I tried to reboot one of the nodes, but it did not solve the problem.
>>> I saw the following messages:
>>>
>>> NFS write error on host n1sm: I/O error.
>>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>>> NFS write error on host n1sm: I/O error.
>>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>>> NFS write error on host n1sm: I/O error.
>>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>>> NFS write error on host n1sm: I/O error.
>>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




