[GE users] NFS write errors on nodes not accepting jobs

Chris Dagdigian dag at sonsorol.org
Sun Dec 16 23:36:23 GMT 2007


Your 2 nodes that are not accepting jobs are showing state code "E"  
which is explained in the docs and in the manpage for "qstat".
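
If you want to pick out the affected queue instances from a script
rather than by eye, here is a minimal Python sketch. It assumes the
six-column layout shown in the listing quoted below (queue name, type,
used/total slots, load, architecture, states) and that queue instances
are written as queue@host. The function name is my own, and other Grid
Engine versions may format the output differently.

```python
def queues_in_error(qstat_text):
    """Return the queue instances whose states column contains 'E'."""
    flagged = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Assumed queue line: name qtype used/tot load_avg arch [states]
        # Job lines and separator lines never have '@' in the first field.
        if len(fields) == 6 and "@" in fields[0] and "E" in fields[5]:
            flagged.append(fields[0])
    return flagged


# Sample adapted from the "qstat -f" listing quoted in this thread.
sample = """\
all.q@compute-1-10             BIP   0/4       0.00     sol-amd64     E
all.q@compute-1-11             BIP   4/4       3.00     sol-amd64
 337201 0.50442 run_QA1    alex         r     11/25/2007 00:03:47     1
all.q@compute-1-12             BIP   0/4       0.00     sol-amd64     E
"""

print(queues_in_error(sample))
# ['all.q@compute-1-10', 'all.q@compute-1-12']
```

Running "qstat -f -explain E" should also print the reason recorded
for each error state, if your version supports that option.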

State code E usually means something nasty has happened on the node,  
something bad enough to take out the shepherd process in many cases.  
Usually this is a system, OS, filesystem or authentication problem  
that will cause all jobs landing on the system to exit with errors.

In these situations, to prevent a "black hole" effect where these  
nodes continually accept jobs only to have them exit with an error,  
Grid Engine puts the affected queue into a persistent error state ("E").
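
To see why the error state matters, here is a toy Python model of the
black hole effect. It is only an illustration and not how the Grid
Engine scheduler works: the dispatch rule (send each job to the enabled
node with the fewest running jobs), the node names, and the function
are all made up for the example.

```python
def dispatch(num_jobs, nodes, broken, disable_on_error):
    """Toy scheduler: each job goes to the enabled node with the fewest
    running jobs. Returns the number of jobs that failed."""
    running = {name: 0 for name in nodes}
    disabled = set()
    failed = 0
    for _ in range(num_jobs):
        candidates = [n for n in nodes if n not in disabled]
        target = min(candidates, key=lambda n: running[n])
        if target in broken:
            failed += 1                # job exits at once, slot is free again
            if disable_on_error:
                disabled.add(target)   # rough analogue of entering state E
        else:
            running[target] += 1       # healthy job keeps occupying a slot
    return failed


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
print(dispatch(100, nodes, {"node-b"}, disable_on_error=False))  # 99
print(dispatch(100, nodes, {"node-b"}, disable_on_error=True))   # 1
```

Because the broken node always looks idle, it swallows nearly every
job unless it is taken out of service after the first failure.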

The state will persist until it is manually cleared by a Grid Engine  
administrator; rebooting the node has no effect.

If you are satisfied that the error that caused the E state is fixed  
(probably the NFS issues mentioned) then you can use the 'qmod'  
command to clear the error states and your nodes will start accepting  
jobs again.

The command "qmod -c '*'" should do the trick.
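
If you would rather confirm that the filesystem is writable again
before clearing anything, a small wrapper along these lines could
help. This is a sketch under assumptions: the helper names are mine,
the directory you pass in is whatever NFS mount your jobs write to, and
the check has to run on the affected node itself. By default it only
reports the command it would run.

```python
import os
import subprocess
import tempfile


def can_write(directory):
    """Return True if a small file can be created, synced, and removed."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"sge write test\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the server
        return True
    except OSError:
        return False


def clear_command(pattern="*"):
    """Build the qmod invocation mentioned above as an argument list."""
    return ["qmod", "-c", pattern]


def clear_if_healthy(directory, pattern="*", dry_run=True):
    """Clear error states only when the write test succeeds.

    Returns the command list, or None if the write test failed.
    With dry_run=True (the default) qmod is not actually executed.
    """
    if not can_write(directory):
        return None
    command = clear_command(pattern)
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Calling clear_if_healthy() with dry_run=False runs qmod for real, so
it needs the same administrator privileges as typing the command.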


Regards,
Chris




On Dec 16, 2007, at 5:52 PM, FL wrote:

> Here is a problem with a Sun Solaris cluster, on which two nodes
> stopped accepting jobs.
>
>> There is a problem with now 2 cluster nodes. They seem to be up and
>> running but they won't accept any jobs from the queue.
>>
>> ...
>> ----------------------------------------------------------------------------
>> all.q@compute-1-10             BIP   0/4       0.00     sol-amd64     E
>> ----------------------------------------------------------------------------
>> all.q@compute-1-11             BIP   4/4       3.00     sol-amd64
>> 337201 0.50442 run_QA1    alex         r     11/25/2007 00:03:47     1
>> 337204 0.50442 run_QA4    alex         r     11/25/2007 00:06:02     1
>> 347212 0.50070 reduce.m23 wbackes      r     12/14/2007 08:12:32     1
>> 349855 0.51000 gridMathem jeff         r     11/30/2007 13:53:02     1
>> ----------------------------------------------------------------------------
>> all.q@compute-1-12             BIP   0/4       0.00     sol-amd64     E
>> ----------------------------------------------------------------------------
>> ...
>>
>> I tried to reboot one of the nodes, but it did not solve the
>> problem.
>> I saw the following messages:
>>
>> NFS write error on host n1sm: I/O error.
>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>> NFS write error on host n1sm: I/O error.
>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>> NFS write error on host n1sm: I/O error.
>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>> NFS write error on host n1sm: I/O error.
>> (file handle: 154000a 2 a a0fb6 52e56c84 a a0fb6 52e56c84 0)
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
