[GE users] How to remove "E" in the queue status

Chris Dagdigian dag at sonsorol.org
Fri Jun 16 11:06:06 BST 2006


The "E" error state usually means that a job died in a spectacular  
manner (possibly taking down the sge_shepherd with it).

SGE persists the E state until it is manually cleared, to prevent a  
"black hole" effect whereby all your pending jobs drain into a  
potentially "bad" machine and all exit quickly with some type of error.

The first thing you should do is find out what caused the E state.
If it was a transient error, or something you do not expect to
repeat, you can clear the error state. There is little point in
clearing the E state if it is just going to come back again.
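
One way to look up the cause, sketched here on the assumption that the SGE 6.x client tools are on your PATH: qstat accepts "-explain E", which prints the reason recorded for each queue instance in the error state. The function name show_error_reasons is made up for illustration. The "messages" file in the execd spool directory of the affected host (typically under $SGE_ROOT/$SGE_CELL/spool/<hostname>/, depending on how the cluster was installed) is also worth reading.

```shell
# Sketch: print the reason SGE recorded for each queue instance in state E.
# Assumes the SGE client tools are on PATH; show_error_reasons is a made-up name.
show_error_reasons() {
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found; source the SGE settings file first" >&2
        return 1
    fi
    # -f gives the full per-queue listing, -explain E adds the error reason
    qstat -f -explain E
}
```

Calling show_error_reasons on a submit or admin host should list each queue instance together with the error text, if any was recorded.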

The clear command is "qmod -c", and you can clear the E state across
your whole cluster with "qmod -c '*'" (quote the asterisk so the shell
does not expand it).
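
If only one node is affected, it may be safer to clear just that queue instance rather than the whole cluster. A minimal sketch, assuming the queue instance is addressed as queue@host; the function name clear_error_on_host, the default queue name all.q and the host name node01 are placeholders, not taken from the original question.

```shell
# Sketch: clear the E state on a single queue instance only.
# clear_error_on_host and the default queue name all.q are illustrative.
clear_error_on_host() {
    host="$1"
    queue="${2:-all.q}"
    if [ -z "$host" ]; then
        echo "usage: clear_error_on_host HOST [QUEUE]" >&2
        return 2
    fi
    # queue@host names one queue instance; qmod -c clears its error state
    qmod -c "${queue}@${host}"
}
```

For example, "clear_error_on_host node01" would run "qmod -c all.q@node01". This needs to be run by a user with manager or operator rights on the cluster.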

Regards,
Chris



On Jun 16, 2006, at 5:57 AM, Yusuf Sun wrote:

> Dear SGE users,
>
> We installed SGE on a small cluster. Recently, "qstat -f" shows
> one node in state "E". I guess it means there was some error on this node.
> I rebooted this node and restarted sge_execd on it.
> The "E" is still there. How can I find this error and get rid of this
> "E"?
>
> Thanks
> Y.Sun
