[GE users] qstat state AU

Chris Dagdigian dag at sonsorol.org
Tue Jun 20 03:24:10 BST 2006


Peiran,

The "E" state is more of a concern than the 'au' state.

'au' is a combination of two states: the "a" means 'alarm' and the "u"
means unheard/unreachable. Together they usually mean that SGE (the
execution daemon) is not running on the compute node.

E is a worse state to see. It means that there was a major problem on  
the compute node (with the system or the job itself). SGE  
intentionally marked the queue as state "E" so that other jobs would  
not run into the same bad problem.
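
As a rough illustration of how the states column reads, here is a small
Python sketch that decodes the letters mentioned in this thread. It covers
only "a", "u" and "E" as described above; the dictionary and function names
are made up for the example, and qstat(1) lists further state letters that
are not handled here.

```python
# Sketch only: the three queue state letters discussed in this thread.
# The table is deliberately partial; see qstat(1) for the full list.
STATE_MEANINGS = {
    "a": "alarm",
    "u": "unheard/unreachable",
    "E": "error",
}


def decode_states(states):
    """Return one description per letter in a qstat states string."""
    return [STATE_MEANINGS.get(ch, "not covered in this sketch") for ch in states]


for code in ("au", "E"):
    # Prints each state string followed by the meaning of its letters.
    print(code, "->", ", ".join(decode_states(code)))
```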

E states do not go away automatically, even if you reboot the  
cluster. Once you think the cluster is fine you can use the "qmod"  
command to clear the E state.
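
Before clearing anything it helps to list exactly which queue instances are
in E. The following is a minimal sketch, not an official tool: it scans
"qstat -f" style text for queue lines whose states column contains "E" and
prints a qmod command for each one without running it. The sample text is
modeled on the output quoted below with the wrapped lines rejoined, the
helper name is hypothetical, and the "-cq" option is my assumption about the
SGE 6 qmod syntax for clearing a queue error state, so check qmod(1) on your
install before running anything.

```python
def queues_in_error(qstat_f_output):
    """Return queue instance names whose states column contains 'E'."""
    found = []
    for line in qstat_f_output.splitlines():
        # Job lines are indented, separators start with '-' or '#',
        # and the header starts with 'queuename': skip all of those.
        if not line or line[0].isspace() or line.startswith(("-", "#", "queuename")):
            continue
        fields = line.split()
        # A queue line has 5 fields, plus a 6th (states) only when a state is set.
        if len(fields) == 6 and "E" in fields[5]:
            found.append(fields[0])
    return found


# Sample modeled on the qstat -f output quoted in this thread.
SAMPLE = """\
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node001.cluster.private  BIP   0/2       0.09     darwin        E
----------------------------------------------------------------------------
all.q@node002.cluster.private  BIP   2/2       -NA-     darwin        au
   8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 1
----------------------------------------------------------------------------
all.q@node003.cluster.private  BIP   0/2       0.00     darwin        E
"""

for queue in queues_in_error(SAMPLE):
    # Printed for review only; nothing is executed here.
    print("qmod -cq", queue)
```

Printing the commands instead of executing them is deliberate: the E state
is there for a reason, so each node should be checked before its queue is
cleared.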

-Chris




On Jun 19, 2006, at 9:47 PM, McCalla, Mac wrote:

> Hi,
> I don't know anything about Apple systems, but the E state for the
> queues means error. Look in the messages file under the
> $sge_root/$cell/qmaster directory for messages indicating the job
> causing the problem. The au state normally indicates that the SGE
> execution daemon, which should be running on the execution host, is
> not running and needs to be restarted.
>
> HTH,
> Mac McCalla
>
>
> -----Original Message-----
> From: Peiran Song <peirans at cs.uoregon.edu>
> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
> Sent: Mon Jun 19 17:41:07 2006
> Subject: [GE users] qstat state AU
>
> Hi All,
>
> Our Apple cluster running Grid Engine 6 is sick. The "qstat -f" output
> looks like this:
>
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@genomix.cs.uoregon.edu   BIP   0/2       0.03     darwin        E
> ----------------------------------------------------------------------------
> all.q@node001.cluster.private  BIP   0/2       0.09     darwin        E
> ----------------------------------------------------------------------------
> all.q@node002.cluster.private  BIP   2/2       -NA-     darwin        au
>   8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 1
>   8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 2
> ----------------------------------------------------------------------------
> all.q@node003.cluster.private  BIP   0/2       0.00     darwin        E
> ----------------------------------------------------------------------------
> all.q@node004.cluster.private  BIP   0/2       0.00     darwin        E
> ----------------------------------------------------------------------------
> all.q@node005.cluster.private  BIP   0/2       0.00     darwin        E
>
> ...  Followed by a long and growing pending list.
>
> What is the way to tackle "au" states?
>
> Any input would be appreciated!
>
> Regards,
> Peiran
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




