[GE users] qstat state AU

Peiran Song peirans at cs.uoregon.edu
Tue Jun 20 19:23:20 BST 2006



Many thanks to Mac, Chris, and Johnny for the hints and tips.

I traced the root cause:

06/18/2006 18:36:19|qmaster|genomix|E|error ending a transaction: (28) No space left on device
06/18/2006 18:36:19|qmaster|genomix|W|rule "default rule" in spooling context "berkeleydb spooling" failed writing an object
06/18/2006 18:36:19|qmaster|genomix|E|error writing object "peirans" to spooling database
06/18/2006 18:36:20|qmaster|genomix|E|orders user/project version (1211243) is not uptodate (1211244) for user/project "peirans"
06/18/2006 18:36:21|qmaster|genomix|E|orders user/project version (1211243) is not uptodate (1211244) for user/project "peirans"
06/18/2006 18:36:22|qmaster|genomix|E|orders user/project version (1211243) is not uptodate (1211244) for user/project "peirans"
06/18/2006 18:36:23|qmaster|genomix|E|orders user/project version (1211243) is not uptodate (1211244) for user/project "peirans"
06/18/2006 18:40:53|qmaster|genomix|W|rescheduling job 9210.7
06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change tickets of a non running job 9210 task 7(state 0)
06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change tickets of a non running job 9210 task 9(state 0)
06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change tickets of a non running job 9210 task 10(state 0)
06/18/2006 23:00:03|qmaster|genomix|E|orders user/project version (1211243) is not uptodate (1211244) for user/project "peirans"
...
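
For anyone digging through a larger messages file, the entries appear to be pipe-separated (timestamp|daemon|host|level|message), so a small filter can pull out just the error-level lines. This is only a sketch: the function name sge_errors is made up for illustration, and it assumes each entry sits on a single line in the file and that the message text contains no further "|" characters.

```shell
# Print only error-level (E) entries from a qmaster messages file.
# Assumed field layout: timestamp|daemon|host|level|message
# (sge_errors is a hypothetical helper name, not part of Grid Engine.)
sge_errors() {
    awk -F'|' '$4 == "E" { print $1 ": " $5 }' "$1"
}
```

It would be called with the path to the messages file, for example sge_errors "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" if the qmaster spool directory is in that location on your installation (an assumption; it may differ).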

We freed up space on the device, deleted the jobs on node002, and 
restarted SGE using

SystemStarter stop SGE
SystemStarter start SGE

Then we rebooted the compute nodes. node002 was not accepting ssh connections, so we power-cycled it. Finally, we cleared the E state from the head node.
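
Since the trigger was a full device under the spooling database, a periodic check along these lines might catch the problem before qmaster fails to write. This is a minimal sketch: check_space is a hypothetical helper, the 90% default threshold is arbitrary, and it relies only on the POSIX output format of df -P.

```shell
# Warn when the filesystem holding a directory is nearly full.
# Usage: check_space DIR [MAX_PERCENT]
# (check_space and the 90% default are illustrative assumptions.)
check_space() {
    dir=$1
    max=${2:-90}
    # df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%".
    used=$(df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    if [ "$used" -ge "$max" ]; then
        echo "WARNING: $dir is ${used}% full"
        return 1
    fi
    echo "OK: $dir is ${used}% full"
}
```

Pointing it at the directory that holds the Berkeley DB spool (wherever that lives on your cluster) from a cron job would be one way to use it.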

Now the qstat output is clean, but the qmaster messages file continues to log errors:


06/20/2006 10:59:11|qmaster|genomix|E|scheduler tries to schedule job 9323.1 twice
06/20/2006 10:59:11|qmaster|genomix|E|orders user/project version (502) is not uptodate (503) for user/project "www"
06/20/2006 10:59:13|qmaster|genomix|E|orders user/project version (503) is not uptodate (504) for user/project "www"
......
06/20/2006 10:59:51|qmaster|genomix|E|orders user/project version (522) is not uptodate (523) for user/project "www"
06/20/2006 10:59:52|qmaster|genomix|E|unable to find job 9323 from the scheduler order package
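
One way to see whether these messages keep accumulating is to count them per user/project and compare the numbers over time. The following is only a sketch: count_stale_orders is a made-up name, and it assumes each entry is on one line in the messages file and that the user/project name is the first double-quoted string on the line, as in the entries above.

```shell
# Count "is not uptodate" order errors per user/project in a messages file.
# (count_stale_orders is a hypothetical helper name.)
count_stale_orders() {
    # Splitting on double quotes makes field 2 the quoted user/project name.
    awk -F'"' '/is not uptodate/ { n[$2]++ }
               END { for (u in n) print u, n[u] }' "$1" | sort
}
```

Running it a few minutes apart on the same file shows whether the count for a given user/project is still climbing.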


Will these errors build up to a blocking state again?

Thanks!
Peiran






McCalla, Mac wrote:
> Hi,
> I don't know anything about Apple systems, but the E state for the queues is error; look in the messages file under the $sge_root/$cell/qmaster directory for messages indicating the job causing the problem.  The au state normally indicates that the SGE exec daemon, which should be running on the execution host, is not and needs to be restarted.
>
> HTH,
> Mac McCalla
> --------------------------
> Sent from my BlackBerry Wireless Handheld
>
>
> -----Original Message-----
> From: Peiran Song <peirans at cs.uoregon.edu>
> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
> Sent: Mon Jun 19 17:41:07 2006
> Subject: [GE users] qstat state AU
>
> Hi All,
>
> Our Apple cluster running Grid Engine 6 is sick, the "qstat -f" output 
> is like:
>
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@genomix.cs.uoregon.edu   BIP   0/2       0.03     darwin        E
> ----------------------------------------------------------------------------
> all.q@node001.cluster.private  BIP   0/2       0.09     darwin        E
> ----------------------------------------------------------------------------
> all.q@node002.cluster.private  BIP   2/2       -NA-     darwin        au
>   8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 1
>   8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 2
> ----------------------------------------------------------------------------
> all.q@node003.cluster.private  BIP   0/2       0.00     darwin        E
> ----------------------------------------------------------------------------
> all.q@node004.cluster.private  BIP   0/2       0.00     darwin        E
> ----------------------------------------------------------------------------
> all.q@node005.cluster.private  BIP   0/2       0.00     darwin        E
>
> ...  Followed by a long and growing pending list.
>
> What is the way to tackle "au" states?
>
> Any input would be appreciated!
>
> Regards,
> Peiran
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net



