[GE users] qstat state AU

Reuti reuti at staff.uni-marburg.de
Tue Jun 20 19:57:08 BST 2006


On 20.06.2006 at 20:23, Peiran Song wrote:

> Many thanks to Mac, Chris, and Johnny for the hints and tips.
>
> I traced the root cause:
>
> 06/18/2006 18:36:19|qmaster|genomix|E|error ending a transaction:  
> (28) No space left on device
> 06/18/2006 18:36:19|qmaster|genomix|W|rule "default rule" in  
> spooling context "berkeleydb spooling" failed writing an object
> 06/18/2006 18:36:19|qmaster|genomix|E|error writing object  
> "peirans" to spooling database
> 06/18/2006 18:36:20|qmaster|genomix|E|orders user/project version  
> (1211243) is not uptodate (1211244) for user/project "peirans"
> 06/18/2006 18:36:21|qmaster|genomix|E|orders user/project version  
> (1211243) is not uptodate (1211244) for user/project "peirans"
> 06/18/2006 18:36:22|qmaster|genomix|E|orders user/project version  
> (1211243) is not uptodate (1211244) for user/project "peirans"
> 06/18/2006 18:36:23|qmaster|genomix|E|orders user/project version  
> (1211243) is not uptodate (1211244) for user/project "peirans"
> 06/18/2006 18:40:53|qmaster|genomix|W|rescheduling job 9210.7
> 06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change  
> tickets of a non running job 9210 task 7(state 0)
> 06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change  
> tickets of a non running job 9210 task 9(state 0)
> 06/18/2006 23:00:03|qmaster|genomix|W|scheduler tries to change  
> tickets of a non running job 9210 task 10(state 0)
> 06/18/2006 23:00:03|qmaster|genomix|E|orders user/project version  
> (1211243) is not uptodate (1211244) for user/project "peirans"
> ...
>
> We freed up space on the device, deleted the jobs on node002, and
> restarted SGE using
>
> SystemStarter stop SGE
> SystemStarter start SGE
>
> then rebooted the compute nodes. node002 was not accepting ssh
> connections, so it was power-cycled and rebooted. Then we cleared the
> E state at the head node.
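
For reference, the recovery just described as commands (a sketch only; the
mount point is an assumed example, the job id is the one from the log, and
qdel -f and qmod -cq are the usual SGE 6 tools for this):

$ df -h /opt/gridengine                     # confirm the spool filesystem has free space again
$ qdel -f 9210                              # force-delete the stuck array job
$ qmod -cq all.q@node002.cluster.private    # clear the E state on the affected queue instance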
> Now the qstat output is clean, but the qmaster messages file continues to print:
>
>
> 06/20/2006 10:59:11|qmaster|genomix|E|scheduler tries to schedule job 9323.1 twice
> 06/20/2006 10:59:11|qmaster|genomix|E|orders user/project version (502) is not uptodate (503) for user/project "www"
> 06/20/2006 10:59:13|qmaster|genomix|E|orders user/project version (503) is not uptodate (504) for user/project "www"
> ......
> 06/20/2006 10:59:51|qmaster|genomix|E|orders user/project version (522) is not uptodate (523) for user/project "www"

You can try to clear this error by:

$ qconf -muser www

Just change one character, even a space, and exit vi with :x. - Reuti
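
If you prefer to avoid the editor, the same touch-and-rewrite can be done
non-interactively (a sketch using qconf -suser/-Muser; rewriting the object
this way should bump its stored version just as the edit does):

$ qconf -suser www > /tmp/www.user    # dump the current user object to a file
$ qconf -Muser /tmp/www.user          # write it back, bumping its version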


> 06/20/2006 10:59:52|qmaster|genomix|E|unable to find job 9323 from the scheduler order package
>
>
> Will these errors grow into a blocking state again?
>
> Thanks!
> Peiran
>
>
>
>
>
>
> McCalla, Mac wrote:
>> Hi,
>> I don't know anything about Apple systems, but the E state for the
>> queues means error. Look in the messages file under the
>> $sge_root/$cell/qmaster directory for entries indicating the job
>> causing the problem. The au state normally indicates that the
>> sge_execd daemon, which should be running on the execution host, is
>> not, and needs to be restarted.
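
For completeness, a sketch of checking and restarting the exec daemon on a
node (assuming the standard SGE 6 rc script location; on these Apple nodes
the SystemStarter commands shown above serve the same purpose):

$ ps -ax | grep sge_execd                     # is the exec daemon running?
$ $SGE_ROOT/$SGE_CELL/common/sgeexecd start   # if not, start it as root on that node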
>> HTH,
>> Mac McCalla
>> --------------------------
>> Sent from my BlackBerry Wireless Handheld
>>
>>
>> -----Original Message-----
>> From: Peiran Song <peirans at cs.uoregon.edu>
>> To: users at gridengine.sunsource.net <users at gridengine.sunsource.net>
>> Sent: Mon Jun 19 17:41:07 2006
>> Subject: [GE users] qstat state AU
>>
>> Hi All,
>>
>> Our Apple cluster running Grid Engine 6 is sick; the "qstat -f"
>> output looks like this:
>>
>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> all.q@genomix.cs.uoregon.edu   BIP   0/2       0.03     darwin        E
>> ----------------------------------------------------------------------------
>> all.q@node001.cluster.private  BIP   0/2       0.09     darwin        E
>> ----------------------------------------------------------------------------
>> all.q@node002.cluster.private  BIP   2/2       -NA-     darwin        au
>>    8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 1
>>    8086 0.55500 J19260.zfi peirans      r     06/04/2006 18:08:05     1 2
>> ----------------------------------------------------------------------------
>> all.q@node003.cluster.private  BIP   0/2       0.00     darwin        E
>> ----------------------------------------------------------------------------
>> all.q@node004.cluster.private  BIP   0/2       0.00     darwin        E
>> ----------------------------------------------------------------------------
>> all.q@node005.cluster.private  BIP   0/2       0.00     darwin        E
>>
>> ... followed by a long and growing pending list.
>>
>> What is the way to tackle "au" states?
>>
>> Any input would be appreciated!
>>
>> Regards,
>> Peiran

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



