[GE users] Queues are in error state

Mark Ellerby issmde at leeds.ac.uk
Tue May 3 10:35:31 BST 2005


We have had SGE 6.0 installed on our Linux Beowulf cluster for a couple
of weeks and it has been working OK. However, when I came back to work
following the long weekend, I found most queues in the error state.
Looking through the qmaster/messages file, the problem seems to have
started when job 982 failed. The error message is as follows:

04/30/2005 21:44:28|qmaster|snowdon|W|job 982.1 failed on host
snowdon.leeds.ac.uk general assumedly before job because: can't write
script file "job_scripts/982" wrote only -1 of 4451552 bytes: Bad address
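
To see which hosts a failing job touched, entries like the one above could be pulled out of the messages file with a short script. The following is only a minimal sketch: it assumes the pipe-delimited layout visible in that entry (timestamp|daemon|host|level|text) and that each entry sits on a single line in the real file. The function name failed_jobs and the regular expression are illustrative and not part of SGE.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the entry above:
#   timestamp|daemon|host|level|text
# The pattern below is a guess that fits that one entry.
FAILED = re.compile(r"job (\d+)\.(\d+) failed on host (\S+)")


def failed_jobs(lines):
    """Map job id -> list of hosts on which qmaster logged a failure."""
    hosts = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not a complete log entry, skip it
        match = FAILED.search(fields[4])
        if match:
            hosts[match.group(1)].append(match.group(3))
    return dict(hosts)


# The entry quoted above, joined back into one line.
sample = [
    "04/30/2005 21:44:28|qmaster|snowdon|W|job 982.1 failed on host "
    "snowdon.leeds.ac.uk general assumedly before job because: "
    "can't write script file \"job_scripts/982\" wrote only -1 of "
    "4451552 bytes: Bad address",
]
print(failed_jobs(sample))
```

Passing an open file object for the qmaster messages file instead of sample would give, per job id, every host on which a failure was logged, which should show how far job 982 spread.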

I can't find any record of that job unfortunately, so I can't see what 
it was trying to run.

The strangest thing is that SGE then seems to have tried to run that job
on pretty much every queue in the system, putting most of our compute
nodes out of action (in error state). I can't understand why the
queueing system would do this, because it doesn't normally happen when
a job fails.

Could this be a bug in SGE 6.0, or could it be that I've not set up the
queueing system correctly?

Any help appreciated.


Mark Ellerby                         email: m.d.ellerby at leeds.ac.uk
Information Systems Services         phone: +44 (0)113 3435429
University of Leeds
