[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 21:57:05 GMT 2010


I'm still getting this error on many of my cluster nodes:

cannot run on host "node64.aurora" until clean up of an previous run has
finished

I've tried just about everything I can think of to diagnose and fix this
problem:

1. Restarted the execd daemons on the afflicted nodes
2. Restarted sge_qmaster
3. Shut down the afflicted nodes, restarted sge_qmaster, restarted the
afflicted nodes
4. Used 'qmod -f -cq all.q@*'
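
For reference, step 4 and a follow-up check can be sketched as a small
script. This is only a rough sketch: the helper names (run,
show_error_reasons, clear_queue_errors) and the DRY_RUN switch are made up
for illustration, and the script defaults to printing the commands rather
than running them. The pattern is quoted so the shell does not try to
expand the '*' itself.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with the Grid Engine client tools to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# List queue instances and ask qstat to explain any error (E) state.
show_error_reasons() {
    run qstat -f -explain E
}

# Force-clear the error state on the queue instances matching a pattern.
clear_queue_errors() {
    run qmod -f -cq "$1"
}

show_error_reasons
clear_queue_errors 'all.q@*'
```

If the queue instances show no error state in the qstat output, clearing
them is unlikely to change anything, which would match what I'm seeing.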

I checked the spool logs on the server and the nodes (the spool dir is
on a local filesystem for each), and there are no extraneous job files.
In fact, the spool directory is pretty much empty.

I'm using classic spooling, so it can't be a hosed bdb file.

The only thing I can think of at this point is to delete the queue
instances and re-add them.

I'm fairly sure this problem was caused by someone running a job that
used up all the RAM on these nodes and probably triggered the OOM-killer.

Any other ideas?

-- 
Prentice

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245957