[GE users] cannot run on host until clean up of an previous run has finished

templedf dan.templeton at sun.com
Wed Feb 24 22:02:11 GMT 2010


Are you absolutely certain that the offending job isn't continuing to 
try to run, making it only look like a persistent problem?  My only 
suggestion would be to put a hold on that job or kill it and resubmit it 
to see if that helps.
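
A rough sketch of what that could look like with the standard qhold, qdel
and qsub commands. The job ID and script name are placeholders, not values
from this thread, and the DRY_RUN switch is only an illustrative safety
wrapper so that nothing is executed until you have checked the commands:

```shell
#!/bin/sh
# Sketch only: JOB_ID and JOB_SCRIPT are placeholders, not values from this thread.
JOB_ID="${JOB_ID:-12345}"
JOB_SCRIPT="${JOB_SCRIPT:-myjob.sh}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Option 1: put a hold on the job so the scheduler stops trying to dispatch it.
hold_job() {
    run qhold "$JOB_ID"
}

# Option 2: delete the job and submit it again.
kill_and_resubmit() {
    run qdel "$JOB_ID"
    run qsub "$JOB_SCRIPT"
}
```

Call hold_job or kill_and_resubmit with DRY_RUN=1 first to see the commands,
then set DRY_RUN=0 to actually run them. If the message goes away once the
job is held or gone, the job was what kept recreating the state.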

I'm pretty sure that the reschedule unknown list isn't spooled, so if 
the master keeps having the issue after a restart, the list must be 
getting recreated constantly.

As a stab in the dark, have you tried taking down the entire cluster at 
once?  That would make sure that no non-persistent state would survive.
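
As a sketch of the shutdown half, assuming it is run as an SGE admin user on
a host that can reach the qmaster: "qconf -ke all" asks every execution
daemon to stop and "qconf -km" stops the qmaster. The DRY variable is just an
illustrative guard that defaults to echoing the commands instead of running
them:

```shell
#!/bin/sh
# Sketch only. DRY defaults to "echo", so the commands are printed, not run.
# Set DRY="" to really execute them.
DRY="${DRY-echo}"

stop_cluster() {
    # Stop all execution daemons first, while the qmaster is still up
    # to pass the request on to them.
    $DRY qconf -ke all
    # Then stop the qmaster itself.
    $DRY qconf -km
}
```

Once everything is down at the same time, start the qmaster and then the
execds again with whatever startup scripts your installation uses.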

Daniel

On 02/24/10 13:57, prentice wrote:
> I'm still getting this error on many of my cluster nodes:
>
> cannot run on host "node64.aurora" until clean up of an previous run has
> finished
>
> I've tried just about everything I can think of to diagnose and fix this
> problem:
>
> 1. I restarted the execd daemons on the afflicted nodes
> 2. Restarted sge_qmaster
> 3. Shut down the afflicted nodes, restarted sge_qmaster, restarted
> afflicted nodes.
> 4. Used 'qmod -f -cq all.q@*'
>
> I checked the spool logs on the server and the nodes (the spool dir is
> on a local filesystem for each), and there are no extraneous job files.
> In fact, the spool directory is pretty much empty.
>
> I'm using classic spooling, so it can't be a hosed bdb file.
>
> The only thing I can think of at this point is to delete the queue
> instances and re-add them.
>
> I know this problem was probably caused by someone running a job that
> used up all the RAM on these nodes and probably triggered the OOM-killer.
>
> Any other ideas?
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245958