[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 22:44:48 GMT 2010


templedf wrote:
> Are you absolutely certain that the offending job isn't continuing to 
> try to run, making it only look like a persistent problem?  

What do you mean by offending job? The one that won't run because the
other nodes haven't "cleaned up"?

I logged into every single node in question and ran 'ps -ef', 'ps -e f',
top, etc., everything I could think of, to make sure the systems were, in
fact, idle.
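
In case anyone wants to reproduce the check, a quick loop like the one
below (host names are just examples) is enough to spot any leftover
sge_shepherd or job processes; nothing job-related turned up on any of
them:

    for h in node60.aurora node61.aurora node64.aurora; do
        echo "== $h =="
        ssh $h 'ps -ef | grep [s]ge_shepherd'
    done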

>My only 
> suggestion would be to put a hold on that job or kill it and resubmit it 
> to see if that helps.

You mean the job that won't run because the other nodes haven't "cleaned
up"?

I did 'qmod -rj <jobid>', but nothing happened.

I just put a hold on it, and then removed it after a few minutes - no
change.
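
Concretely, that was just the standard hold/release commands, something
like the following, with <jobid> standing in for the real job number:

    qhold <jobid>     # put a user hold on the pending job
    qstat -j <jobid>  # check the scheduling info for the clean-up message
    qrls <jobid>      # release the hold again

Neither step changed anything.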

> 
> I'm pretty sure that the reschedule unknown list isn't spooled, so if
> the master is continuing to have the issue, it has to be getting
> constantly recreated.

That's what I suspected, which is why I shut down sge_qmaster before
restarting the nodes, so that the erroneous information wouldn't be
propagated between the sgeexecd and sge_qmaster processes.
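
For the record, the sequence was basically: stop the master, restart the
execds or reboot the nodes, then bring the master back up. Assuming the
stock startup script and a cell named "default", that is roughly:

    qconf -km                                 # shut down sge_qmaster
    # ...restart sgeexecd / reboot the affected nodes while the master is down...
    $SGE_ROOT/default/common/sgemaster start  # bring sge_qmaster back up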
> 
> As a stab in the dark, have you tried taking down the entire cluster at 
> once?  That would make sure that no non-persistent state would survive.

No, and that's not really an option. I have some uncooperative users who
refuse to stop their jobs and don't use checkpointing. We are expecting a
big snowstorm tonight into Friday or Saturday. The odds are good that
we'll lose power, which may force that to happen anyway. Barring the
forces of nature, rebooting the whole cluster is not an option for me.

I did try deleting some of the exec nodes and re-adding them, but
that's a royal pain, since I have to delete them from all the hostgroups
and queues first, and then remember to add them back. Also, if I do
'qconf -se <hostname> > savefile' to save the host's configuration, I
can't do 'qconf -Ae savefile' unless I edit the file to remove the
entries for load_values and processors.
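
In case it saves someone else the trouble, the round trip that does work
looks roughly like this (node64.aurora is just one of the affected hosts,
and the file name is arbitrary):

    qconf -se node64.aurora > node64.eh   # dump the exec host configuration
    vi node64.eh                          # delete the load_values and processors lines
    qconf -Ae node64.eh                   # add the host back from the edited file

and even then the host still has to be put back into the right hostgroups
and queues by hand.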

> 
> Daniel
> 
> On 02/24/10 13:57, prentice wrote:
>> I'm still getting this error on many of my cluster nodes:
>>
>> cannot run on host "node64.aurora" until clean up of an previous run has
>> finished
>>
>> I've tried just about everything I can think of to diagnose and fix this
>> problem:
>>
>> 1. I restarted the execd daemons on the afflicted nodes
>> 2. Restarted sge_qmaster
>> 3. Shut down the afflicted nodes, restarted sge_qmaster, restarted the
>> afflicted nodes.
>> 4. Used 'qmod -f -cq all.q@*'
>>
>> I checked the spool logs on the server and the nodes (the spool dir is
>> on a local filesystem for each), and there are no extraneous job files.
>> In fact, the spool directory is pretty much empty.
>>
>> I'm using classic spooling, so it can't be a hosed BDB file.
>>
>> The only thing I can think of at this point is to delete the queue
>> instances and re-add them.
>>
>> I know this problem was probably caused by someone running a job that
>> used up all the RAM on these nodes and probably triggered the OOM-killer.
>>
>> Any other ideas?
>>
>>
> 

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
