[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Thu Feb 25 15:25:50 GMT 2010


I fixed this by deleting the hosts from SGE and then re-adding them. For
the sake of future victims of a problem like this, here's what I did,
since there are a few minor gotchas:

0. Disable all queues on the affected hosts, and make sure no jobs are
running on them before starting; a quick idle check is sketched below.

for host in <list of nodes>; do
 qmod -d \*@$host
done
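
To verify that the hosts really are idle as far as SGE is concerned,
qhost can list the jobs each execution host is running, so an idle host
should report none:

for host in <list of nodes>; do
 # an idle host should list no jobs here
 qhost -j -h $host
done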

1. Write the configs of all the hosts to be deleted to text files:

for host in <list of nodes>; do
 qconf -se $host > $host.txt
done

2. Edit each text file and remove the entries for "load_values" and
"processors". These are values calculated by SGE, and they will generate
errors when you try to add the execution hosts back to the config later
on. Since the load_values entry spans multiple lines and may be a
different number of lines on different hosts, you can't do a simple sed
operation to remove the lines. I used vi *.txt to open them all at once;
a scripted alternative is sketched below.
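
If you would rather script this instead of hand-editing, something like
the awk sketch below may work; it assumes qconf marks the continued
lines of the load_values entry with a trailing backslash, so check one
of the saved files before trusting it:

for f in *.txt; do
 # strip the load_values entry (plus any backslash-continued lines)
 # and the processors line, then overwrite the saved config
 awk '/^load_values/ {skip=1}
      skip && /\\$/  {next}
      skip           {skip=0; next}
      /^processors/  {next}
      {print}' $f > $f.new && mv $f.new $f
done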

3. Edit any host groups or queues that reference the nodes you are about
to delete. You will have to edit the hostgroup @allhosts at a minimum:

qconf -mhgrp @allhosts
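
If you want a record of the original membership to make step 6 easier,
the group definition can also be dumped to a file first (the file name
here is arbitrary):

# save the current @allhosts definition before removing the nodes
qconf -shgrp @allhosts > allhosts.txt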

4. Delete the missing hosts from SGE:

for host in <list of nodes>; do
 qconf -de $host
done

5. Add them back:

for host in <list of nodes>; do
 qconf -Ae $host.txt
done
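
The re-added hosts should now show up in the execution host list again;
a quick way to confirm:

# list all execution hosts known to the qmaster
qconf -sel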

6. Edit the hostgroups or queues you modified in step 3 to add the
hosts back (a non-interactive alternative is sketched below):

qconf -mhgrp @allhosts
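
If you'd rather not go through the interactive editor again, qconf can
also append entries to a hostgroup's hostlist attribute directly;
something along these lines should work, but double-check the argument
order against your qconf man page:

for host in <list of nodes>; do
 # add the node back to @allhosts without opening an editor
 qconf -aattr hostgroup hostlist $host @allhosts
done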

That should be it. Be sure to check that the hosts are part of all the
queues they should be, and that none of the queues are in error. Enable
any queues that need it.
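
Re-enabling is just the mirror image of step 0, and any queue instances
still stuck in an error state can be checked at the same time:

for host in <list of nodes>; do
 qmod -e \*@$host
done
# show any queue instances still in error, with the reason
qstat -f -explain E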

--
Prentice



prentice wrote:
> templedf wrote:
>> Are you absolutely certain that the offending job isn't continuing to 
>> try to run, making it only look like a persistent problem?  
> 
> What do you mean by offending job? The one that won't run because the
> other nodes haven't "cleaned up"?
> 
> I logged into every single node in question, did a 'ps -ef', 'ps -e f',
> top, etc., everything I could to make sure the systems were, in fact, idle.
> 
>> My only 
>> suggestion would be to put a hold on that job or kill it and resubmit it 
>> to see if that helps.
> 
> You mean the job that won't run because the other nodes haven't "cleaned
> up"?
> 
> I did 'qmod -rj <jobid>', but nothing happened.
> 
> I just put a hold on it, and then removed it after a few minutes - no
> change.
> 
>> I'm pretty sure that the reschedule unknown list isn't spooled, so if 
>> the master is continuing to have the issue, it has to be being 
>> constantly recreated.
> 
> That's what I suspected, which is why I shut down sge_qmaster before
> restarting the nodes, so that the erroneous information wouldn't be
> propagated between the sgeexecd and sge_qmaster processes.
>> As a stab in the dark, have you tried taking down the entire cluster at 
>> once?  That would make sure that no non-persistent state would survive.
> 
> No, and that's not really an option. I have some uncooperative users who
> refuse to stop their jobs, and don't use checkpointing. We are expecting a
> big snowstorm tonight into Friday or Saturday. The odds are good that
> we'll lose power, which may force that to happen, anyway. Barring the
> forces of nature, rebooting the whole cluster is not an option for me.
> 
> I did try deleting some of the exec nodes and re-adding them, but
> that's a royal pain, since I have to delete them from all the hostgroups
> and queues first, and then remember to add them back. Also, if I do
> 'qconf -se > savefile' to save the host's configuration, I can't do
> 'qconf -Ae savefile', unless I edit the file to remove the entries for
> load_values and processors.
> 
>> Daniel
>>
>> On 02/24/10 13:57, prentice wrote:
>>> I'm still getting this error on many of my cluster nodes:
>>>
>>> cannot run on host "node64.aurora" until clean up of an previous run has
>>> finished
>>>
>>> I've tried just about everything I can think of to diagnose and fix this
>>> problem:
>>>
>>> 1. I restarted the execd daemons on the afflicted nodes
>>> 2. Restarted sge_qmaster
>>> 3. Shutdown the afflicted nodes, restarted sge_qmaster, restarted
>>> afflicted nodes.
>>> 4. Used 'qmod -f -cq all.q@*'
>>>
>>> I checked the spool logs on the server and the nodes (the spool dir is
>>> on a local filesystem for each), and there are no extraneous job files.
>>> In fact, the spool directory is pretty much empty.
>>>
>>> I'm using classic spooling, so it can't be a hosed bdb file.
>>>
>>> The only thing I can think of at this point is to delete the queue
>>> instances and re-add them.
>>>
>>> I know this problem was probably caused by someone running a job that
>>> used up all the RAM on these nodes and probably triggered the OOM-killer.
>>>
>>> Any other ideas?
>>>
>>>
>
