[GE users] cannot run on host until clean up of an previous run has finished

reuti reuti at staff.uni-marburg.de
Sun May 2 17:35:46 BST 2010


Hi,

On 28.04.2010, at 16:49, henk wrote:

> I have exactly the same problem with 6.2u5 installed on a new cluster. I
> reinstalled gridengine but the problem reoccurred. Did your fix solve the
> problem permanently?

Sometimes it might also help to get rid of the complete directory structure
which is created per job for each exechost:

/usr/sge/spool/node01/jobs/00/0000

i.e. empty all of /usr/sge/spool/node01/jobs while SGE is shut down and no
jobs are left on the nodes.
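
A minimal sketch of that cleanup, assuming the execd spool directory of each
node lives locally under /usr/sge/spool/<hostname> as in the path above, that
you can ssh to the nodes as root, and with made-up node names:

for host in node01 node02 node03; do
# run only while sge_execd is stopped on $host and no jobs are left there
ssh $host "rm -rf /usr/sge/spool/$host/jobs/*"
done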

-- Reuti


> Thanks
>
> Henk
>
>> -----Original Message-----
>> From: prentice [mailto:prentice at ias.edu]
>> Sent: 25 February 2010 15:26
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] cannot run on host until clean up of an
>> previous run has finished
>>
>> I fixed this by deleting the hosts from SGE and then re-adding them. For
>> the sake of future victims of a problem like this, here's what I did,
>> since there are a few minor gotchas:
>>
>> 0. Disable all queues on the affected hosts, and make sure no jobs are
>> running on them before starting.
>>
>> for host in <list of nodes>; do
>> qmod -d \*@$host
>> done
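>>
>> A quick way to double-check that, for example with qhost (which should
>> list any jobs still registered on a host underneath its entry), is:
>>
>> for host in <list of nodes>; do
>> qhost -j -h $host
>> done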
>>
>> 1. Write the configs of all the hosts to be deleted to text files:
>>
>> for host in <list of nodes>; do
>> qconf -se $host > $host.txt
>> done
>>
>> 2. Edit each text file and remove the entries for "load_values" and
>> "processors". These are values calculated by SGE, and will generate
>> errors when you try to add the execution hosts back to the config later
>> on. Since the load_values entry spans multiple lines and may be a
>> different number of lines on different hosts, you can't do a simple sed
>> operation to remove the lines. I used vi *.txt to open them all at once.
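>>
>> If you would rather script it, a rough awk alternative could look like
>> the following. It assumes qconf wraps the long load_values entry with a
>> trailing backslash on each continued line, so check one of the files
>> first:
>>
>> for host in <list of nodes>; do
>> awk '/^(load_values|processors)[ \t]/ { skip = /\\$/; next }
>>      skip { skip = /\\$/; next }
>>      { print }' $host.txt > $host.clean && mv $host.clean $host.txt
>> done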
>>
>> 3. Edit any host groups or queues that reference the nodes you are about
>> to delete. You will have to edit the hostgroup @allhosts at a minimum:
>>
>> qconf -mhgrp @allhosts
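>>
>> A scriptable alternative, if you would rather not edit the group by hand
>> and qconf -dattr is available in your version, is to drop each host from
>> the hostlist directly (verify on one host first):
>>
>> for host in <list of nodes>; do
>> qconf -dattr hostgroup hostlist $host @allhosts
>> done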
>>
>> 4. Delete the missing hosts from SGE:
>>
>> for host in <list of nodes>; do
>> qconf -de $host
>> done
>>
>> 5. Add them back:
>>
>> for host in <list of nodes>; do
>> qconf -Ae $host.txt
>> done
>>
>> 6. Edit the hostgroups or queues you modified in step 3 to add the hosts
>> back:
>>
>> qconf -mhgrp @allhosts
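>>
>> The scriptable counterpart for re-adding them, again assuming qconf
>> -aattr works for you, would be something like:
>>
>> for host in <list of nodes>; do
>> qconf -aattr hostgroup hostlist $host @allhosts
>> done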
>>
>> That should be it. Be sure to check that the hosts are part of all the
>> queues they should be, and that none of the queues are in error. Enable
>> any queues that need it.
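>>
>> For those last points, something along these lines should do; the
>> -explain flag is worth confirming against your qstat man page:
>>
>> # show the reason for any queue instance left in an error state
>> qstat -f -explain E
>> # re-enable the queue instances that were disabled in step 0
>> for host in <list of nodes>; do
>> qmod -e \*@$host
>> done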
>>
>> --
>> Prentice
>>
>>
>>
>> prentice wrote:
>>> templedf wrote:
>>>> Are you absolutely certain that the offending job isn't continuing to
>>>> try to run, making it only look like a persistent problem?
>>>
>>> What do you mean by offending job? The one that won't run because the
>>> other nodes haven't "cleaned up"?
>>>
>>> I logged into every single node in question, did a 'ps -ef', 'ps -e f',
>>> top, etc., everything I could to make sure the systems were, in fact,
>>> idle.
>>>
>>>> My only suggestion would be to put a hold on that job or kill it and
>>>> resubmit it to see if that helps.
>>>
>>> You mean the job that won't run because the other nodes haven't
>>> "cleaned up"?
>>>
>>> I did 'qmod -rj <jobid>', but nothing happened.
>>>
>>> I just put a hold on it, and then removed it after a few minutes - no
>>> change.
>>>
>>>> I'm pretty sure that the reschedule unknown list isn't spooled, so if
>>>> the master is continuing to have the issue, it must be getting
>>>> recreated constantly.
>>>
>>> That's what I suspected, which is why I shut down sge_qmaster before
>>> restarting the nodes, so that the erroneous information wouldn't be
>>> propagated between the sgeexecd and sge_qmaster processes.
>>>> As a stab in the dark, have you tried taking down the entire cluster at
>>>> once? That would make sure that no non-persistent state would survive.
>>>
>>> No, and that's not really an option. I have some uncooperative users who
>>> refuse to stop their jobs, and don't use checkpointing. We are expecting
>>> a big snowstorm tonight into Friday or Saturday. The odds are good that
>>> we'll lose power, which may force that to happen anyway. Barring the
>>> forces of nature, rebooting the whole cluster is not an option for me.
>>>
>>> I did try deleting some of the exec nodes and re-adding them, but that's
>>> a royal pain, since I have to delete them from all the hostgroups and
>>> queues first, and then remember to add them back. Also, if I do
>>> 'qconf -se <hostname> > savefile' to save a host's configuration, I
>>> can't do 'qconf -Ae savefile' unless I edit the file to remove the
>>> entries for load_values and processors.
>>>
>>>> Daniel
>>>>
>>>> On 02/24/10 13:57, prentice wrote:
>>>>> I'm still getting this error on many of my cluster nodes:
>>>>>
>>>>> cannot run on host "node64.aurora" until clean up of an previous run
>>>>> has finished
>>>>>
>>>>> I've tried just about everything I can think of to diagnose and fix
>>>>> this problem:
>>>>>
>>>>> 1. I restarted the execd daemons on the afflicted nodes
>>>>> 2. Restarted sge_qmaster
>>>>> 3. Shut down the afflicted nodes, restarted sge_qmaster, restarted the
>>>>> afflicted nodes.
>>>>> 4. Used 'qmod -f -cq all.q@*'
>>>>>
>>>>> I checked the spool logs on the server and the nodes (the spool dir is
>>>>> on a local filesystem for each), and there are no extraneous job
>>>>> files. In fact, the spool directory is pretty much empty.
>>>>>
>>>>> I'm using classic spooling, so it can't be a hosed bdb file.
>>>>>
>>>>> The only thing I can think of at this point is to delete the queue
>>>>> instances and re-add them.
>>>>>
>>>>> I know this problem was probably caused by someone running a job that
>>>>> used up all the RAM on these nodes and probably triggered the
>>>>> OOM-killer.
>>>>>
>>>>> Any other ideas?
>>>>>
>>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255819

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list