[GE users] cannot run on host until clean up of an previous run has finished

henk h.a.slim at durham.ac.uk
Wed Apr 28 15:49:32 BST 2010


Hi,

I have exactly the same problem with 6.2u5 installed on a new cluster. I
reinstalled gridengine but the problem reoccurred. Did your fix solve
the problem permanently?

Thanks

Henk

> -----Original Message-----
> From: prentice [mailto:prentice at ias.edu]
> Sent: 25 February 2010 15:26
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] cannot run on host until clean up of an
> previous run has finished
> 
> I fixed this by deleting the hosts from SGE and then re-adding them.
> For the sake of future victims of a problem like this, here's what I
> did, since there are a few minor gotchas:
> 
> 0. Disable all queue instances on the affected hosts, and make sure no
> jobs are running on them before starting.
> 
> for host in <list of nodes>; do
>  qmod -d \*@$host
> done
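> 
> (To double-check that nothing is still running on a node before you
> touch it, something like this should do the trick:)
> 
> for host in <list of nodes>; do
>  qhost -j -h $host
> done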
> 
> 1. Write the configs of all the hosts to be deleted to text files:
> 
> for host in <list of nodes>; do
>  qconf -se $host > $host.txt
> done
> 
> 2. Edit each text file and remove the entries for "load_values" and
> "processors". These are values calculated by SGE, and will generate
> errors when you try to add the execution hosts back to the config
> later on. Since the load_values entry spans multiple lines and may be
> a different number of lines on different hosts, you can't do a simple
> sed operation to remove the lines. I used 'vi *.txt' to open them all
> at once.
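> 
> (If you'd rather script that step, something like this awk one-liner
> should work, assuming qconf -se marks continued lines with a trailing
> backslash; it's untested, so eyeball the result before loading it
> back:)
> 
> for host in <list of nodes>; do
>  awk '/^(load_values|processors)/ {skip=1}
>       skip {if ($0 !~ /\\$/) skip=0; next}
>       {print}' $host.txt > $host.clean && mv $host.clean $host.txt
> done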
> 
> 3. Edit any host groups or queues that reference the nodes you are
> about to delete. You will have to edit the hostgroup @allhosts at a
> minimum:
> 
> qconf -mhgrp @allhosts
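> 
> (If you're not sure which hostgroups mention a node, something along
> these lines should list them, assuming the hostnames appear literally
> in the group definitions; the same idea works for queue hostlists via
> 'qconf -sql' and 'qconf -sq':)
> 
> for host in <list of nodes>; do
>  for hg in $(qconf -shgrpl); do
>   qconf -shgrp $hg | grep -q $host && echo "$host: $hg"
>  done
> done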
> 
> 4. Delete the missing hosts from SGE:
> 
> for host in <list of nodes>; do
>  qconf -de $host
> done
> 
> 5. Add them back:
> 
> for host in <list of nodes>; do
>  qconf -Ae $host.txt
> done
> 
> 6. Edit the hostgroups or queues you modified in step 3 to add the
> hosts back:
> 
> qconf -mhgrp @allhosts
> 
> That should be it. Be sure to check that the hosts are part of all the
> queues they should be, and that none of the queues are in error.
> Enable any queues that need it.
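> 
> (To see which queue instances are left in an error state, and to
> re-enable everything disabled in step 0, something like this should
> do:)
> 
> qstat -f -explain E
> 
> for host in <list of nodes>; do
>  qmod -e \*@$host
> done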
> 
> --
> Prentice
> 
> 
> 
> prentice wrote:
> > templedf wrote:
> >> Are you absolutely certain that the offending job isn't continuing
> >> to try to run, making it only look like a persistent problem?
> >
> > What do you mean by offending job? The one that won't run because
> > the other nodes haven't "cleaned up"?
> >
> > I logged into every single node in question, did a 'ps -ef', 'ps -e
> > f', top, etc., everything I could to make sure the systems were, in
> > fact, idle.
> >
> >> My only suggestion would be to put a hold on that job or kill it
> >> and resubmit it to see if that helps.
> >
> > You mean the job that won't run because the other nodes haven't
> > "cleaned up"?
> >
> > I did 'qmod -rj <jobid>', but nothing happened.
> >
> > I just put a hold on it, and then removed it after a few minutes -
> > no change.
> >
> >> I'm pretty sure that the reschedule unknown list isn't spooled, so
> >> if the master is continuing to have the issue, it has to be being
> >> constantly recreated.
> >
> > That's what I suspected, which is why I shut down sge_qmaster before
> > restarting the nodes, so that the erroneous information wouldn't be
> > propagated between the sgeexecd and sge_qmaster processes.
> >
> >> As a stab in the dark, have you tried taking down the entire
> >> cluster at once? That would make sure that no non-persistent state
> >> would survive.
> >
> > No, and that's not really an option. I have some uncooperative users
> > who refuse to stop their jobs, and don't use checkpointing. We are
> > expecting a big snowstorm tonight into Friday or Saturday. The odds
> > are good that we'll lose power, which may force that to happen,
> > anyway. Barring the forces of nature, rebooting the whole cluster is
> > not an option for me.
> >
> > I did try deleting some of the exec nodes and re-adding them, but
> > that's a royal pain, since I have to delete them from all the
> > hostgroups and queues first, and then remember to add them back.
> > Also, if I do 'qconf -se > savefile' to save the host's
> > configuration, I can't do 'qconf -Ae savefile', unless I edit the
> > file to remove the entries for load_values and processors.
> >
> >> Daniel
> >>
> >> On 02/24/10 13:57, prentice wrote:
> >>> I'm still getting this error on many of my cluster nodes:
> >>>
> >>> cannot run on host "node64.aurora" until clean up of an previous
> >>> run has finished
> >>>
> >>> I've tried just about everything I can think of to diagnose and
> >>> fix this problem:
> >>>
> >>> 1. I restarted the execd daemons on the afflicted nodes
> >>> 2. Restarted sge_qmaster
> >>> 3. Shut down the afflicted nodes, restarted sge_qmaster, restarted
> >>> afflicted nodes.
> >>> 4. Used 'qmod -f -cq all.q@*'
> >>>
> >>> I checked the spool logs on the server and the nodes (the spool
> >>> dir is on a local filesystem for each), and there are no
> >>> extraneous job files. In fact, the spool directory is pretty much
> >>> empty.
> >>>
> >>> I'm using classic spooling, so it can't be a hosed BDB file.
> >>>
> >>> The only thing I can think of at this point is to delete the queue
> >>> instances and re-add them.
> >>>
> >>> I know this problem was probably caused by someone running a job
> >>> that used up all the RAM on these nodes and probably triggered the
> >>> OOM-killer.
> >>>
> >>> Any other ideas?
> >>>
> >>>
> 
