[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 14:52:15 GMT 2010


I searched for any extraneous files on the affected nodes, and couldn't
find any. I did more than just bounce execd - I went whole-hog and
rebooted the nodes. Still no resolution.
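
To list the affected nodes, the scheduler messages printed by qstat -j can be filtered for host names. This is only a minimal sketch, assuming the message text is exactly as quoted further down in this thread. The function name hosts_awaiting_cleanup and the script name in the usage note are hypothetical and not part of Grid Engine.

```python
import re
import sys

# Pattern built from the scheduler message quoted in this thread.
# \s+ between words tolerates plain whitespace or newline wrapping
# in pasted output (it does not handle ">" quote prefixes).
_CLEANUP_MSG = re.compile(
    r'cannot\s+run\s+on\s+host\s+"([^"]+)"\s+until\s+clean\s+up'
    r'\s+of\s+an\s+previous\s+run\s+has\s+finished'
)


def hosts_awaiting_cleanup(qstat_output):
    """Return the sorted, de-duplicated host names found in cleanup messages."""
    return sorted(set(_CLEANUP_MSG.findall(qstat_output)))


if __name__ == "__main__":
    # Read qstat output from standard input and print one host per line.
    for host in hosts_awaiting_cleanup(sys.stdin.read()):
        print(host)
```

Assuming the script is saved as list_cleanup_hosts.py, it could be fed with: qstat -j <jobid> | python list_cleanup_hosts.py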

Bouncing the master seems like the next logical step. However, I have a
few hundred jobs running on my cluster. Is it safe to bounce the master,
or will that affect running jobs?

Prentice

templedf wrote:
> There is no explicit way to clear that state that I recall.  I'd have to 
> go look at the source again to remember where exactly that state lives, 
> but you could try bouncing that execd, and if that doesn't clear it, try 
> bouncing the master.
> 
> Daniel
> 
> On 02/24/10 06:08, prentice wrote:
>> This problem has been going on much longer than 5 minutes. Is there a
>> way to clear this "error"? No error is shown for the queue instance, but
>> jobs aren't running.
>>
>> templedf wrote:
>>    
>>> The "cleanup" is really just an excuse.  When a job fails on a host,
>>> there's a timeout (5 minutes, I think) before it's allowed to try
>>> running on that host again.
>>>
>>> Daniel
>>>
>>> On 02/24/10 05:54, prentice wrote:
>>>      
>>>> Dear GE Users,
>>>>
>>>> A couple of weeks ago, that big snowstorm that hit the Mid-Atlantic took
>>>> out the power to my server room, causing the cluster to go down very
>>>> ungracefully.
>>>>
>>>> Now, a large job can't run because SGE says there aren't enough slots for
>>>> the PE. When I do qstat -j <jobid>, I get a lot of messages like this:
>>>>
>>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>>> finished
>>>>
>>>> I'm sure this is leftover from the ungraceful shutdown of SGE. What is
>>>> the best way to "clean up" these previous runs?
>>>>
>>>>
>>>>        
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245864
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>      
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245870

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245876