[GE users] cannot run on host until clean up of an previous run has finished

templedf dan.templeton at sun.com
Wed Feb 24 14:23:13 GMT 2010


There is no explicit way to clear that state that I recall.  I'd have to 
go look at the source again to remember where exactly that state lives, 
but you could try bouncing that execd, and if that doesn't clear it, try 
bouncing the master.

Daniel
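
In case it helps to see which execds are candidates for bouncing, here is a
minimal sketch that pulls the host names out of saved `qstat -j <jobid>`
output. It assumes the scheduler message has exactly the wording quoted below
in this thread; the function names are hypothetical helpers, not part of SGE.
The second helper only builds `qconf -ke <host>` command strings (the usual
way to ask an execd to shut down) and runs nothing, so check them against
your own installation before using them.

```python
import re

# Matches the scheduler message quoted in this thread, for example:
#   cannot run on host "node24.aurora" until clean up of an previous run has finished
# Mail clients and terminals may wrap the message, so whitespace is matched loosely.
_CLEANUP_RE = re.compile(
    r'cannot\s+run\s+on\s+host\s+"([^"]+)"\s+until\s+clean\s+up\s+of\s+an\s+previous\s+run'
)


def hosts_awaiting_cleanup(qstat_output):
    """Return the sorted, de-duplicated hosts named in "clean up" messages."""
    return sorted(set(_CLEANUP_RE.findall(qstat_output)))


def execd_shutdown_commands(hosts):
    """Build (but do not run) one execd shutdown command per host.

    Assumption: `qconf -ke <host>` is how the execd is stopped at this site.
    The execd still has to be started again on each host afterwards.
    """
    return ["qconf -ke " + host for host in hosts]
```

Feeding it the text of `qstat -j <jobid>` gives the list of affected hosts,
which is also a quick way to see whether the list shrinks after an execd has
been restarted.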

On 02/24/10 06:08, prentice wrote:
> This problem has been going on much longer than 5 minutes. Is there a
> way to clear this "error"? No error is shown for the queue instance, but
> jobs aren't running.
>
> templedf wrote:
>    
>> The "cleanup" is really just an excuse.  When a job fails on a host,
>> there's a timeout (5 minutes, I think) before it's allowed to try
>> running on that host again.
>>
>> Daniel
>>
>> On 02/24/10 05:54, prentice wrote:
>>      
>>> Dear GE Users,
>>>
>>> A couple of weeks ago, that big snowstorm that hit the Mid-Atlantic took
>>> out the power to my server room, causing the cluster to go down very
>>> ungracefully.
>>>
>>> Now, a large job can't run because SGE says there aren't enough slots for
>>> the PE. When I do qstat -j <jobid>, I get a lot of messages like this:
>>>
>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>> finished
>>>
>>> I'm sure this is leftover from the ungraceful shutdown of SGE. What is
>>> the best way to "clean up" these previous runs?
>>>
>>>
>>>        
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245864
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>>      
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245870
