[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 16:17:33 GMT 2010


I just restarted sge_qmaster using the sgemaster startup script:

service sgemaster.aurora stop
service sgemaster.aurora start

Still getting the same errors:

cannot run on host "node19.aurora" until clean up of an previous run has
finished
cannot run on host "node26.aurora" until clean up of an previous run has
finished
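
In case it helps anyone hitting the same thing, here is a rough sketch for pulling the affected host names out of that output. It assumes the scheduler messages from qstat -j are worded exactly as quoted above; the function name blocked_hosts and the job id in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch only: list the unique hosts that the scheduler messages report
# as still waiting for clean up. Reads qstat -j style text on stdin.
# "blocked_hosts" is a hypothetical helper name, not part of SGE.
blocked_hosts() {
    # Keep only the lines carrying the clean-up message,
    # pull out the quoted host name, and de-duplicate.
    grep 'until clean up of an previous run' |
        sed -n 's/.*cannot run on host "\([^"]*\)".*/\1/p' |
        sort -u
}

# Usage (12345 is a placeholder job id):
#   qstat -j 12345 | blocked_hosts
```

The resulting list is only a starting point for deciding which execd daemons to look at; it does not clear the state by itself.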

Any other ideas?

Do I need to reboot the entire master node? I'd rather not do that,
since my master node provides some network services (IB subnet
manager, etc.) to the cluster and acts as a gateway to the outside
world for others (name services, etc.). Not sure what would happen if
those services disappeared.

Prentice


templedf wrote:
> There is no explicit way to clear that state that I recall.  I'd have to 
> go look at the source again to remember where exactly that state lives, 
> but you could try bouncing that execd, and if that doesn't clear it, try 
> bouncing the master.
> 
> Daniel
> 
> On 02/24/10 06:08, prentice wrote:
>> This problem has been going on much longer than 5 minutes. Is there a
>> way to clear this "error"? No error is shown for the queue instance, but
>> jobs aren't running.
>>
>> templedf wrote:
>>    
>>> The "cleanup" is really just an excuse.  When a job fails on a host,
>>> there's a timeout (5 minutes, I think) before it's allowed to try
>>> running on that host again.
>>>
>>> Daniel
>>>
>>> On 02/24/10 05:54, prentice wrote:
>>>      
>>>> Dear GE Users,
>>>>
>>>> A couple of weeks ago, that big snowstorm that hit the mid-atlantic took
>>>> out the power to my server room, causing the cluster to go down very
>>>> ungracefully.
>>>>
>>>> Now, a large job can't run because SGE says there aren't enough slots for
>>>> the PE. When I do qstat -j <jobid>, I get a lot of messages like this:
>>>>
>>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>>> finished
>>>>
>>>> I'm sure this is leftover from the ungraceful shutdown of SGE. What is
>>>> the best way to "clean up" these previous runs?
>>>>
>>>>
>>>>        
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245864
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>      

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245896
