[GE users] cannot run on host until clean up of an previous run has finished

reuti reuti at staff.uni-marburg.de
Wed Feb 24 14:54:43 GMT 2010


Hi,

Am 24.02.2010 um 15:52 schrieb prentice:

> I searched for any extraneous files on the affected nodes and couldn't
> find any. I did more than just bounce execd - I went whole-hog and
> rebooted the nodes. Still no resolution.
>
> Bouncing the master seems like the next logical step. However, I have a
> few hundred jobs running on my cluster. Is it safe to bounce the master
> without affecting running jobs?

Is there something left in the spool directory of the nodes,
especially in the "jobs" subdirectory?
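
A minimal sketch of such a check, assuming Python is available on the
node. The function name is made up for illustration, and the spool path
has to be supplied by you: it is whatever execd spool directory is
configured for that node, which I cannot know from here.

```python
import os
import sys


def leftover_job_entries(spool_dir):
    """List everything found below <spool_dir>/jobs, relative to that directory.

    Returns an empty list if the "jobs" subdirectory is missing or empty.
    """
    jobs_dir = os.path.join(spool_dir, "jobs")
    entries = []
    # os.walk yields nothing when jobs_dir does not exist
    for root, dirs, files in os.walk(jobs_dir):
        for name in dirs + files:
            full_path = os.path.join(root, name)
            entries.append(os.path.relpath(full_path, jobs_dir))
    return sorted(entries)


if __name__ == "__main__":
    # Each argument is the execd spool directory of one node
    for spool in sys.argv[1:]:
        for entry in leftover_job_entries(spool):
            print(os.path.join(spool, "jobs", entry))
```

Run it with the node's spool directory as the argument. If it prints
nothing, the "jobs" subdirectory is empty or absent.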

-- Reuti


> Prentice
>
> templedf wrote:
>> There is no explicit way to clear that state that I recall.  I'd have to
>> go look at the source again to remember where exactly that state lives,
>> but you could try bouncing that execd, and if that doesn't clear it, try
>> bouncing the master.
>>
>> Daniel
>>
>> On 02/24/10 06:08, prentice wrote:
>>> This problem has been going on much longer than 5 minutes. Is there a
>>> way to clear this "error"? No error is shown for the queue instance, but
>>> jobs aren't running.
>>>
>>> templedf wrote:
>>>
>>>> The "cleanup" is really just an excuse.  When a job fails on a host,
>>>> there's a timeout (5 minutes, I think) before it's allowed to try
>>>> running on that host again.
>>>>
>>>> Daniel
>>>>
>>>> On 02/24/10 05:54, prentice wrote:
>>>>
>>>>> Dear GE Users,
>>>>>
>>>>> A couple of weeks ago, that big snowstorm that hit the mid-Atlantic took
>>>>> out the power to my server room, causing the cluster to go down very
>>>>> ungracefully.
>>>>>
>>>>> Now, a large job can't run because SGE says there aren't enough slots for
>>>>> the PE. When I do qstat -j <jobid>, I get a lot of messages like this:
>>>>>
>>>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>>>> finished
>>>>>
>>>>> I'm sure this is left over from the ungraceful shutdown of SGE. What is
>>>>> the best way to "clean up" these previous runs?
>>>>>
>>>>>
>>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245864
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>
>>>>
>
> -- 
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=245878

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


