[GE users] cannot run on host until clean up of an previous run has finished

prentice prentice at ias.edu
Wed Feb 24 14:59:25 GMT 2010


reuti wrote:
> Hi,
> 
> Am 24.02.2010 um 15:52 schrieb prentice:
> 
>> I searched for any extraneous files on the affected nodes, and couldn't
>> find any. I did more than just bounce execd - I went whole-hog and
>> rebooted the nodes. Still no resolution.
>>
>> Bouncing the master seems like the next logical step. However, I have a
>> few hundred jobs running on my cluster. Is it safe to bounce the master
>> without affecting running jobs?
> 
> Is there something left in the spool directory of the nodes, especially
> in the "jobs" subdirectory?

No, it's completely empty on all the nodes I've checked so far:

cd /var/local/sge/default/spool/

find .
.
./node01
./node01/execd.pid
./node01/jobs
./node01/job_scripts
./node01/active_jobs
./node01/messages
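
For anyone who wants to repeat the check, this is roughly the sweep I'm
running from the head node. It's only a sketch: the node list and the spool
path are specific to my installation, so adjust both for yours:

#!/bin/sh
# Sketch: look for leftover job state in each execd spool directory.
# SPOOL and the node list below are assumptions from my own setup.
SPOOL=/var/local/sge/default/spool

for node in node01 node02 node24; do
    echo "== $node =="
    # List anything still sitting in the jobs/ and active_jobs/ subdirectories
    ssh "$node" "find $SPOOL/$node/jobs $SPOOL/$node/active_jobs -mindepth 1"
done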

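For what it's worth, on the question of bouncing the master: my understanding
is that running jobs are managed by their execds, so restarting sge_qmaster
should leave them alone (scheduling just pauses until it's back). If I go that
route, the restart would look roughly like this; the startup script location
is an assumption and varies between installs:

# Sketch only -- verify the script path for your own installation first.
qconf -km                                 # ask sge_qmaster to shut down
# wait for the sge_qmaster process to exit, then restart it:
$SGE_ROOT/default/common/sgemaster start  # or /etc/init.d/sgemaster start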

>>
>> templedf wrote:
>>> There is no explicit way to clear that state that I recall.  I'd have to
>>> go look at the source again to remember where exactly that state lives,
>>> but you could try bouncing that execd, and if that doesn't clear it, try
>>> bouncing the master.
>>>
>>> Daniel
>>>
>>> On 02/24/10 06:08, prentice wrote:
>>>> This problem has been going on much longer than 5 minutes. Is there a
>>>> way to clear this "error"? No error is shown for the queue instance, but
>>>> jobs aren't running.
>>>>
>>>> templedf wrote:
>>>>
>>>>> The "cleanup" really just an excuse.  When a job fails on a host,
>>>>> there's a timeout (5 minutes, I think) before it's allowed to try
>>>>> running on that host again.
>>>>>
>>>>> Daniel
>>>>>
>>>>> On 02/24/10 05:54, prentice wrote:
>>>>>
>>>>>> Dear GE Users,
>>>>>>
>>>>>> A couple of weeks ago, that big snowstorm that hit the mid-Atlantic took
>>>>>> out the power to my server room, causing the cluster to go down very
>>>>>> ungracefully.
>>>>>>
>>>>>> Now, a large job can't run because SGE says there aren't enough slots
>>>>>> for the PE. When I do qstat -j <jobid>, I get a lot of messages like
>>>>>> this:
>>>>>>
>>>>>> cannot run on host "node24.aurora" until clean up of an previous run has
>>>>>> finished
>>>>>>
>>>>>> I'm sure this is left over from the ungraceful shutdown of SGE. What is
>>>>>> the best way to "clean up" these previous runs?
>>>>>>
>>>>>>
>>>>>>
>> -- 
>> Prentice Bisbal
>> Linux Software Support Specialist/System Administrator
>> School of Natural Sciences
>> Institute for Advanced Study
>> Princeton, NJ
>>
> 
