[GE users] slot taken no job running

Stephan Grell stephan.grell at sun.com
Wed Aug 18 07:37:18 BST 2004


Don Shesnicky wrote:

>>Did you have a look at the messages file of the queue master in
>>$SGE_ROOT/default/spool/qmaster/messages?
>>I would suggest shutting down the execd on kane.enqsemi.com, then going
>>to the spool directory of this node,
>>$SGE_ROOT/default/spool/kane.enqsemi.com, looking inside the directories
>>active_jobs, jobs, and job_scripts, and removing whatever is inside (you
>>can also have a look there before you shut down anything, just to see
>>whether more than one job is mentioned at all; otherwise we have to look
>>somewhere else). Then restart the execd and we will see whether it's
>>gone.
>
>I've looked in the qmaster messages file and only this stands out:
>
>08/16/2004 09:40:27|qmaster|canter|E|error writing object with key "EXECHOST:kane.enqsemi.com" into berkeley database: (28) No space left on device
>08/16/2004 09:43:11|qmaster|canter|W|job 8062.1 failed on host kane.enqsemi.com general assumedly before job because: can't create directory active_jobs/8062.1: No space left on device
>
To me this sounds more like the host holding your Berkeley DB is running
out of disk space, so the qmaster cannot modify the Berkeley DB. Did you
check that? You might have to remove the job from the database. Did you
try a "qdel -f 8062"?
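
A quick way to check the disk-space theory is sketched below. It is only an illustration: the helper name avail_kb is made up here, and it assumes the spool area sits under $SGE_ROOT/default/spool (the path quoted above for the messages file). The actual location of the Berkeley DB is site-specific and may be on a different filesystem or host.

```shell
#!/bin/sh
# Sketch: report free space on the filesystem that holds a directory.
# `df -P -k` is POSIX; with -P each filesystem stays on one output line,
# so field 4 of line 2 is the available space in 1024-byte blocks.
avail_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# ASSUMPTION: the spool area is $SGE_ROOT/default/spool, as in the path
# quoted for the qmaster messages file. Adjust for your installation.
SPOOL_DIR="${SGE_ROOT:-}/default/spool"

# Fall back to / if the assumed directory does not exist on this host.
[ -d "$SPOOL_DIR" ] || SPOOL_DIR=/

echo "available on $SPOOL_DIR: $(avail_kb "$SPOOL_DIR") KB"

# Once space has been freed, the stuck job could be removed as suggested:
#   qdel -f 8062
```

If the reported figure is at or near zero, the "No space left on device" errors in the messages file would be explained, and freeing space should come before any attempt to delete the job.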

>I take it that job 8062.1 is the problem and is probably stuck in the
>berkeley db.
>
>I've totally removed the kane directory in $SGE_ROOT/default/spool/kane
>and re-installed it as an exec host, but that did nothing; it's still
>showing 1/2 slots full while no jobs are running on it.
>
>
>----------------------------------------------------------------------------
>d.norm at kane.enqsemi.com        BIP   1/2       0.14     lx24-x86      o
>----------------------------------------------------------------------------
>
>job-ID  prior   name       user         state submit/start at     queue                                slots ja-task-ID
>-----------------------------------------------------------------------------------------------------------------
>   8591 0.56000 tc_049_ab_ mllalami     r     08/17/2004 12:31:08 d.norm at dexter.enqsemi.com          1
>   8592 0.56000 tc_049_ab_ mllalami     r     08/17/2004 12:34:19 d.norm at dexter.enqsemi.com          1
>   8589 0.56000 tc_037_ei_ jrusmussen   r     08/17/2004 12:30:02 d.norm at forge.enqsemi.com           1
>
>
>
>
>Don
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net