[GE users] Broken queue - pe_hostfile permission denied

Duncan Mortimer duncan at fmrib.ox.ac.uk
Mon Apr 24 14:32:07 BST 2006


Hi,

Overnight, one of our queues has started malfunctioning. We have  
three processing queues, verylong.q, long.q and short.q, with  
verylong.q being a subordinate of short.q. We are running Grid Engine  
6u6 on Mac OS X Server 10.4.5, with NFS shared SGE_ROOT, local spool  
directories and classic spooling.

Since this morning, a high proportion of jobs submitted to the short
queue fail, setting the queue for that particular host to the ERROR
state. All processing hosts are affected, and I would estimate that
around 70-80% of the jobs fail. We have tried simple tasks such as 'ls'
or 'du', and these are just as likely to result in a failure as a _real_
job.

The failure email contains the following:

Job 10446 caused action: Queue "short.q@<hostname>.fmrib.ox.ac.uk"  
set to ERROR
  User        = <username>
  Queue       = short.q@<hostname>.fmrib.ox.ac.uk
  Host        = <hostname>.fmrib.ox.ac.uk
  Start Time  = <unknown>
  End Time    = <unknown>
failed assumedly before job:cant open file /var/spool/sge/<hostname>/active_jobs/10446.1/pe_hostfile: Permission denied

And the local message file on the exec host contains:

04/24/2006 12:50:29|execd|<hostname>|E|can't start job "10446": cant  
open file /var/spool/sge/<hostname>/active_jobs/10446.1/pe_hostfile:  
Permission denied
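
For anyone wanting to tally these failures per host, here is a minimal
sketch (not something we have run in production) that pulls the job id
and the offending path out of messages lines of the form shown above. It
assumes each entry sits on a single line in the messages file (the line
above is only wrapped by the mail client) and that the fields are
separated by '|'. The names ExecdError, parse_execd_error and
failures_per_host are made up for illustration and are not part of
Grid Engine.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class ExecdError(NamedTuple):
    """One "can't start job" entry from an execd messages file (hypothetical type)."""
    timestamp: str
    host: str
    job_id: str
    path: str


# Pattern modelled only on the single log line quoted in this post;
# other Grid Engine versions may word the message differently.
_PATTERN = re.compile(
    r"^(?P<timestamp>[^|]+)\|execd\|(?P<host>[^|]+)\|E\|"
    r"can't start job \"(?P<job_id>\d+)\": cant\s+open file\s+"
    r"(?P<path>\S+):\s+Permission denied"
)


def parse_execd_error(line: str) -> Optional[ExecdError]:
    """Return the parsed entry, or None if the line is not a matching error."""
    match = _PATTERN.match(line.strip())
    if match is None:
        return None
    return ExecdError(
        timestamp=match.group("timestamp"),
        host=match.group("host"),
        job_id=match.group("job_id"),
        path=match.group("path"),
    )


def failures_per_host(lines: Iterable[str]) -> Counter:
    """Count matching permission-denied errors per exec host."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_execd_error(line)
        if entry is not None:
            counts[entry.host] += 1
    return counts
```

Feeding it the lines of each host's messages file would give a rough
per-host failure count to compare against the number of jobs submitted.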

We've tried deleting and recreating the queue, but to no effect. Jobs
seem to be running fine on the long.q and verylong.q queues.

Has anyone seen this kind of behaviour before, or does anyone have any
suggestions on how to resolve it?

Thanks

Duncan
-- 
Duncan A B Mortimer DPhil MChem
                 Computing Officer, FMRIB Centre, University of Oxford,
               John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
Tel: (0)1865 222713                             Mobile: (0)7748 105057
WWW: http://www.fmrib.ox.ac.uk/~duncan    email: duncan at fmrib.ox.ac.uk

