[GE users] Broken queue - pe_hostfile permission denied
duncan at fmrib.ox.ac.uk
Mon Apr 24 14:32:07 BST 2006
Overnight, one of our queues has started malfunctioning. We have
three processing queues, verylong.q, long.q and short.q, with
verylong.q being a subordinate of short.q. We are running Grid Engine
6u6 on Mac OS X Server 10.4.5, with NFS shared SGE_ROOT, local spool
directories and classic spooling.
Since this morning, a high proportion of jobs submitted to the short
queue fail, setting the queue for that particular host to the ERROR
state. All processing hosts are affected and I would estimate around
70-80% of the jobs fail. We have tried simple tasks such as 'ls' or
'du' and these are just as likely to result in a failure as a _real_
job.
The failure email contains the following:
Job 10446 caused action: Queue "short.q@<hostname>.fmrib.ox.ac.uk"
set to ERROR
User = <username>
Queue = short.q@<hostname>.fmrib.ox.ac.uk
Host = <hostname>.fmrib.ox.ac.uk
Start Time = <unknown>
End Time = <unknown>
failed assumedly before job:cant open file /var/spool/sge/<hostname>/
active_jobs/10446.1/pe_hostfile: Permission denied
And the local message file on the exec host contains:
04/24/2006 12:50:29|execd|<hostname>|E|can't start job "10446": cant
open file /var/spool/sge/<hostname>/active_jobs/10446.1/pe_hostfile:
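
As a rough aid for gauging how widespread the failure is across hosts, the sketch below tallies pe_hostfile errors per host from execd messages lines. It assumes each entry sits on a single line with five pipe-separated fields (timestamp, daemon, host, level, text), which is inferred only from the excerpt above. The function names and the sample host "node01" are hypothetical.

```python
from collections import Counter


def parse_message_line(line):
    """Split one messages line into its five assumed pipe-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, text = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text,
    }


def count_pe_hostfile_errors(lines):
    """Count error-level ("E") entries mentioning pe_hostfile, keyed by host."""
    counts = Counter()
    for line in lines:
        record = parse_message_line(line)
        if record and record["level"] == "E" and "pe_hostfile" in record["text"]:
            counts[record["host"]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample line modelled on the excerpt above.
    sample = [
        '04/24/2006 12:50:29|execd|node01|E|can\'t start job "10446": '
        "cant open file /var/spool/sge/node01/active_jobs/10446.1/pe_hostfile",
    ]
    for host, n in count_pe_hostfile_errors(sample).items():
        print(host, n)
```

Run over the messages file from each exec host's local spool directory, a tally like this could show whether the errors cluster on particular hosts or are spread evenly.
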
We've tried deleting and recreating the queue, but to no effect. Jobs
seem to be running fine on the long.q and verylong.q queues.
Has anyone seen this kind of behaviour before, or have any
suggestions on how to resolve it?
Duncan A B Mortimer DPhil MChem
Computing Officer, FMRIB Centre, University of Oxford,
John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK.
Tel: (0)1865 222713 Mobile: (0)7748 105057
WWW: http://www.fmrib.ox.ac.uk/~duncan email: duncan at fmrib.ox.ac.uk