[GE users] Job Failure Deletes Local Spool Directory
Dan.Gruhn at Group-W-Inc.com
Tue Feb 22 13:51:27 GMT 2005
I am running N1GE 6.0u3 on Fedora Core 1 and am seeing a strange failure mode.
I get an administrator email like the following:
"N1GE 6.0u3: Job-array task 6574.262
Job 6574 caused action: Queue "low.q at class05-lx.group-w-inc.com" set to
User = dgruhn
Queue = low.q at class05-lx.group-w-inc.com
Host = class05-lx.group-w-inc.com
Start Time = <unknown>
End Time = <unknown>
failed assumedly before job:can't create directory active_jobs/6574.262:
No such file or directory
When I look on the host, I see that the execution daemon is running just
fine, but that my local spool directory (/tmp/sgespool in my case) is
completely gone without a trace. There is no /tmp/execd error file or
These hosts are single processor, Pentium(R) 4 CPU 1.80GHz with 512 MB
of RAM. They are the least capable in my set of hosts. The error
doesn't happen a lot, but it has happened enough that I'd like to solve
it if possible. Of course, SGE recovers the job and runs it on another
host, but that queue is out of action until I shut down the execution
daemon and bring it back up. It then recreates the local spool dir and
all is well.
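
As a stopgap for noticing the problem before the next job fails, a check
along these lines could be run periodically on each execution host. This
is only a sketch: the function name spool_dir_status is made up for
illustration, the default path is simply the /tmp/sgespool location
mentioned above, and it does not talk to SGE at all. It only reports
whether the directory is still there.

```python
import os

# Path taken from the post above; change it to match your own setup.
SPOOL_DIR = "/tmp/sgespool"


def spool_dir_status(path=SPOOL_DIR):
    """Classify the state of the local spool directory.

    Returns one of:
      "ok"               the path exists and is a directory
      "missing"          nothing exists at the path
      "not-a-directory"  something exists there, but it is not a directory
    """
    # lexists() is true even for a dangling symlink, so a broken link
    # is reported as "not-a-directory" rather than "missing".
    if not os.path.lexists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not-a-directory"
    return "ok"
```

Calling spool_dir_status() from a small cron-driven script and mailing
the admin whenever the result is not "ok" would at least flag the host
before the queue goes into an error state. It does not explain what
removes the directory in the first place.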
Has anyone else experienced this, or does anyone have an idea what may
be happening? That is, what in SGE would delete the entire local spool
directory tree but leave the execution daemon running?
Any help will be greatly appreciated.