[GE users] Job Failure Deletes Local Spool Directory

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Tue Feb 22 13:51:27 GMT 2005


Greetings Everyone,

I am running SGE 6.0u3 on Fedora Core 1 and am seeing a strange failure
mode.  I get an administrator email like the following:

Subject: "N1GE 6.0u3: Job-array task 6574.262 failed"
Job 6574 caused action: Queue "low.q at class05-lx.group-w-inc.com" set to ERROR
 User        = dgruhn
 Queue       = low.q at class05-lx.group-w-inc.com
 Host        = class05-lx.group-w-inc.com
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job:can't create directory active_jobs/6574.262:
No such file or directory

When I look on the host, I see that the execution daemon is running just
fine, but that my local spool directory (/tmp/sgespool in my case) is
completely gone without a trace.  There is no /tmp/execd error file or
anything.

These hosts are single-processor, Pentium(R) 4 CPU 1.80GHz with 512 MB
of RAM.  They are the least capable in my set of hosts.  The error
doesn't happen a lot, but it has happened enough that I'd like to solve
it if possible.  Of course, SGE recovers the job and runs it on another
host, but that queue is out of action until I shut down the execution
daemon and bring it back up.  It then recreates the local spool dir and
all is well.
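
To narrow down when the directory disappears, a small watchdog along
these lines could be left running on an affected host.  This is only a
minimal sketch and not part of SGE: the function names, the polling
interval and the max_checks/sleep parameters are illustrative
assumptions; only the /tmp/sgespool path comes from the setup described
above.

```python
import os
import time


def spool_dir_missing(spool_dir):
    """Return True if the local spool directory no longer exists."""
    return not os.path.isdir(spool_dir)


def watch_spool_dir(spool_dir, interval_s=60, max_checks=None, sleep=time.sleep):
    """Poll spool_dir and report when it is first found missing.

    Returns the time (seconds since the epoch) of the first poll that
    found the directory gone, or None if max_checks polls completed
    while the directory was still present.  With max_checks=None the
    loop runs until the directory disappears.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if spool_dir_missing(spool_dir):
            return time.time()
        checks += 1
        sleep(interval_s)
    return None
```

Calling watch_spool_dir("/tmp/sgespool") and logging the returned time
would give a timestamp to compare against cron runs or other periodic
activity on the host.  It does not identify what removed the directory.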

Has anyone else experienced this, or does anyone have an idea what may
be happening?  That is, what in SGE would delete the entire local spool
directory tree but leave the execution daemon running?

Any help will be greatly appreciated.

Dan




