[GE users] Job Failure Deletes Local Spool Directory

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Tue Feb 22 14:21:51 GMT 2005


Interesting idea, but I don't see any cron jobs that do this.  As a
test, I have created a file in /tmp on each host to see whether it
disappears the next time this happens.
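
For anyone who wants to run the same test, here is a rough sketch of it in
Python. It is only an illustration: the marker file name and the helper
names are made up, and /tmp/sgespool is simply the local spool path from
my setup.

```python
import os
import time


def drop_marker(directory, name="sge_tmp_marker"):
    """Create a small marker file in `directory` and return its path.

    The file name is a hypothetical choice; any name that nothing else
    on the host uses would do.
    """
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.write("created %s\n" % time.ctime())
    return path


def check_state(marker_path, spool_dir):
    """Report whether the marker file and the spool directory still exist."""
    return {
        "marker_present": os.path.exists(marker_path),
        "spool_present": os.path.isdir(spool_dir),
    }


# Example use on an execution host (paths are from my setup):
#   marker = drop_marker("/tmp")
#   ... later, after the queue has gone into the error state ...
#   print(check_state(marker, "/tmp/sgespool"))
```

If both the marker and the spool directory are gone, something is
probably clearing /tmp as a whole. If only the spool directory is gone,
that points at something removing it specifically, although a cleaner
that removes files by age could not be ruled out, since the marker would
be newer than the spool files.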

Anyone have any other ideas?

Dan

On Tue, 2005-02-22 at 09:00, Reuti wrote:

> Hi,
> 
> maybe it wasn't done by SGE: is there a cron job running on the machine 
> to clean the /tmp from time to time?
> 
> Cheers - Reuti
> 
> 
> Dan Gruhn wrote:
> > Greetings Everyone,
> > 
> > I am using Fedora Core 1 to run N1GE 6.0u3 and have a strange failure
> > mode.  I get an administration email like the following:
> > 
> > Subject:  	"N1GE 6.0u3: Job-array task 6574.262 failed "
> > 
> > 
> > Job 6574 caused action: Queue "low.q at class05-lx.group-w-inc.com" set to ERROR
> >  User        = dgruhn
> >  Queue       = low.q at class05-lx.group-w-inc.com
> >  Host        = class05-lx.group-w-inc.com
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> > failed assumedly before job:can't create directory active_jobs/6574.262: 
> > No such file or directory
> > 
> > 
> > 
> > When I look on the host, I see that the execution daemon is running just 
> > fine, but that my local spool directory (/tmp/sgespool in my case) is 
> > completely gone without a trace.  There is no /tmp/execd error file or 
> > anything.
> > 
> > These hosts are single processor, Pentium(R) 4 CPU 1.80GHz with 512 MB 
> > of RAM.  They are the least capable in my set of hosts.  The error 
> > doesn't happen a lot, but it has happened enough that I'd like to solve 
> > it if possible.  Of course, SGE recovers the job and runs it on another 
> > host, but that queue is out of action until I shut down the execution 
> > daemon and bring it back up.  It then recreates the local spool dir and 
> > all is well.
> > 
> > Has anyone else experienced this or have any idea what may be 
> > happening?  That is, what in SGE would delete the entire local spool 
> > directory tree but leave the executor running?
> > 
> > Any help will be greatly appreciated.
> > 
> > Dan
> > 
> > 
> 
> 


