[GE users] Job Failure Deletes Local Spool Directory

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Wed Feb 23 12:53:43 GMT 2005


Okay, that makes sense to me and I want to change my local spool
location.  I try to change the spool directory on one of my machines by
running:

qconf -mconf <host>

and I get the message:

"Changing parameter "execd_spool_dir" only supported in a shut-down
cluster."

So I run:

qconf -ke all
qconf -mconf <host>

and I get the message:

"Changing parameter "execd_spool_dir" only supported in a shut-down
cluster."

So I run:

qconf -ks
qconf -km
qconf -mconf <host>

and I get the message:

unable to contact qmaster using port 461 on host "<hostname>"

So, how do I change this?

Dan
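
The marker-file test described in the quoted message below (a file placed
in /tmp to see whether it disappears along with the spool directory) could
be sketched roughly as follows. This is only an illustration: the marker
file name and the two function names are hypothetical, and /tmp/sgespool is
simply the spool path mentioned in this thread. If both the marker and the
spool directory vanish, something is probably cleaning /tmp as a whole; if
only the spool directory vanishes, something is targeting it specifically.

```shell
#!/bin/sh
# Sketch only: the marker name and function names are made up for this example.
# SPOOL_DIR defaults to the spool path named in the thread.
SPOOL_DIR="${SPOOL_DIR:-/tmp/sgespool}"
MARKER_NAME=".spool_watch_marker"

# Write a timestamped marker file into the parent of the spool directory.
place_marker() {
    parent=$(dirname "$1")
    date > "$parent/$MARKER_NAME"
}

# Print whether the spool directory and the marker file still exist.
check_spool() {
    parent=$(dirname "$1")
    if [ -d "$1" ]; then spool=present; else spool=missing; fi
    if [ -e "$parent/$MARKER_NAME" ]; then marker=present; else marker=missing; fi
    echo "spool=$spool marker=$marker"
}
```

One would call place_marker "$SPOOL_DIR" once on each execution host, and
then run check_spool "$SPOOL_DIR" after the next failure email arrives.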

On Tue, 2005-02-22 at 10:20, Reuti wrote:

> I usually create a directory /var/spool/sge and put the SGE stuff there. 
> /var seems a good place for this. - Reuti
> 
> Dan Gruhn wrote:
> > Interesting idea, but I don't see any cron jobs that do this.  As a 
> > test, I have made a file in each /tmp dir to see if that file disappears 
> > when this happens again.
> > 
> > Anyone have any other ideas?
> > 
> > Dan
> > 
> > On Tue, 2005-02-22 at 09:00, Reuti wrote:
> > 
> >>Hi,
> >>
> >>Maybe it wasn't done by SGE: is there a cron job running on the machine 
> >>that cleans /tmp from time to time?
> >>
> >>Cheers - Reuti
> >>
> >>
> >>Dan Gruhn wrote:
> >>> Greetings Everyone,
> >>> 
> >>> I am running SGE 6.0u3 on Fedora Core 1 and am seeing a strange failure 
> >>> mode.  I get an administration email like the following:
> >>> 
> >>> Subject:  	"N1GE 6.0u3: Job-array task 6574.262 failed "
> >>> 
> >>> 
> >>> Job 6574 caused action: Queue "low.q@class05-lx.group-w-inc.com" set to ERROR
> >>>  User        = dgruhn
> >>>  Queue       = low.q@class05-lx.group-w-inc.com
> >>>  Host        = class05-lx.group-w-inc.com
> >>>  Start Time  = <unknown>
> >>>  End Time    = <unknown>
> >>> failed assumedly before job:can't create directory active_jobs/6574.262: 
> >>> No such file or directory
> >>> 
> >>> 
> >>> 
> >>> When I look on the host, I see that the execution daemon is running just 
> >>> fine, but that my local spool directory (/tmp/sgespool in my case) is 
> >>> completely gone without a trace.  There is no /tmp/execd error file or 
> >>> anything.
> >>> 
> >>> These hosts are single processor, Pentium(R) 4 CPU 1.80GHz with 512 MB 
> >>> of RAM.  They are the least capable in my set of hosts.  The error 
> >>> doesn't happen a lot, but it has happened enough that I'd like to solve 
> >>> it if possible.  Of course, SGE recovers the job and runs it on another 
> >>> host, but that queue is out of action until I shut down the execution 
> >>> daemon and bring it back up.  It then recreates the local spool dir and 
> >>> all is well.
> >>> 
> >>> Has anyone else experienced this or have any idea what may be 
> >>> happening?  That is, what in SGE would delete the entire local spool 
> >>> directory tree but leave the executor running?
> >>> 
> >>> Any help will be greatly appreciated.
> >>> 
> >>> Dan
> >>> 
> >>> 
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>


