[GE users] Job Failure Deletes Local Spool Directory

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Wed Feb 23 14:39:25 GMT 2005


Reuti,

I don't have such a file.  Is it because I'm using Berkeley DB?

Dan
On Wed, 2005-02-23 at 08:07, Reuti wrote:

> Do you have a different entry for each host? Otherwise you could edit 
> the entry (and only in this case!) by hand in 
> $SGE_ROOT/default/common/configuration
> 
> My entries there are:
> 
> qmaster_spool_dir         /var/spool/sge/qmaster
> execd_spool_dir           /var/spool/sge
> 
> Cheers - Reuti
> 
> Dan Gruhn wrote:
> > Okay, that makes sense to me and I want to change my local spool 
> > location.  I try to change the spool directory on one of my machines by 
> > running:
> > 
> > qconf -mconf <host>
> > 
> > and I get the message:
> > 
> > "Changing parameter "execd_spool_dir" only supported in a shut-down 
> > cluster."
> > 
> > So I run:
> > 
> > qconf -ke all
> > qconf -mconf <host>
> > 
> > and I get the message:
> > 
> > "Changing parameter "execd_spool_dir" only supported in a shut-down 
> > cluster."
> > 
> > So I run:
> > 
> > qconf -ks
> > qconf -km
> > qconf -mconf <host>
> > 
> > and I get the message:
> > 
> > unable to contact qmaster using port 461 on host "<hostname>"
> > 
> > So, how do I change this?
> > 
> > Dan
> > 
> > On Tue, 2005-02-22 at 10:20, Reuti wrote:
> > 
> >>I usually create a directory /var/spool/sge and put the SGE stuff there. 
> >>/var seems a good place for this. - Reuti
> >>
> >>Dan Gruhn wrote:
> >>> Interesting idea, but I don't see any cron jobs that do this.  As a 
> >>> test, I have made a file in each /tmp dir to see if that file disappears 
> >>> when this happens again.
> >>> 
> >>> Anyone have any other ideas?
> >>> 
> >>> Dan
> >>> 
> >>> On Tue, 2005-02-22 at 09:00, Reuti wrote:
> >>> 
> >>>>Hi,
> >>>>
> >>>>maybe it wasn't done by SGE: is there a cron job running on the machine 
> >>>>to clean the /tmp from time to time?
> >>>>
> >>>>Cheers - Reuti
> >>>>
> >>>>
> >>>>Dan Gruhn wrote:
> >>>>> Greetings Everyone,
> >>>>> 
> >>>>> I am using Fedora Core 1 to run 6.0u3 and have a strange failure mode.  
> >>>>> I get an administration email of the following:
> >>>>> 
> >>>>> Subject:  	"N1GE 6.0u3: Job-array task 6574.262 failed "
> >>>>> 
> >>>>> 
> >>>>> Job 6574 caused action: Queue "low.q@class05-lx.group-w-inc.com" set to ERROR
> >>>>>  User        = dgruhn
> >>>>>  Queue       = low.q@class05-lx.group-w-inc.com
> >>>>>  Host        = class05-lx.group-w-inc.com
> >>>>>  Start Time  = <unknown>
> >>>>>  End Time    = <unknown>
> >>>>> failed assumedly before job:can't create directory active_jobs/6574.262: 
> >>>>> No such file or directory
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> When I look on the host, I see that the execution daemon is running just 
> >>>>> fine, but that my local spool directory (/tmp/sgespool in my case) is 
> >>>>> completely gone without a trace.  There is no /tmp/execd error file or 
> >>>>> anything.
> >>>>> 
> >>>>> These hosts are single processor, Pentium(R) 4 CPU 1.80GHz with 512 MB 
> >>>>> of RAM.  They are the least capable in my set of hosts.  The error 
> >>>>> doesn't happen a lot, but it has happened enough that I'd like to solve 
> >>>>> it if possible.  Of course, SGE recovers the job and runs it on another 
> >>>>> host, but that queue is out of action until I shut down the execution 
> >>>>> daemon and bring it back up.  It then recreates the local spool dir and 
> >>>>> all is well.
> >>>>> 
> >>>>> Has anyone else experienced this or have any idea what may be 
> >>>>> happening?  That is, what in SGE would delete the entire local spool 
> >>>>> directory tree but leave the executor running?
> >>>>> 
> >>>>> Any help will be greatly appreciated.
> >>>>> 
> >>>>> Dan
> >>>>> 
> >>>>> 
> >>>>
> >>>>
> >>>>---------------------------------------------------------------------
> >>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>>
> >>
> >>
> >>
> 
> 
> 
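
The thread turns on where execd_spool_dir points and whether that location is somewhere a cleanup job might touch. A minimal sketch of that check follows, assuming the two-column "name value" layout Reuti quoted. The helper names spool_dir_from_conf and is_under_tmp are hypothetical and not part of SGE, and nothing in the thread confirms that a /tmp cleaner is the actual cause.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): read a configuration listing on
# stdin in the "name value" layout quoted in the thread and print the
# value of execd_spool_dir.
spool_dir_from_conf() {
    awk '$1 == "execd_spool_dir" { print $2; exit }'
}

# Hypothetical helper: succeed when the given path is /tmp or lies below it.
is_under_tmp() {
    case "$1" in
        /tmp|/tmp/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample input copied from Reuti's message; replace with real output.
sample_conf='qmaster_spool_dir         /var/spool/sge/qmaster
execd_spool_dir           /var/spool/sge'

dir=$(printf '%s\n' "$sample_conf" | spool_dir_from_conf)

if is_under_tmp "$dir"; then
    echo "WARNING: $dir is under /tmp and may be removed by a cleanup job"
else
    echo "OK: $dir is outside /tmp"
fi
```

With the sample input this prints "OK: /var/spool/sge is outside /tmp". On a live cluster the listing from qconf -sconf <host> should be usable as input in place of the sample text, assuming it has the same layout.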
