[GE users] Job Failure Deletes Local Spool Directory

Reuti reuti at staff.uni-marburg.de
Tue Feb 22 15:20:51 GMT 2005


I usually create a directory /var/spool/sge and put the SGE spool 
directories there. /var seems like a good place for this. - Reuti
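
As a rough illustration of the reasoning above, here is a minimal Python sketch that flags a spool directory located beneath a directory that cleanup jobs commonly purge. The list of "volatile" directories and the function names are assumptions made for this example, not anything taken from SGE, and nothing in this thread confirms that a cleanup job is the actual cause.

```python
import os

# Directories that periodic cleanup jobs commonly purge.
# This list is an assumption for illustration; adjust it for the local system.
VOLATILE_DIRS = ("/tmp", "/var/tmp")


def is_under(path, parent):
    """Return True if path is parent itself or lies somewhere below it."""
    path = os.path.normpath(path)
    parent = os.path.normpath(parent)
    # commonpath compares whole path components, so /tmpfoo is not under /tmp.
    return os.path.commonpath([path, parent]) == parent


def spool_dir_at_risk(spool_dir, volatile_dirs=VOLATILE_DIRS):
    """Return True if spool_dir sits inside a directory that may be cleaned."""
    return any(is_under(spool_dir, d) for d in volatile_dirs)


if __name__ == "__main__":
    # The two locations mentioned in this thread.
    for candidate in ("/tmp/sgespool", "/var/spool/sge"):
        status = "at risk" if spool_dir_at_risk(candidate) else "ok"
        print(candidate, status)
```

The check only looks at the path, not at what is actually running on the host, so it can suggest where to look but cannot prove what removed the directory.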

Dan Gruhn wrote:
> Interesting idea, but I don't see any cron jobs that do this.  As a 
> test, I have made a file in each /tmp dir to see if that file disappears 
> when this happens again.
> 
> Anyone have any other ideas?
> 
> Dan
> 
> On Tue, 2005-02-22 at 09:00, Reuti wrote:
> 
>>Hi,
>>
>>maybe it wasn't done by SGE: is there a cron job running on the machine 
>>that cleans /tmp from time to time?
>>
>>Cheers - Reuti
>>
>>
>>Dan Gruhn wrote:
>>> Greetings Everyone,
>>> 
>>> I am using Fedora Core 1 to run 6.0u3 and have a strange failure mode.  
>>> I get an administration email of the following:
>>> 
>>> Subject:  	"N1GE 6.0u3: Job-array task 6574.262 failed "
>>> 
>>> 
>>> Job 6574 caused action: Queue "low.q at class05-lx.group-w-inc.com" set to ERROR
>>>  User        = dgruhn
>>>  Queue       = low.q at class05-lx.group-w-inc.com
>>>  Host        = class05-lx.group-w-inc.com
>>>  Start Time  = <unknown>
>>>  End Time    = <unknown>
>>> failed assumedly before job:can't create directory active_jobs/6574.262: 
>>> No such file or directory
>>> 
>>> 
>>> 
>>> When I look on the host, I see that the execution daemon is running just 
>>> fine, but that my local spool directory (/tmp/sgespool in my case) is 
>>> completely gone without a trace.  There is no /tmp/execd error file or 
>>> anything.
>>> 
>>> These hosts are single processor, Pentium(R) 4 CPU 1.80GHz with 512 MB 
>>> of RAM.  They are the least capable in my set of hosts.  The error 
>>> doesn't happen a lot, but it has happened enough that I'd like to solve 
>>> it if possible.  Of course, SGE recovers the job and runs it on another 
>>> host, but that queue is out of action until I shut down the execution 
>>> daemon and bring it back up.  It then recreates the local spool dir and 
>>> all is well.
>>> 
>>> Has anyone else experienced this or have any idea what may be 
>>> happening?  That is, what in SGE would delete the entire local spool 
>>> directory tree but leave the executor running?
>>> 
>>> Any help will be greatly appreciated.
>>> 
>>> Dan
>>> 
>>> 
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>

