[GE users] error writing to file "job_scripts/25024" : No such file or directory

reuti reuti at staff.uni-marburg.de
Sun Feb 7 10:49:07 GMT 2010


Hi,

On 05.02.2010 at 21:03, danielgoolsby wrote:

> I have about a 50 server (~400 slot) implementation, and I seem to be
> getting this error more often.
>
> A user submits a job and it sits in the queue, but for some reason it
> puts queue instances on a few nodes into the error state along the way
> before finally finding a host where it can start.
>
> I then have to go in and clear the error state of the queue so that other
> people can submit jobs to it (with a 'qmod -cq queuename.q').
>
> If I do 'qstat -j <job #>', I get these error reasons:
>
> error reason    1:          error writing to file "job_scripts/25024": No such file or directory
>                 1:          error writing to file "job_scripts/25024": No such file or directory
>                 1:          error writing to file "job_scripts/25024": No such file or directory
>                 1:          error writing to file "job_scripts/25024": No such file or directory
>                 1:          error writing to file "job_scripts/25024": No such file or directory
> scheduling info:            queue instance "big.q@node1" dropped because it is disabled
>                             queue instance "big.q@node2" dropped because it is disabled
>
> etc...
>
> But the job finds a host and starts to run.  I've been getting these
> more often, but haven't figured out why.
>
> If I 'ls -l' on the execd_spool_dir I get something that looks like
> this:
>
> [root@node1 ~]# ls -l /tmp/gridengine/node1/

In many Linux distributions, a cron job removes outdated files and
directories from /tmp by default. Can you adjust your setup to use a
directory such as /var/spool/sge for SGE's local spool files?
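
As a rough way to check for this, here is a minimal shell sketch. Both function names are hypothetical helpers, not part of SGE. The list of "risky" prefixes (/tmp and /var/tmp) is an assumption about what a tmp-cleaning cron job typically covers on your distribution, and the list of expected entries is simply taken from the healthy-host listing quoted below.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): succeed if the given spool
# directory lies under a location that tmp-cleaning cron jobs commonly purge.
# The prefixes are an assumption; adjust them to your distribution.
spool_dir_at_risk() {
    case "$1" in
        /tmp|/tmp/*|/var/tmp|/var/tmp/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print each entry seen on the healthy host
# that is absent from the given execd spool directory.
missing_spool_entries() {
    for entry in active_jobs jobs job_scripts execd.pid; do
        [ -e "$1/$entry" ] || echo "$entry"
    done
}
```

For example, `spool_dir_at_risk /tmp/gridengine/node1 && echo "at risk"` would flag the directory from the listing, and `missing_spool_entries /tmp/gridengine/node3` could be run on a host that looks broken. On a live cluster, the configured value can be read with `qconf -sconf | grep execd_spool_dir` (or `qconf -sconf <hostname>` for a host-local setting). As far as I know, a changed execd_spool_dir only takes effect after the execd on the affected hosts has been restarted.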

-- Reuti


> total 20
> drwxr-xr-x 3 root root 4096 Feb  5 10:12 active_jobs
> -rw-r--r-- 1 root root    5 Feb  3 14:34 execd.pid
> drwxr-xr-x 3 root root 4096 Feb  5 10:12 jobs
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 job_scripts
> -rw-r--r-- 1 root root 2228 Feb  3 14:34 messages
>
> Whereas on a 'broken' host, I get this:
>
> [root@node3 cab103]# ls -l
> total 16
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 active_jobs
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 jobs
> -rw-r--r-- 1 root root 4394 Feb  3 15:30 messages
>
> Does anyone know why the execd.pid file or the job_scripts directory
> would be deleted?  I can understand the job_scripts directory being
> deleted, but not execd.pid.
>
> Or I could be looking at the wrong information.. who knows..
>
> Can anyone help?
>
> Daniel
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243550
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243810

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


