[GE users] error writing to file "job_scripts/25024" : No such file or directory

danielgoolsby danielgoolsby at gmail.com
Fri Feb 5 20:03:12 GMT 2010


I have a deployment of about 50 servers (~400 slots), and I seem to be
getting this error more and more often.

A user submits a job and it sits in the queue, but for some reason it
errors out on a few nodes along the way before finally finding a host
where it can start.

I then have to go in and clear the error state on the queue so that
other people can submit jobs to it (with 'qmod -cq queuename.q').

If I run 'qstat -j <job #>', I get these error reasons:

error reason    1:          error writing to file "job_scripts/25024": No such file or directory
                1:          error writing to file "job_scripts/25024": No such file or directory
                1:          error writing to file "job_scripts/25024": No such file or directory
                1:          error writing to file "job_scripts/25024": No such file or directory
                1:          error writing to file "job_scripts/25024": No such file or directory
scheduling info:            queue instance "big.q@node1" dropped because it is disabled
                            queue instance "big.q@node2" dropped because it is disabled

etc...

But the job finds a host and starts to run.  I've been getting these
more often, but haven't figured out why.

If I 'ls -l' on the execd_spool_dir I get something that looks like
this:

[root@node1 ~]# ls -l /tmp/gridengine/node1/
total 20
drwxr-xr-x 3 root root 4096 Feb  5 10:12 active_jobs
-rw-r--r-- 1 root root    5 Feb  3 14:34 execd.pid
drwxr-xr-x 3 root root 4096 Feb  5 10:12 jobs
drwxr-xr-x 2 root root 4096 Feb  5 10:12 job_scripts
-rw-r--r-- 1 root root 2228 Feb  3 14:34 messages

Whereas on a 'broken' host, I get this:

[root@node3 cab103]# ls -l
total 16
drwxr-xr-x 2 root root 4096 Feb  5 10:12 active_jobs
drwxr-xr-x 2 root root 4096 Feb  5 10:12 jobs
-rw-r--r-- 1 root root 4394 Feb  3 15:30 messages
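
The comparison between the two listings above could be automated with
something like the following. This is only a rough sketch: the function
name check_spool_dir is made up for illustration, and the list of
expected entries (active_jobs, jobs, job_scripts, execd.pid) is simply
taken from the 'ls -l' output on the working host, not from any Grid
Engine documentation.

```shell
#!/bin/sh
# Hypothetical helper: report which entries seen on the working host
# are missing from an execd spool directory.
# Usage: check_spool_dir /path/to/execd_spool_dir/hostname
check_spool_dir() {
    dir=$1
    missing=""
    # Directories present in the listing from the working host.
    for d in active_jobs jobs job_scripts; do
        [ -d "$dir/$d" ] || missing="$missing $d"
    done
    # The pid file present in the listing from the working host.
    [ -f "$dir/execd.pid" ] || missing="$missing execd.pid"
    if [ -n "$missing" ]; then
        echo "$dir: missing$missing"
        return 1
    fi
    echo "$dir: ok"
}
```

Run on each node against its own spool directory (for example
'check_spool_dir /tmp/gridengine/node1'), it returns non-zero on hosts
that look like the 'broken' one, so it could be called from cron or a
monitoring check to catch the problem before jobs land there.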

Does anyone know why execd.pid or the job_scripts directory would be
deleted?  I can understand the job_scripts directory being deleted,
but not execd.pid.

Or I could be looking at the wrong information.. who knows..

Can anyone help?

Daniel

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243550
