[GE users] error writing to file "job_scripts/25024" : No such file or directory

danielgoolsby danielgoolsby at gmail.com
Tue Feb 9 00:58:27 GMT 2010



Sure enough, there was a 'tmpwatch' cron job in /etc/cron.daily that was deleting the execd.pid file and one of the directories.

I re-ran inst_sge -x and corrected it.  I couldn't find a config file that had the tmp directory in it, though; where is that information stored?
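[For reference: the local spool directory is not kept in a flat file on the node; it is the execd_spool_dir parameter of the cluster configuration, which can be inspected and edited with qconf. A minimal sketch, assuming a working SGE environment ($SGE_ROOT sourced) and an example host name node1:]

```shell
# Show the global cluster configuration and look for the spool setting
qconf -sconf | grep execd_spool_dir

# A host-local configuration (if one exists) can override the global value
qconf -sconf node1 | grep execd_spool_dir

# Edit the global configuration in $EDITOR (change execd_spool_dir here)
qconf -mconf
```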

Daniel

On Sun, Feb 7, 2010 at 4:49 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
Hi,

Am 05.02.2010 um 21:03 schrieb danielgoolsby:

> I have about a 50 server (~400 slot) implementation, and I seem to be
> getting this error more often.
>
> A user would submit a job, the job would be in the queue, but for some
> reason it would error out on a few nodes in the process before finally
> finding a host where it can start.
>
> I then have to go in and clear the error state so that other people
> can submit jobs to the queue (with a 'qmod -cq queuename.q').
>
> If I do 'qstat -j <job #>', I'll get these error reasons:
>
> error reason    1:          error writing to file "job_scripts/25024":
> No such file or directory
>                 1:          error writing to file "job_scripts/25024":
> No such file or directory
>                 1:          error writing to file "job_scripts/25024":
> No such file or directory
>                 1:          error writing to file "job_scripts/25024":
> No such file or directory
>                 1:          error writing to file "job_scripts/25024":
> No such file or directory
> scheduling info:            queue instance "big.q at node1" dropped
>                             because it is disabled
>                             queue instance "big.q at node2" dropped
>                             because it is disabled
>
> etc...
>
> But the job finds a host and starts to run.  I've been getting these
> more often, but haven't figured out why.
>
> If I 'ls -l' on the execd_spool_dir I get something that looks like
> this:
>
> [root at node1 ~]# ls -l /tmp/gridengine/node1/

in many Linux distributions a cron job removes outdated files and
directories from /tmp by default. Can you adjust your setup to use
a directory like /var/spool/sge for the local spool files of SGE?

-- Reuti
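
[Two complementary fixes, sketched under the assumption of a RHEL-style tmpwatch cron script and an SGE admin user named sgeadmin (adjust names and init-script paths for your installation): move the local spool out of /tmp, or exclude the SGE tree from tmpwatch.]

```shell
# Option 1: move the local spool out of /tmp.
# First set execd_spool_dir to /var/spool/sge via 'qconf -mconf'
# (globally or per host), then on each execution host:
mkdir -p /var/spool/sge
chown sgeadmin /var/spool/sge          # use your SGE admin user here
# Restart the execution daemon so it re-creates its spool tree
# (init script name/location varies by installation):
/etc/init.d/sgeexecd stop && /etc/init.d/sgeexecd start

# Option 2: keep /tmp, but add an exclude flag for the SGE directory
# to the tmpwatch invocation in /etc/cron.daily/tmpwatch, e.g.:
#   /usr/sbin/tmpwatch -x /tmp/gridengine ... 240 /tmp
```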


> total 20
> drwxr-xr-x 3 root root 4096 Feb  5 10:12 active_jobs
> -rw-r--r-- 1 root root    5 Feb  3 14:34 execd.pid
> drwxr-xr-x 3 root root 4096 Feb  5 10:12 jobs
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 job_scripts
> -rw-r--r-- 1 root root 2228 Feb  3 14:34 messages
>
> Whereas on a 'broken' host, I get this:
>
> [root at node3 cab103]# ls -l
> total 16
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 active_jobs
> drwxr-xr-x 2 root root 4096 Feb  5 10:12 jobs
> -rw-r--r-- 1 root root 4394 Feb  3 15:30 messages
>
> Anyone have any idea why the execd.pid file or the job_scripts
> directory would be deleted?  I can understand the job_scripts dir
> being deleted, but not the execd.pid.
>
> Or I could be looking at the wrong information.. who knows..
>
> Can anyone help?
>
> Daniel
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243550
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=243810

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



--
daniel

