Opened 14 years ago

Last modified 9 years ago

#281 new defect

IZ1830: Spool directory missing contents

Reported by: js631 Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u6
Severity: Keywords: PC Linux execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1830]

Issue #: 1830
Platform: PC
Reporter: js631 (js631)
Component: gridengine
OS: Linux
Subcomponent: execution
Version: 6.0u6
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: roland (roland)
QA Contact: pollinger
URL:
Summary: Spool directory missing contents
Status whiteboard:
Attachments:
Issue 1830 blocks:
Votes for issue 1830:


   Opened: Tue Oct 11 15:22:00 -0700 2005 
------------------------


Normally, the local spool directory on the node should contain the following:

[jsu2@node18 node14]$ pwd
/var/tmp/spool/node18
[jsu2@node18 node14]$ ls -al
total 28
drwxr-xr-x    5 sgeadmin department     4096 Jul  8 15:46 .
drwxr-xr-x    3 sgeadmin department     4096 Jul  8 12:03 ..
drwxr-xr-x    2 sgeadmin department     4096 Jul  8 15:47 active_jobs
-rw-r--r--    1 sgeadmin department        5 Jul  8 15:46 execd.pid
drwxr-xr-x    2 sgeadmin department     4096 Jul  8 15:47 jobs
drwxr-xr-x    2 sgeadmin department     4096 Jul  8 15:47 job_scripts
-rw-r--r--    1 sgeadmin department     4008 Jul  8 15:46 messages

However, for some reason the spool directory on my node is missing all of those
contents except the messages file, which looks as follows:

09/28/2005 11:11:32|execd|node14|E|shepherd of job 2129.1 exited with exit status = 10
09/28/2005 11:11:32|execd|node14|W|reaping job "2129" ptf complains: Job does not exist
10/04/2005 14:02:08|execd|node14|E|can't start job "2150": can't create directory active_jobs/2150.1: No such file or directory
10/04/2005 14:02:14|execd|node14|E|acknowledge for unknown job 2150.1/master
10/04/2005 14:02:14|execd|node14|E|can't find active jobs directory "active_jobs/2150.1" for reaping job 2150
10/04/2005 14:02:14|execd|node14|E|ERROR: unlinking "jobs/00/0000/2150.1": No such file or directory
10/04/2005 14:02:14|execd|node14|E|can not remove file job spool file: jobs/00/0000/2150.1
10/04/2005 14:02:14|execd|node14|E|can't remove directory "active_jobs/2150.1": opendir(active_jobs/2150.1) failed: No such file or directory
10/11/2005 08:45:25|execd|node14|E|shepherd of job 1843.1 exited with exit status = 11
10/11/2005 08:45:25|execd|node14|E|can't find directory active_jobs/1843.1 for reaping job 1843.1
10/11/2005 08:45:25|execd|node14|E|can't remove directory "active_jobs/1843.1": opendir(active_jobs/1843.1) failed: No such file or directory
10/11/2005 08:45:25|execd|node14|E|recursive rmdir(/tmp/1843.1.all.q): opendir(/tmp/1843.1.all.q) failed: No such file or directory
10/11/2005 08:45:25|execd|node14|E|ERROR: unlinking "jobs/00/0000/1843.1": No such file or directory
10/11/2005 08:45:25|execd|node14|E|can not remove file job spool file: jobs/00/0000/1843.1
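
To pick out the entries that point at the missing directories, something like the following might help. It is only a sketch: it assumes the pipe-separated layout seen in the excerpt above (timestamp|daemon|host|level|text), and the names Entry, parse_messages and SAMPLE are made up for illustration. Lines without that layout are treated as continuations of the previous entry, in case the excerpt was wrapped when pasted.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """One record from the execd messages file (field names are assumptions)."""
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


def parse_messages(lines: Iterable[str]) -> List[Entry]:
    """Split 'timestamp|daemon|host|level|text' lines into Entry records."""
    entries: List[Entry] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("|", 4)
        if len(parts) == 5:
            entries.append(Entry(*parts))
        elif entries:
            # Not a full record: assume it continues the previous entry.
            last = entries[-1]
            entries[-1] = last._replace(text=last.text + " " + line.strip())
    return entries


# Two entries copied from the excerpt above; the first one is wrapped.
SAMPLE = [
    "10/04/2005 14:02:08|execd|node14|E|can't start job \"2150\": can't create",
    "directory active_jobs/2150.1: No such file or directory",
    "10/04/2005 14:02:14|execd|node14|E|acknowledge for unknown job 2150.1/master",
]

entries = parse_messages(SAMPLE)
missing = [e for e in entries if "No such file or directory" in e.text]
for e in missing:
    print(e.timestamp, e.host, e.text)
```

Filtering on the literal string "No such file or directory" is a crude heuristic; it matches the errors quoted in this report but has not been checked against other execd messages.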

As a workaround, I recreate those subdirectories, restart the sgeexecd process
on that node, and clear the error state of the queue using qmod.
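
The directory part of that workaround could be scripted roughly as below. This is a minimal sketch, not a fix for the underlying problem: the subdirectory names are taken from the listing of the healthy node above, the 0755 mode mirrors that listing, and the function name restore_spool_subdirs is hypothetical. Restarting sgeexecd and clearing the queue error with qmod are left as manual steps.

```python
import os
from typing import List

# Subdirectories seen in the listing of a healthy node's spool directory.
EXPECTED_SUBDIRS = ("active_jobs", "jobs", "job_scripts")


def restore_spool_subdirs(spool_dir: str) -> List[str]:
    """Create any expected subdirectory missing under spool_dir.

    Returns the names that had to be created. Ownership is not changed,
    so this is assumed to run as the user that owns the spool directory
    (sgeadmin in the listing above).
    """
    created = []
    for name in EXPECTED_SUBDIRS:
        path = os.path.join(spool_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path, mode=0o755)  # final mode is still subject to umask
            created.append(name)
    return created
```

For example, restore_spool_subdirs("/var/tmp/spool/node18") would return an empty list on the healthy node shown above, and the names of the recreated subdirectories on a node in the broken state.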

   ------- Additional comments from roland Thu Oct 13 01:06:06 -0700 2005 -------
Changed subcomponent to execution.

