Opened 13 years ago

Last modified 11 years ago

#603 new defect

IZ2810: test in init script sgeexecd is inadequate

Reported by: mathog
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u10
Severity:
Keywords: PC Linux execution


[Imported from gridengine issuezilla]

        Issue #:      2810             Platform:     PC       Reporter: mathog (mathog)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      6.0u10      CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    pollinger (pollinger)
      QA Contact:     pollinger
       * Summary:     test in init script sgeexecd is inadequate
   Status whiteboard:

     Issue 2810 blocks:
   Votes for issue 2810:

   Opened: Wed Nov 26 10:04:00 -0700 2008 

The compute nodes on my cluster NFS mount the SGE distribution on /usr/SGE6.
So SGE_ROOT is /usr/SGE6.  If during boot this NFS mount has not completed by
the time sgeexecd reaches this section of code:

while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do
    count=`expr $count + 1`
    sleep 1
done

an error will occur.  Since /usr/SGE6 is a directory (it has to be one for NFS
to mount on it), the test passes even before the mount completes, and the
script goes on, only to fail later.  This problem showed up after an upgrade
from Mandriva 2007.1 to 2008.1, which apparently changed the boot sequence
timing somehow.  It took a while to find this because, by the time I could log
in, NFS had always mounted, so running sgeexecd manually always worked.

My fix was to change the test from "$SGE_ROOT" to "$SGE_ROOT/bin".  Before the
NFS mount completes, $SGE_ROOT is an empty directory, so the check for
"$SGE_ROOT/bin" fails before the mount and passes after it.
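
A minimal sketch of that change, written as a standalone function so it can be
tried outside the init script.  The function name wait_for_dir and its
max-attempts argument are illustrative assumptions, not part of sgeexecd; the
real script keeps the loop inline with a fixed limit of 120.

```shell
# Hypothetical helper (not in sgeexecd): poll once per second until
# directory $1 exists, giving up after $2 attempts.
# Leaves the number of polls performed in $count.
wait_for_dir() {
    dir=$1
    max=$2
    count=0
    while [ ! -d "$dir" ] && [ "$count" -lt "$max" ]; do
        count=`expr $count + 1`
        sleep 1
    done
}

# In the init script the call would look something like this
# (left commented out so the sketch does not block when sourced):
#   wait_for_dir "$SGE_ROOT/bin" 120
```

Passing the bare mount point (wait_for_dir "$SGE_ROOT" 120) returns immediately
whether or not NFS is mounted, which is the behaviour described above; passing
"$SGE_ROOT/bin" keeps polling until the mounted tree is visible.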

   ------- Additional comments from mathog Tue Dec 2 14:40:22 -0700 2008 -------
Note: my "fix" only works in the case where the NFS mount is a little delayed.
If it never comes through at all, the loop will execute 120 times, never
satisfy the test, and then try to start up SGE anyway, even though there is no
hope it will succeed.
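
One possible way to handle that case, sketched under the same assumptions as
before: have the wait report failure so the script can stop instead of starting
SGE against an empty directory.  The function name wait_for_dir_or_fail, the
error message, and the exit code are hypothetical; nothing like this exists in
the shipped sgeexecd.

```shell
# Hypothetical helper (not in sgeexecd): return 0 as soon as "$1" is a
# directory, or 1 if it is still missing after $2 one-second polls.
wait_for_dir_or_fail() {
    dir=$1
    max=$2
    tries=0
    while [ ! -d "$dir" ]; do
        if [ "$tries" -ge "$max" ]; then
            return 1
        fi
        tries=`expr $tries + 1`
        sleep 1
    done
    return 0
}

# Possible use in the init script (left commented out so the sketch
# does not block or exit when sourced):
#   if ! wait_for_dir_or_fail "$SGE_ROOT/bin" 120; then
#       echo "sgeexecd: $SGE_ROOT/bin not available, giving up" >&2
#       exit 1
#   fi
```

With this shape the timeout path and the success path are distinguishable, so
the caller decides whether to abort, retry, or log and continue.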
