Opened 14 years ago

Last modified 11 years ago

#498 new enhancement

IZ2537: sge_shepherd waits forever at job start when automountd hangs

Reported by: wig
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.1u3
Severity:
Keywords: Linux execution


[Imported from gridengine issuezilla]

   Issue #:            2537
   Platform:           All
   Reporter:           wig (wig)
   Component:          gridengine
   OS:                 Linux
   Subcomponent:       execution
   Version:            6.1u3
   CC:                 None defined
   Status:             NEW
   Priority:           P3
   Resolution:
   Issue type:         ENHANCEMENT
   Target milestone:   ---
   Assigned to:        pollinger (pollinger)
   QA Contact:         pollinger
   Summary:            sge_shepherd waits forever at job start when automountd hangs
   Status whiteboard:

   Issue 2537 blocks:
   Votes for issue 2537:  1

   Opened: Thu Mar 27 12:45:00 -0700 2008 


Today we had a strange problem: the automountd on Red Hat 3
(OS: Linux muc-ax7x 2.4.21-32.ELsmp) on an execution host was blocked, and
therefore the user's home directory never got mounted.

What I saw were two processes being started by sge_execd:
grid_m   11236 18560  0 Mar26 ?        00:00:00 sge_shepherd-776746 -bg
foo      11237 11236  0 Mar26 ?        00:00:00 sge_shepherd-776746 -bg

The latter one had been waiting for more than 12 hours:
# strace -p 11237
Process 11237 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
chdir("/home/foo" <unfinished ...>

The reason was that the automountd for /home was blocked, waiting on
a futex:
# strace -p 18515
Process 18515 attached - interrupt to quit
futex(0x2a95adb9b0, FUTEX_WAIT, 2, NULL

Fortunately, the user was expecting a GUI to come up; otherwise
he would have waited forever.

Would it be possible to implement a timeout around the chdir and kill the
sge_shepherd after a reasonable amount of time, exiting with some error status
and flagging all queue instances on that host with 'E'?
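
To illustrate the kind of timeout I mean, here is a minimal sketch in Python (not the C of sge_shepherd, and not taken from the Grid Engine sources). It runs the chdir in a throwaway child process, so that a hung automounter blocks only the child, and gives up after a deadline. The names `probe_chdir` and `_PROBE` and the 30-second default are hypothetical, chosen only for this example.

```python
import subprocess
import sys

# Hypothetical helper, not part of Grid Engine: the child process only
# attempts the chdir, so a hung automounter blocks the child, not the caller.
_PROBE = "import os, sys; os.chdir(sys.argv[1])"


def probe_chdir(path, timeout_s=30.0):
    """Return True if a child process could chdir(path) within timeout_s."""
    child = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the chdir succeeded; non-zero means it failed
        # quickly (for example, the directory does not exist).
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # The child may be stuck inside the kernel. Send SIGKILL but do not
        # wait for it, otherwise the caller could hang as well.
        child.kill()
        return False
```

A caller could run `probe_chdir("/home/foo")` before the real chdir and, on False, exit with an error status instead of blocking. Whether a process stuck in chdir on a hung autofs mount can be killed at all depends on the kernel, so the sketch deliberately does not wait for the child after the timeout.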

Usually in our environment I have lots of checks for appropriate automount
and NFS behaviour (because there are a lot of issues),
but this special situation IMHO cannot be detected; apart from
being unable to mount not-yet-mounted /home/* directories, the exec host worked absolutely fine.

Thanks, bye
P.S.: We run GE6.1u3 from, qmaster on Linux/RHE4
systems, exec hosts are RHE4, RHE3 and Solaris 8.
