Opened 13 years ago
Last modified 10 years ago
#498 new enhancement
IZ2537: sge_shepherd waits forever at job start when automountd hangs
Reported by: | wig | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.1u3 |
Severity: | | Keywords: | Linux execution |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2537]
Issue #: 2537
Platform: All
Reporter: wig (wig)
Component: gridengine
OS: Linux
Subcomponent: execution
Version: 6.1u3
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: ENHANCEMENT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
* Summary: sge_shepherd waits forever at job start when automountd hangs
Status whiteboard:
Attachments:
Issue 2537 blocks:
Votes for issue 2537: 1
Opened: Thu Mar 27 12:45:00 -0700 2008
------------------------

Hello,

today we had the strange problem that the automountd on Red Hat 3 (OS: Linux muc-ax7x 2.4.21-32.ELsmp) on an execution host was blocked, and therefore the user's home never got mounted. What I saw were two processes being started by sge_execd:

    ...
    grid_m 11236 18560 0 Mar26 ? 00:00:00 sge_shepherd-776746 -bg
    foo 11237 11236 0 Mar26 ? 00:00:00 sge_shepherd-776746 -bg
    ...

The latter one was waiting (for more than 12 hours at least):

    # strace -p 11237
    Process 11237 attached - interrupt to quit
    --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
    --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
    chdir("/home/foo" <unfinished ...>

The reason was that the automountd for /home was blocked, waiting for a futex:

    # strace -p 18515
    Process 18515 attached - interrupt to quit
    futex(0x2a95adb9b0, FUTEX_WAIT, 2, NULL

Fortunately the user was expecting a GUI to come up; otherwise he would have waited forever.

Would it be possible to implement a timeout around the chdir and kill the sge_shepherd after a reasonable amount of time, exiting with some error status and flagging all queue instances on that host with 'E'?

Usually in our environment I have lots of checks for appropriate automount and NFS behaviour (because there are a lot of issues), but this special situation IMHO cannot be detected: apart from not being able to mount the not-yet-mounted /home/*, the exec host worked absolutely o.k.

Thanks, bye
Wilfried

P.S.: We run GE6.1u3 from gridengine.sunsource.net, qmaster on Linux/RHE4 systems, exec hosts are RHE4, RHE3 and Solaris 8.
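
The requested timeout around the chdir could look roughly like the sketch below. It is not Grid Engine code (the shepherd is written in C), only an illustration of the idea in Python: the potentially blocking chdir runs in a child process, and the parent gives up after a deadline instead of blocking on the mount itself. The function name `chdir_probe`, its return values and the timeout used are hypothetical choices made for this example.

```python
import subprocess
import sys
import tempfile


def chdir_probe(path, timeout_s):
    """Try chdir(path) in a child process; return 'ok', 'failed' or 'timeout'.

    Hypothetical helper for illustration; not part of Grid Engine.
    """
    # The child performs the chdir, so a hung automounter blocks only the child.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.chdir(sys.argv[1])", path],
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a child in uninterruptible sleep may not die at once,
        # so the parent deliberately does not wait for it a second time.
        child.kill()
        return "timeout"
    return "ok" if rc == 0 else "failed"


if __name__ == "__main__":
    # An existing local directory is expected to report 'ok'.
    print(chdir_probe(tempfile.gettempdir(), 10))
```

A caller could treat "timeout" as a reason to fail the job with an error status and put the queue instances on that host into the error state; that part depends on Grid Engine internals and is not shown here.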