[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

ah_sunsource ahaupt at ifh.de
Thu Mar 25 07:19:53 GMT 2010


Hi *,

Yesterday afternoon our SGE master started segfaulting again and again,
"out of the blue". No changes have been made to the configuration for
weeks... Has anyone else already seen this? Here is the output of
dmesg:

[...]
sge_qmaster[2206]: segfault at 0000000046251000 rip 0000003dfac54e17 rsp 000000004624cf80 error 6
sge_qmaster[2947]: segfault at 0000000045778000 rip 0000003dfac54e17 rsp 0000000045773f80 error 6
sge_qmaster[3677]: segfault at 00000000477b0000 rip 0000003dfac54e17 rsp 00000000477abf80 error 6
sge_qmaster[3904]: segfault at 0000000045bce000 rip 0000003dfac54e17 rsp 0000000045bc9f80 error 6
sge_qmaster[5089]: segfault at 0000000045939000 rip 0000003dfac54e17 rsp 0000000045934f80 error 6
sge_qmaster[6707]: segfault at 000000004691e000 rip 0000003dfac54e17 rsp 0000000046919f80 error 6
sge_qmaster[9552]: segfault at 0000000045e6d000 rip 0000003dfac54e17 rsp 0000000045e68f80 error 6
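
For reference, the fields in these dmesg lines can be pulled apart with a
short script. This is only a minimal sketch in Python. It assumes the usual
x86 page-fault error-code bits (bit 0: protection violation, bit 1: write
access, bit 2: user mode) and that the kernel prints the error code in hex.
The names parse_segfault and decode_error are made up for illustration.

```python
import re

# Matches kernel lines of the form shown above, e.g.
# sge_qmaster[2206]: segfault at 0000000046251000 rip ... rsp ... error 6
SEGFAULT_RE = re.compile(
    r"(?P<name>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 1),  # 0 means page not present
        "write": bool(code & 2),                 # 0 means read access
        "user_mode": bool(code & 4),             # 0 means kernel mode
    }


def parse_segfault(line):
    """Return the parsed fields of one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    rsp = int(m.group("rsp"), 16)
    return {
        "name": m.group("name"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "rip": int(m.group("rip"), 16),
        "rsp": rsp,
        # Distance between the faulting address and the stack pointer.
        "addr_minus_rsp": addr - rsp,
        "error": decode_error(int(m.group("err"), 16)),
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg | python parse_segfault.py
    for raw in sys.stdin:
        info = parse_segfault(raw)
        if info is not None:
            print(info["name"], info["pid"], hex(info["rip"]),
                  hex(info["addr_minus_rsp"]), info["error"])
```

For the seven lines above, error 6 decodes as a user-mode write to a page
that is not mapped, and the faulting address is 0x4080 bytes above rsp in
every case, with the same rip each time.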

It's a self-compiled gridengine 6.2u5 running on x86_64 Scientific Linux
5.4. There's nothing in the gridengine logs (log level: info). Although
it's probably not very useful, here's the strace output of such a dying
sge_qmaster:

[root@tcbatch0 ~]# strace -p 3886
Process 3886 attached - interrupt to quit
futex(0x87582c, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EINTR (Interrupted system call)
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---
setresgid(-1, 0, -1)                    = 0
futex(0x45bcc970, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2b9c8bc3ef0c, FUTEX_WAKE_PRIVATE, 1) = 0
rt_sigreturn(0x2b9c8bc3ef0c)            = -1 EINTR (Interrupted system call)
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---
setresuid(-1, 0, -1)                    = 0
futex(0x45bcc970, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2b9c8bc3ef0c, FUTEX_WAKE_PRIVATE, 1) = 0
rt_sigreturn(0x2b9c8bc3ef0c)            = -1 EINTR (Interrupted system call)
futex(0x87582c, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EINTR (Interrupted system call)
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---
setresgid(-1, 987, -1)                  = 0
futex(0x45bcc970, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2b9c8bc3ef0c, FUTEX_WAKE_PRIVATE, 1) = 0
rt_sigreturn(0x2b9c8bc3ef0c)            = -1 EINTR (Interrupted system call)
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---
setresuid(-1, 987, -1)                  = 0
futex(0x45bcc970, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2b9c8bc3ef0c, FUTEX_WAKE_PRIVATE, 1) = 0
rt_sigreturn(0x2b9c8bc3ef0c)            = -1 EINTR (Interrupted system call)
futex(0x87582c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
+++ killed by SIGSEGV +++

Currently the problem seems to have gone away again. But I really want to
avoid something like this in the future... ;-) Is there anything I could
change to get at least more useful debug output?

Thanks & Cheers,
Andreas
-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=251291