#749 closed defect (duplicate)
IZ3194: sge_shepherd segfault on OpenSuSE 11.2 (x86_64)
| Reported by: | megware | Owned by: | |
|---|---|---|---|
| Priority: | highest | Milestone: | |
| Component: | sge | Version: | current |
| Severity: | minor | Keywords: | PC Linux execution |
| Cc: | | | |
Description (last modified by admin)
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3194]
Issue #: 3194
Platform: PC
Reporter: megware (megware)
Component: gridengine
OS: Linux
Subcomponent: execution
Version: current
CC: None defined
Status: REOPENED
Priority: P1
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: shaas (shaas)
QA Contact: pollinger
URL:
Summary: sge_shepherd segfault on OpenSuSE 11.2 (x86_64)
Status whiteboard:
Attachments:
- Mon Jan 25 06:27:00 -0700 2010: shepherd.strace.9389 - strace -f -ff sge_shepherd, main process (text/plain), submitted by megware
- Mon Jan 25 06:28:00 -0700 2010: shepherd.strace.9390 - strace -f -ff sge_shepherd, forked child process (text/plain), submitted by megware
Issue 3194 blocks:
Votes for issue 3194: 19
Opened: Thu Nov 26 07:25:00 -0700 2009
------------------------
The sge_shepherd seems to die instantaneously at job startup. When I run qsub (i.e. echo "sleep 10" | qsub) and watch the scheduling with qstat, the job goes to the running state ('r') and disappears from the job list right afterwards. At that moment the qmaster log always shows lines like this:

11/26/2009 15:01:38|worker|frontend1|I|removing trigger to terminate job 6.1
11/26/2009 15:01:38|worker|frontend1|W|job 6.1 failed on host node11.service assumedly after job because: job 6.1 died through signal SEGV (11)

Checking node11.service reveals this in dmesg (one line appears per started job):

sge_shepherd[16061]: segfault at 2b9670000000 ip 00002b9673450939 sp 00007fff37f61d40 error 4 in libc-2.10.1.so[2b96733d9000+151000]

------- Additional comments from megware Fri Nov 27 02:40:06 -0700 2009 -------
This one is quite important to me (and I hope I am not the only one using OpenSuSE 11.2?). Please let me know if you need additional input and/or want me to try something.
cheers, stephan

------- Additional comments from shaas Fri Nov 27 03:14:27 -0700 2009 -------
I'll take care of it!
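As a side note, the dmesg line above already locates the fault inside libc: the instruction pointer minus the library's load address gives the file offset one could feed to addr2line or objdump against the matching libc build. A minimal sketch using the addresses copied from the report (the offset is illustrative, not a diagnosis):

```shell
# addresses copied from the dmesg line in the report
ip=0x2b9673450939        # faulting instruction pointer
base=0x2b96733d9000      # load address of libc-2.10.1.so
printf 'offset into libc: 0x%x\n' $(( ip - base ))
# prints: offset into libc: 0x77939
```

With a debuginfo package for glibc installed, `addr2line -e /lib64/libc-2.10.1.so 0x77939` would name the crashing function.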
------- Additional comments from shaas Fri Nov 27 03:19:35 -0700 2009 -------
This error seems to be related to issue 3193 and issue 3192, which are already fixed in maintrunk. What happens if the environment variable MALLOC_CHECK_ is set to 0? Can you still reproduce it (be aware that this setting might also affect other applications)?

------- Additional comments from megware Fri Nov 27 03:56:46 -0700 2009 -------
I modified /etc/init.d/sgeexecd to set MALLOC_CHECK_ right before sgeexecd is invoked and tried two approaches:

[...]
export MALLOC_CHECK_=0
$bin_dir/sge_execd
[...]

and

[...]
MALLOC_CHECK_=0 $bin_dir/sge_execd
[...]

Neither caused a difference in behaviour, i.e. the job still dies, a SIGSEGV is reported in the qmaster log, and a segfault line appears in dmesg on the node. Interestingly, I have so far failed to reproduce this on a virtual machine installation here. The main difference to the cluster is that the cluster has a more complex setup involving NIS and automount for user homes (I can't really simulate this). Could that be related?
stephan

------- Additional comments from shaas Fri Nov 27 05:03:25 -0700 2009 -------
As it only appears in combination with NIS, this is a known bug and it's already fixed in maintrunk (SH-2009-10-29-0). See issue 3193 for more information.
*** This issue has been marked as a duplicate of 3193 ***

------- Additional comments from megware Fri Jan 22 02:29:09 -0700 2010 -------
I upgraded to 6.2u5 and it is not solved. The error output looks a bit different now. I see

01/22/2010 10:13:36|worker|frontend1|I|removing trigger to terminate job 12.1
01/22/2010 10:13:36|worker|frontend1|W|job 12.1 failed on host node02.service assumedly after job because: job 12.1 died through signal ABRT (6)

in the qmaster messages file. The signal is now ABRT, not SEGV. As a consequence (?) there is no segfault line in dmesg on the compute nodes anymore. Increasing the log level to info does not reveal more information.
Maybe this is a different problem and not related to 3193/3192?

------- Additional comments from reuti Fri Jan 22 04:12:30 -0700 2010 -------
There was also a discussion on the mailing list about ABRT, http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=222925&dsForumId=38, which stopped at some point. I wonder where the SIGABRT is coming from. I think it's not generated by SGE but by something else, and you only see the result of it. I checked SGE's source and AFAICS it's not sent anywhere therein.

------- Additional comments from megware Fri Jan 22 05:50:08 -0700 2010 -------
Is sge_execd/sge_shepherd doing anything related to UIDs or GIDs during the early phases of a job? I ask because a key difference between my test system (where this does _not_ happen) and the cluster (where it happens) is NIS.

------- Additional comments from megware Fri Jan 22 06:30:30 -0700 2010 -------
The ABRT discussion on the mailing list sounds very much like the same thing. Resources do not seem to matter much; any job here has this problem. I can trigger it by running something as simple as

$ echo "sleep 10; hostname" | qsub
Your job 12 ("STDIN") has been submitted
$

Whether I call a binary (that is known to work) or do something like 'touch $HOME/somefile', with or without a job script, SGE does not seem to actually invoke it.

------- Additional comments from reuti Fri Jan 22 06:43:24 -0700 2010 -------
Any hints when you use: echo "strace sleep 10; strace hostname" | qsub

------- Additional comments from megware Fri Jan 22 07:21:31 -0700 2010 -------
Nope. Gives nothing but the usual:

01/22/2010 15:13:21|worker|frontend1|W|job 13.1 failed on host node02.service assumedly after job because: job 13.1 died through signal ABRT (6)

Running it directly works:

$ ssh node02.service strace sleep 10
execve("/bin/sleep", ["sleep", "10"], [/* 50 vars */]) = 0
brk(0) = 0x606000
[...]
clock_gettime(CLOCK_MONOTONIC, {2781295, 74039388}) = 0
nanosleep({10, 0}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {2781305, 74160613}) = 0
close(1) = 0
close(2) = 0
exit_group(0)

When watching ps output on the node during a job I never see a 'sleep' process. There is only sge_shepherd visible, in 'defunct' state (I might be too slow and not catch it at the right moment, however).

------- Additional comments from reuti Sat Jan 23 07:21:49 -0700 2010 -------
I just checked on an openSUSE 11.2 system and the latest binaries are working fine for me, even when using local NIS. What about renaming sge_shepherd to sge_shepherd.orig and using "strace sge_shepherd > /tmp/strace.shepherd" in a wrapper script? The other option is to try the debugging facility inside SGE: http://blogs.sun.com/templedf/entry/using_debugging_output

------- Additional comments from megware Mon Jan 25 05:35:40 -0700 2010 -------
Good idea. Here we go:

$ echo "sleep 10;" | qsub
Your job 15 ("STDIN") has been submitted

node02:/var/spool/sge/node02 # less /tmp/shepherd.strace
01/25/2010 13:29:22 [0:9087]: shepherd called with uid = 0, euid = 0
01/25/2010 13:29:22 [104:9087]: starting up 6.2u5
01/25/2010 13:29:22 [104:9087]: setpgid(9087, 9087) returned 0
01/25/2010 13:29:22 [104:9087]: do_core_binding: "binding" parameter not found in config file
01/25/2010 13:29:22 [104:9087]: no prolog script to start
01/25/2010 13:29:22 [104:9088]: child: starting son(job, /var/spool/sge/node02/job_scripts/15, 0);
01/25/2010 13:29:22 [104:9087]: parent: forked "job" with pid 9088
01/25/2010 13:29:22 [104:9087]: parent: job-pid: 9088
01/25/2010 13:29:22 [104:9087]: wait3 returned 9088 (status: 6; WIFSIGNALED: 1, WIFEXITED: 0, WEXITSTATUS: 0)
01/25/2010 13:29:22 [104:9087]: job exited with exit status 0
01/25/2010 13:29:22 [104:9087]: reaped "job" with pid 9088
01/25/2010 13:29:22 [104:9087]: job exited due to signal
01/25/2010 13:29:22 [104:9087]: job signaled: 6
01/25/2010 13:29:22 [104:9087]: now sending signal KILL to pid -9088
01/25/2010 13:29:22 [104:9087]: writing usage file to "usage"
01/25/2010 13:29:22 [104:9087]: no tasker to notify
01/25/2010 13:29:22 [104:9087]: no epilog script to start

Hmm, that does not look like strace. I'll check whether I failed to catch stderr and will try -f. But maybe the above gives some hint already?

------- Additional comments from megware Mon Jan 25 06:27:29 -0700 2010 -------
Created an attachment (id=197): strace -f -ff sge_shepherd, main process

------- Additional comments from megware Mon Jan 25 06:28:11 -0700 2010 -------
Created an attachment (id=198): strace -f -ff sge_shepherd, forked child process

------- Additional comments from megware Mon Jan 25 06:32:16 -0700 2010 -------
Please take a look at the two attached files. In the forked child process (shepherd.strace.9390) a glibc 'free(): invalid pointer' error appears. I don't understand what's causing it, but that seems to be what triggers the ABRT signal.

------- Additional comments from reuti Mon Jan 25 07:03:48 -0700 2010 -------
Yep, this is something I also found as a possible cause of the ABRT signal. But why is it not happening to me? Maybe it depends on the actual configuration of certain queues or alike. Can you now try again with your:

[...]
export MALLOC_CHECK_=0
$bin_dir/sge_execd
[...]

You are using 6.2u5?

------- Additional comments from shaas Mon Jan 25 07:06:06 -0700 2010 -------
This should be fixed in maintrunk. Could you recompile grid engine with the latest jemalloc.c from maintrunk (version 1.6) and give it a try?

------- Additional comments from megware Mon Jan 25 08:02:19 -0700 2010 -------
I recompiled the most recent code from maintrunk, including jemalloc.c v1.6, and moved the new sge_shepherd binary into the installation (sufficient? Or do I have to replace other binaries/libs as well?). It's still crashing and does not look much different to me:

[...]
read(6, "# /etc/default/nss\n# This file c"..., 4096) = 1685
read(6, "", 4096) = 0
open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address)
writev(2, [{"*** glibc detected *** ", 23}, {"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {": ", 2}, {"free(): invalid pointer", 23}, {": 0x", 4}, {"00002b7aee34e080", 16}, {" ***\n", 5}], 7) = 122
open("/etc/ld.so.cache", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=42808, ...}) = 0
mmap(NULL, 42808, PROT_READ, MAP_PRIVATE, 7, 0) = 0x2b7aee401000
close(7) = 0
open("/lib64/libgcc_s.so.1", O_RDONLY) = 7
read(7, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200-\0\0\0\0\0\0"..., 832) = 832
fstat(7, {st_mode=S_IFREG|0755, st_size=92648, ...}) = 0
mmap(NULL, 2188280, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 7, 0) = 0x2b7aef4fa000
fadvise64(7, 0, 2188280, POSIX_FADV_WILLNEED) = 0
mprotect(0x2b7aef510000, 2093056, PROT_NONE) = 0
mmap(0x2b7aef70f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x15000) = 0x2b7aef70f000
close(7) = 0
mprotect(0x2b7aef70f000, 4096, PROT_READ) = 0
munmap(0x2b7aee401000, 42808) = 0
futex(0x2b7aeeec65b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x2b7aef710190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "======= Backtrace: =========\n", 29) = 29
writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"2b7aeebdfc76", 12}, {"]\n", 2}], 4) = 33
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"cfree", 5}, {"+0x", 3}, {"6c", 2}, {")", 1}, {"[0x", 3}, {"2b7aeebe496c", 12}, {"]\n", 2}], 9) = 45
writev(2, [{"/lib64/libnsl.so.1", 18}, {"[0x", 3}, {"2b7aef2f1c2a", 12}, {"]\n", 2}], 4) = 35
writev(2, [{"/lib64/libpthread.so.0", 22}, {"(", 1}, {"pthread_once", 12}, {"+0x", 3}, {"53", 2}, {")", 1}, {"[0x", 3}, {"2b7aee95c0f3", 12}, {"]\n", 2}], 9) = 58
writev(2, [{"/lib64/libnsl.so.1", 18}, {"(", 1}, {"_nsl_default_nss", 16}, {"+0x", 3}, {"21", 2}, {")", 1}, {"[0x", 3}, {"2b7aef2f1d41", 12}, {"]\n", 2}], 9) = 58
writev(2, [{"/lib64/libnss_nis.so.2", 22}, {"(", 1}, {"_nss_nis_initgroups_dyn", 23}, {"+0x", 3}, {"6a", 2}, {")", 1}, {"[0x", 3}, {"2b7aef0dec8a", 12}, {"]\n", 2}], 9) = 69
writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"2b7aeec0a1c2", 12}, {"]\n", 2}], 4) = 33
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"initgroups", 10}, {"+0x", 3}, {"6c", 2}, {")", 1}, {"[0x", 3}, {"2b7aeec0a38c", 12}, {"]\n", 2}], 9) = 50
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"51eca4", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"51ea47", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"40d3f3", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"409841", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"409276", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, {"+0x", 3}, {"fd", 2}, {")", 1}, {"[0x", 3}, {"2b7aeeb8ba7d", 12}, {"]\n", 2}], 9) = 57
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"408009", 6}, {"]\n", 2}], 4) = 60
write(2, "======= Memory map: ========\n", 29) = 29
open("/proc/self/maps", O_RDONLY) = 7
read(7, "00400000-00576000 r-xp 00000000 "..., 1024) = 1024
write(2, "00400000-00576000 r-xp 00000000 "..., 1024) = 1024
read(7, "o\n2b7aee6f9000-2b7aee6fa000 r--p"..., 1024) = 1024
write(2, "o\n2b7aee6f9000-2b7aee6fa000 r--p"..., 1024) = 1024
read(7, ":00 0 \n2b7aeeb6d000-2b7aeecbe000"..., 1024) = 1024
write(2, ":00 0 \n2b7aeeb6d000-2b7aeecbe000"..., 1024) = 1024
read(7, " /lib64/libnss_nis-2.10.1.so\n2b"..., 1024) = 1024
write(2, " /lib64/libnss_nis-2.10.1.so\n2b"..., 1024) = 1024
read(7, " /lib64/libgcc_s.so.1\n7fffbc7"..., 1024) = 270
write(2, " /lib64/libgcc_s.so.1\n7fffbc7"..., 270) = 270
read(7, "", 1024) = 0
close(7) = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(10042, 10042, SIGABRT) = 0
--- SIGABRT (Aborted) @ 0 (0) ---

------- Additional comments from megware Mon Feb 8 07:45:08 -0700 2010 -------
Do you need any further information or tests?

------- Additional comments from megware Wed Sep 22 04:56:42 -0700 2010 -------
Observations found on the ge-users list just now:

===
Date: Wed, 22 Sep 2010 13:39:04 +0200
From: juanjo <juanjo.gutierrez@jeppesen.com>
To: users@gridengine.sunsource.net
Subject: Re: [GE users] anyone running GE 6.2u{5,6} on openSUSE 11.3

On Fri, 6 Aug 2010 14:06:46 +0200 reuti wrote:
> > the issue 3194 "sge_shepherd segfault on OpenSuSE 11.2 (x86_64)" is
> > still open. Has anyone managed to run GE 6.2u{5,6} on openSUSE 11.2
> > or the recently released openSUSE 11.3 ?
>
> my experience with this issue is a little bit weird: on some systems
> with 11.2 it's working w/o any problems, on others it's crashing. As
> these are desktop machines with different mainboards (where we run
> SGE local) my observations are further, that systems with the same
> mainboards share the behavior. So it could be a side effect of one of
> the kernel modules in interaction with SGE.

We were testing SGE 6.2u5 on SuSE 11 and ran into this same issue. After some fiddling around we found out that starting the nscd daemon makes sge_shepherd behave correctly. It might also serve as an indication of what is wrong in the source. According to the strace attached to the issue, nscd was also not started there. I don't seem to be able to update the issue on the issue tracker, so someone who can, please do :)
/juanjo
===

I'll check whether nscd is/was running on that system.
stephan
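The backtrace in the comments ends in initgroups() descending into libnss_nis, and the mailing-list post reports that a running nscd makes the shepherd behave. Both conditions can be probed outside SGE with standard Linux tools; a minimal sketch (nothing here is SGE-specific):

```shell
# exercise the NSS group-lookup path, the same one the backtrace shows
# the shepherd entering via initgroups -> libnss_nis when NIS is configured
getent group >/dev/null && echo "group lookup ok"

# the workaround reported on ge-users: sge_shepherd behaves once the
# name-service cache daemon is running, so check whether nscd is up
if pgrep -x nscd >/dev/null; then
    echo "nscd is running"
else
    echo "nscd is not running"
fi
```

On an affected node, comparing the job's behaviour with nscd stopped versus started would confirm whether this ticket matches the mailing-list observation.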
Change History (2)
comment:1 Changed 10 years ago by dlove
- Resolution set to duplicate
- Severity set to minor
- Status changed from new to closed
comment:2 Changed 10 years ago by admin
- Description modified (diff)
Duplicate of IZ3193.