Opened 7 years ago

Closed 6 years ago

Last modified 6 years ago

#749 closed defect (duplicate)

IZ3194: sge_shepherd segfault on OpenSuSE 11.2 (x86_64)

Reported by: megware Owned by:
Priority: highest Milestone:
Component: sge Version: current
Severity: minor Keywords: PC Linux execution
Cc:

Description (last modified by admin)

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3194]

        Issue #:      3194             Platform:     PC        Reporter: megware (megware)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      current      CC:    None defined
        Status:       REOPENED         Priority:     P1
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    shaas (shaas)
      QA Contact:     pollinger
          URL:
       * Summary:     sge_shepherd segfault on OpenSuSE 11.2 (x86_64)
   Status whiteboard:
      Attachments:
                    Mon Jan 25 06:27:00 -0700 2010: shepherd.strace.9389 -- strace -f -ff sge_shepherd, main process (text/plain), submitted by megware
                    Mon Jan 25 06:28:00 -0700 2010: shepherd.strace.9390 -- strace -f -ff sge_shepherd, forked child process (text/plain), submitted by megware

     Issue 3194 blocks:
   Votes for issue 3194:  19


   Opened: Thu Nov 26 07:25:00 -0700 2009 
------------------------


sge_shepherd seems to die instantly at job startup. When I run qsub (i.e. echo "sleep 10" | qsub) and watch the scheduling with qstat,
the job goes to the running state ('r') and disappears from the job list right afterwards.

At that moment the qmaster log always shows lines like these:
11/26/2009 15:01:38|worker|frontend1|I|removing trigger to terminate job 6.1
11/26/2009 15:01:38|worker|frontend1|W|job 6.1 failed on host node11.service assumedly after job because: job 6.1 died through signal SEGV (11)

Checking node11.service reveals this in dmesg:
sge_shepherd[16061]: segfault at 2b9670000000 ip 00002b9673450939 sp 00007fff37f61d40 error 4 in libc-2.10.1.so[2b96733d9000+151000]

(one such line appears per job started)
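
For reference, the faulting instruction pointer in such a dmesg line can be converted into an offset inside libc and, with debug info installed, resolved to a symbol; a quick sketch using the addresses from the line above:

$ printf '%x\n' $((0x2b9673450939 - 0x2b96733d9000))   # faulting ip minus libc mapping base
77939
$ addr2line -f -e /lib64/libc-2.10.1.so 0x77939        # needs glibc debuginfo for a useful file:line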

   ------- Additional comments from megware Fri Nov 27 02:40:06 -0700 2009 -------
This one is quite important to me (and I hope I am not the only one using OpenSuSE 11.2?). Please let me know if you need additional input
and/or want me to try something.

cheers,
stephan

   ------- Additional comments from shaas Fri Nov 27 03:14:27 -0700 2009 -------
I'll take care of it!

   ------- Additional comments from shaas Fri Nov 27 03:19:35 -0700 2009 -------
This error seems to be related to issue 3193 and issue 3192, which are already fixed in maintrunk.

What happens if the environment variable MALLOC_CHECK_ is set to 0?
Can you still reproduce it then (be aware that this setting might also affect other applications)?

   ------- Additional comments from megware Fri Nov 27 03:56:46 -0700 2009 -------
I modified /etc/init.d/sgeexecd to set MALLOC_CHECK_ right before sge_execd is invoked and tried two approaches:

[...]
      export MALLOC_CHECK_=0
      $bin_dir/sge_execd
[...]

and

[...]
      MALLOC_CHECK_=0 $bin_dir/sge_execd
[...]

Neither made any difference in behaviour: the job still dies, SIGSEGV is reported in the qmaster log, and a segfault line
appears in dmesg on the node.
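
One way to verify that the variable actually reached the daemon is to inspect its environment; a quick check (assuming a single sge_execd process on the node):

$ tr '\0' '\n' < /proc/$(pgrep -o sge_execd)/environ | grep MALLOC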

Interestingly, I have so far failed to reproduce this on a virtual machine installation here. The main difference from the cluster is that the cluster has a
more complex setup involving NIS and automounted user homes (which I can't really simulate). Could that be related?

stephan

   ------- Additional comments from shaas Fri Nov 27 05:03:25 -0700 2009 -------
As it only appears in combination with NIS, this is a known bug that is already fixed in maintrunk (SH-2009-10-29-0).
See issue 3193 for more information.

*** This issue has been marked as a duplicate of 3193 ***

   ------- Additional comments from megware Fri Jan 22 02:29:09 -0700 2010 -------
I upgraded to 6.2u5 and the problem is not solved. The error output looks a bit different now. I see

01/22/2010 10:13:36|worker|frontend1|I|removing trigger to terminate job 12.1
01/22/2010 10:13:36|worker|frontend1|W|job 12.1 failed on host node02.service assumedly after job because: job 12.1 died through signal ABRT (6)

in the qmaster messages file. The signal is now ABRT, not SEGV. As a consequence (?), there is no segfault line in dmesg on the compute
nodes anymore. Increasing the log level to info does not reveal more information.

Maybe this is a different problem and not related to 3193/3192?

   ------- Additional comments from reuti Fri Jan 22 04:12:30 -0700 2010 -------
There was also a discussion on the mailing list about ABRT (http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=222925&dsForumId=38)
which stopped at some point. I wonder where the SIGABRT is coming from. I think it's not generated by SGE but by something else, and you only
see the result of it. I checked SGE's source and AFAICS it's not sent anywhere therein.

   ------- Additional comments from megware Fri Jan 22 05:50:08 -0700 2010 -------
Is sge_execd/sge_shepherd doing anything related to UIDs or GIDs during the early phases of a job?

I ask because a key difference between my test system (where this does _not_ happen) and the cluster (where it happens) is NIS.

   ------- Additional comments from megware Fri Jan 22 06:30:30 -0700 2010 -------
The ABRT discussion on the mailing list sounds very much like the same thing.

Resources do not seem to matter much; any job here has this problem. I can trigger it by running something as simple as

$ echo "sleep 10; hostname" | qsub
Your job 12 ("STDIN") has been submitted
$

If I call a binary (that is known to work) or do something like 'touch $HOME/somefile', with or without a job script, SGE does not seem to
actually invoke it.

   ------- Additional comments from reuti Fri Jan 22 06:43:24 -0700 2010 -------
Any hints when you use:

echo "strace sleep 10; strace hostname" | qsub

   ------- Additional comments from megware Fri Jan 22 07:21:31 -0700 2010 -------
Nope. Gives nothing but the usual:

01/22/2010 15:13:21|worker|frontend1|W|job 13.1 failed on host node02.service assumedly after job because: job 13.1 died through signal ABRT (6)

Running it directly works:

$ ssh node02.service strace sleep 10
execve("/bin/sleep", ["sleep", "10"], [/* 50 vars */]) = 0
brk(0)                                  = 0x606000
[...]
clock_gettime(CLOCK_MONOTONIC, {2781295, 74039388}) = 0
nanosleep({10, 0}, NULL)                = 0
clock_gettime(CLOCK_MONOTONIC, {2781305, 74160613}) = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)

When watching ps output on the node during a job, I never see a 'sleep' process; only sge_shepherd is visible, in 'defunct' state (though I
might be too slow to catch it at the right moment).



   ------- Additional comments from reuti Sat Jan 23 07:21:49 -0700 2010 -------
I just checked on an openSUSE 11.2 system and the latest binaries are working fine for me, even when using local NIS.

What about renaming sge_shepherd to sge_shepherd.orig and using strace sge_shepherd > /tmp/strace.shepherd in a wrapper script (see the sketch below)? The other
option is to try the debugging facility inside SGE: http://blogs.sun.com/templedf/entry/using_debugging_output
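
A minimal wrapper sketch (paths are assumptions; the real binary is renamed to sge_shepherd.orig first):

      #!/bin/sh
      # wrapper installed in place of $SGE_ROOT/bin/lx24-amd64/sge_shepherd
      # -ff writes one trace file per process, e.g. /tmp/shepherd.strace.<pid>
      exec strace -f -ff -o /tmp/shepherd.strace \
          $SGE_ROOT/bin/lx24-amd64/sge_shepherd.orig "$@"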

   ------- Additional comments from megware Mon Jan 25 05:35:40 -0700 2010 -------
Good idea. Here we go:

$ echo "sleep 10;" | qsub
Your job 15 ("STDIN") has been submitted

node02:/var/spool/sge/node02 # less /tmp/shepherd.strace
01/25/2010 13:29:22 [0:9087]: shepherd called with uid = 0, euid = 0
01/25/2010 13:29:22 [104:9087]: starting up 6.2u5
01/25/2010 13:29:22 [104:9087]: setpgid(9087, 9087) returned 0
01/25/2010 13:29:22 [104:9087]: do_core_binding: "binding" parameter not found in config file
01/25/2010 13:29:22 [104:9087]: no prolog script to start
01/25/2010 13:29:22 [104:9088]: child: starting son(job, /var/spool/sge/node02/job_scripts/15, 0);
01/25/2010 13:29:22 [104:9087]: parent: forked "job" with pid 9088
01/25/2010 13:29:22 [104:9087]: parent: job-pid: 9088
01/25/2010 13:29:22 [104:9087]: wait3 returned 9088 (status: 6; WIFSIGNALED: 1,  WIFEXITED: 0, WEXITSTATUS: 0)
01/25/2010 13:29:22 [104:9087]: job exited with exit status 0
01/25/2010 13:29:22 [104:9087]: reaped "job" with pid 9088
01/25/2010 13:29:22 [104:9087]: job exited due to signal
01/25/2010 13:29:22 [104:9087]: job signaled: 6
01/25/2010 13:29:22 [104:9087]: now sending signal KILL to pid -9088
01/25/2010 13:29:22 [104:9087]: writing usage file to "usage"
01/25/2010 13:29:22 [104:9087]: no tasker to notify
01/25/2010 13:29:22 [104:9087]: no epilog script to start
--

Hmm, that does not look like strace output. I'll check whether I failed to catch stderr and will try -f. But maybe the above gives some hint already?


   ------- Additional comments from megware Mon Jan 25 06:27:29 -0700 2010 -------
Created an attachment (id=197)
strace -f -ff sge_shepherd, main process

   ------- Additional comments from megware Mon Jan 25 06:28:11 -0700 2010 -------
Created an attachment (id=198)
strace -f -ff sge_shepherd, forked child process

   ------- Additional comments from megware Mon Jan 25 06:32:16 -0700 2010 -------
Please take a look at the two attached files. In the forked child process (shepherd.strace.9390) a glibc 'free(): invalid pointer'
error appears. I don't understand what's causing it, but that seems to be what triggers the ABRT signal.

   ------- Additional comments from reuti Mon Jan 25 07:03:48 -0700 2010 -------
Yep, this is something I also found as a possible cause of the ABRT signal. But why is it not happening to me? Maybe it depends on the
actual configuration of certain queues or the like. Can you now try again with your:

[...]
      export MALLOC_CHECK_=0
      $bin_dir/sge_execd
[...]

Are you using 6.2u5?

   ------- Additional comments from shaas Mon Jan 25 07:06:06 -0700 2010 -------
This should be fixed in maintrunk.
Could you recompile Grid Engine with the latest jemalloc.c from maintrunk (version 1.6) and give it a try?

   ------- Additional comments from megware Mon Jan 25 08:02:19 -0700 2010 -------
I recompiled the most recent code from maintrunk, including jemalloc.c v1.6, and moved the new sge_shepherd binary into the installation
(is that sufficient, or do I have to replace other binaries/libraries as well?).

It's still crashing, and the trace does not look much different to me:

[...]
read(6, "# /etc/default/nss\n# This file c"..., 4096) = 1685
read(6, "", 4096)                       = 0
open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address)
writev(2, [{"*** glibc detected *** ", 23}, {"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {": ", 2}, {"free(): invalid pointer", 23}, {":
0x", 4}, {"00002b7aee34e080", 16}, {" ***\n", 5}], 7) = 122
open("/etc/ld.so.cache", O_RDONLY)      = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=42808, ...}) = 0
mmap(NULL, 42808, PROT_READ, MAP_PRIVATE, 7, 0) = 0x2b7aee401000
close(7)                                = 0
open("/lib64/libgcc_s.so.1", O_RDONLY)  = 7
read(7, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200-\0\0\0\0\0\0"..., 832) = 832
fstat(7, {st_mode=S_IFREG|0755, st_size=92648, ...}) = 0
mmap(NULL, 2188280, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 7, 0) = 0x2b7aef4fa000
fadvise64(7, 0, 2188280, POSIX_FADV_WILLNEED) = 0
mprotect(0x2b7aef510000, 2093056, PROT_NONE) = 0
mmap(0x2b7aef70f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x15000) = 0x2b7aef70f000
close(7)                                = 0
mprotect(0x2b7aef70f000, 4096, PROT_READ) = 0
munmap(0x2b7aee401000, 42808)           = 0
futex(0x2b7aeeec65b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x2b7aef710190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "======= Backtrace: =========\n", 29) = 29
writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"2b7aeebdfc76", 12}, {"]\n", 2}], 4) = 33
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"cfree", 5}, {"+0x", 3}, {"6c", 2}, {")", 1}, {"[0x", 3}, {"2b7aeebe496c", 12}, {"]\n", 2}],
9) = 45
writev(2, [{"/lib64/libnsl.so.1", 18}, {"[0x", 3}, {"2b7aef2f1c2a", 12}, {"]\n", 2}], 4) = 35
writev(2, [{"/lib64/libpthread.so.0", 22}, {"(", 1}, {"pthread_once", 12}, {"+0x", 3}, {"53", 2}, {")", 1}, {"[0x", 3}, {"2b7aee95c0f3",
12}, {"]\n", 2}], 9) = 58
writev(2, [{"/lib64/libnsl.so.1", 18}, {"(", 1}, {"_nsl_default_nss", 16}, {"+0x", 3}, {"21", 2}, {")", 1}, {"[0x", 3}, {"2b7aef2f1d41",
12}, {"]\n", 2}], 9) = 58
writev(2, [{"/lib64/libnss_nis.so.2", 22}, {"(", 1}, {"_nss_nis_initgroups_dyn", 23}, {"+0x", 3}, {"6a", 2}, {")", 1}, {"[0x", 3},
{"2b7aef0dec8a", 12}, {"]\n", 2}], 9) = 69
writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"2b7aeec0a1c2", 12}, {"]\n", 2}], 4) = 33
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"initgroups", 10}, {"+0x", 3}, {"6c", 2}, {")", 1}, {"[0x", 3}, {"2b7aeec0a38c", 12},
{"]\n", 2}], 9) = 50
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"51eca4", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"51ea47", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"40d3f3", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"409841", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"409276", 6}, {"]\n", 2}], 4) = 60
writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, {"+0x", 3}, {"fd", 2}, {")", 1}, {"[0x", 3}, {"2b7aeeb8ba7d", 12},
{"]\n", 2}], 9) = 57
writev(2, [{"/opt/sge/bin/lx24-amd64/sge_shep"..., 49}, {"[0x", 3}, {"408009", 6}, {"]\n", 2}], 4) = 60
write(2, "======= Memory map: ========\n", 29) = 29
open("/proc/self/maps", O_RDONLY)       = 7
read(7, "00400000-00576000 r-xp 00000000 "..., 1024) = 1024
write(2, "00400000-00576000 r-xp 00000000 "..., 1024) = 1024
read(7, "o\n2b7aee6f9000-2b7aee6fa000 r--p"..., 1024) = 1024
write(2, "o\n2b7aee6f9000-2b7aee6fa000 r--p"..., 1024) = 1024
read(7, ":00 0 \n2b7aeeb6d000-2b7aeecbe000"..., 1024) = 1024
write(2, ":00 0 \n2b7aeeb6d000-2b7aeecbe000"..., 1024) = 1024
read(7, "  /lib64/libnss_nis-2.10.1.so\n2b"..., 1024) = 1024
write(2, "  /lib64/libnss_nis-2.10.1.so\n2b"..., 1024) = 1024
read(7, "    /lib64/libgcc_s.so.1\n7fffbc7"..., 1024) = 270
write(2, "    /lib64/libgcc_s.so.1\n7fffbc7"..., 270) = 270
read(7, "", 1024)                       = 0
close(7)                                = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(10042, 10042, SIGABRT)           = 0
--- SIGABRT (Aborted) @ 0 (0) ---
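
The backtrace above points at initgroups() descending through libnsl/libnss_nis (_nss_nis_initgroups_dyn), i.e. the crash happens while resolving supplementary groups via NIS. The same NSS code path can be exercised outside SGE; a diagnostic sketch ('someuser' stands for any NIS account):

$ strace -f -o /tmp/id.strace id someuser   # id resolves group membership through the same NSS modules
$ grep nscd /tmp/id.strace                  # shows whether the nscd socket was consulted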


   ------- Additional comments from megware Mon Feb 8 07:45:08 -0700 2010 -------
Do you need any further information or tests?

   ------- Additional comments from megware Wed Sep 22 04:56:42 -0700 2010 -------
Observations found on the ge-users list just now:

===
Date: Wed, 22 Sep 2010 13:39:04 +0200
From: juanjo <juanjo.gutierrez@jeppesen.com>
To: users@gridengine.sunsource.net
Subject: Re: [GE users] anyone running GE 6.2u{5,6} on openSUSE 11.3

On Fri, 6 Aug 2010 14:06:46 +0200 reuti wrote:

> > the issue 3194 "sge_shepherd segfault on OpenSuSE 11.2 (x86_64)" is
> > still open. Has anyone managed to run GE 6.2u{5,6} on openSUSE 11.2
> > or the recently released openSUSE 11.3 ?
>
> my experience with this issue is a little bit weird: on some systems
> with 11.2 it's working w/o any problems, on others it's crashing. As
> these are desktop machines with different mainboards (where we run
> SGE local) my observations are further, that systems with the same
> mainboards share the behavior. So it could be a side effect of one of
> the kernel modules in interaction with SGE.

We were testing SGE 6.2u5 on SuSE 11 and ran into this same issue.
After some fiddling around we've found out that starting the nscd
daemon makes sge_shepherd behave correctly. It might also serve as an
indication of what is wrong with the source.

According to the strace present on the issue, nscd was also not
started. I don't seem to be able to update the issue on the issue
tracker, so someone that does, please do :)

/juanjo
===

I'll see whether nscd is/was running on that system.
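
On openSUSE the daemon can typically be checked and enabled as follows (exact commands may vary by release):

$ rcnscd status          # or: /etc/init.d/nscd status
$ rcnscd start
$ chkconfig nscd on      # enable at boot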

stephan

Change History (2)

comment:1 Changed 6 years ago by dlove

  • Resolution set to duplicate
  • Severity set to minor
  • Status changed from new to closed

Duplicate of IZ3193.

comment:2 Changed 6 years ago by admin

  • Description modified (diff)