Opened 7 years ago

Closed 7 years ago

#789 closed defect (duplicate)

IZ3251: Repeated qmaster SEGVs

Reported by: fx
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.2u5
Severity: minor
Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3251]

             Issue #:  3251
             Summary:  Repeated qmaster SEGVs
            Reporter:  fx (fx)
           Component:  gridengine
        Subcomponent:  qmaster
             Version:  6.2u5
            Platform:  All
                  OS:  All
              Status:  NEW
          Resolution:
            Priority:  P2
          Issue type:  DEFECT
    Target milestone:  ---
         Assigned to:  ernst (ernst)
          QA Contact:  ernst
                  CC:  None defined
                 URL:
   Status whiteboard:
         Attachments:
   Issue 3251 blocks:
Votes for issue 3251:  38


   Opened: Mon Mar 15 10:15:00 -0700 2010 
------------------------


Others and I are seeing persistent qmaster SEGVs for no apparent
reason (e.g. see the thread at
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248192).
I'm currently seeing one every 1-2 hours.  They started without any
recent configuration change that could have triggered them, and without
any major change in the sort of jobs running, as far as I can tell.

So far I've managed to trap an instance under gdb and have dumped a
core file, but obviously can't keep that session going.  The following
is with 6.2u5 on x86_64 RedHat 5 (although with a CentOS+ kernel).
The backtrace is

  #0  0x0000000000587b5d in lCopySwitchPack (sep=0x2aaaaaf1ac40,
      dep=0x2aaac5461000, src_idx=6, dst_idx=6, isHash=true, ep=0x0, pb=0x0)
      at ../libs/cull/cull_list.c:393
  #1  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaaaf1ac40, isHash=true)
      at ../libs/cull/cull_list.c:180
  #2  0x0000000000587f35 in lCopyListHash (name=<value optimized out>,
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #3  0x00000000005912ac in lSelectHashPack (name=0x2aaaab36cda0 "pe_tasks",
      slp=0x2aaaab36e2c0, cp=0x0, enp=0x6, isHash=true, pb=0x0)
      at ../libs/cull/cull_db.c:833
  #4  0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>,
      dep=0x2aaac5460dc0, src_idx=6, dst_idx=<value optimized out>,
      isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
  #5  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370d80, isHash=true)
      at ../libs/cull/cull_list.c:180
  #6  0x0000000000587f35 in lCopyListHash (name=<value optimized out>,
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #7  0x00000000005912ac in lSelectHashPack (name=0x2aaaab57cc10 "",
      slp=0x2aaaab36e290, cp=0x0, enp=0x6, isHash=true, pb=0x0)
      at ../libs/cull/cull_db.c:833
  #8  0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>,
      dep=0x2aaac5460940, src_idx=6, dst_idx=<value optimized out>,
      isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
  #9  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370b80, isHash=true)
      at ../libs/cull/cull_list.c:180
  #10 0x0000000000587f35 in lCopyListHash (name=<value optimized out>,
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #11 0x0000000000433806 in sge_scheduler_main (arg=0x2aaaab2ed2a0)
      at ../daemons/qmaster/sge_thread_scheduler.c:791
  #12 0x0000003910406617 in start_thread () from /lib64/libpthread.so.0
  #13 0x000000390f8d3c2d in clone () from /lib64/libc.so.6

and the problem seems to be that glp is bogus (str shares the same union
storage, so it shows the same out-of-bounds value):

  (gdb) p sep->cont[src_idx]
  $7 = {fl = 2.80259693e-45, db = 2.3176438028923692e-310, ul = 2,
    l = 46909632806914, c = 2 '\002', b = 2 '\002', i = 2,
    str = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>,
    glp = 0x2aaa00000002, obj = 0x2aaa00000002, ref = 0x2aaa00000002,
    host = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>,
    cp = 0x2aaa00000002}

in the else arm of

     case lListT:
        if ((tlp = sep->cont[src_idx].glp) == NULL)
           dep->cont[dst_idx].glp = NULL;
        else {
           dep->cont[dst_idx].glp = lSelectHashPack(tlp->listname, tlp, NULL,
                                                    ep, isHash, pb);
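
One possible reading of the gdb dump, offered as a guess rather than a diagnosis: the value 0x2aaa00000002 looks like what is left when a 32-bit value of 2 is stored over a slot that previously held a 0x2aaa... pointer, so the lListT branch then follows a half-overwritten pointer.  The sketch below reproduces only that bit pattern.  demo_cont_t and demo_stale_pointer_bits are hypothetical names, the union is a cut-down stand-in for the real CULL one, and the result assumes a little-endian machine such as the x86_64 host in this report.

```c
#include <stdint.h>
#include <string.h>

/* Cut-down, hypothetical stand-in for the content union printed by gdb.
 * The real definition is in the CULL sources and is not reproduced here. */
typedef union {
    uint32_t ul;    /* 32-bit value, like the `ul` member in the dump */
    char    *str;   /* string pointer */
    void    *glp;   /* list pointer (an lList * in the real code) */
    uint64_t raw;   /* added only so the demo can inspect all 64 bits */
} demo_cont_t;

/* Put a pointer-sized bit pattern in the slot, overwrite only the bytes
 * of the 32-bit member, and return the 64 bits a pointer read would see. */
uint64_t demo_stale_pointer_bits(uint64_t old_pointer_bits, uint32_t new_ul)
{
    demo_cont_t c;

    c.raw = old_pointer_bits;
    /* memcpy models a 32-bit store into `ul` while keeping the C
     * behaviour fully defined (the other four bytes stay untouched). */
    memcpy(&c.ul, &new_ul, sizeof new_ul);
    return c.raw;
}
```

On little-endian x86_64, demo_stale_pointer_bits(0x2aaaab36e2c0, 2) yields 0x2aaa00000002, the value seen in sep->cont[src_idx].  The starting address is borrowed from frame #3 purely for illustration; nothing in the core file says what the slot held before.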

I haven't been able to look at it for long.  I'll try to investigate
further and add information here if I find anything useful.  Unfortunately
this build uses the default RPM optimization, which leaves various variables
optimized out, so I may need to rebuild it too.

   ------- Additional comments from fx Tue Mar 16 08:18:26 -0700 2010 -------
I haven't been able to get to grips with the code to debug this yet,
but the crash is always at that location in ~10 core dumps on our
system.  I think someone else reported SEGVs which weren't consistent
like that.

   ------- Additional comments from fx Fri Mar 19 09:56:15 -0700 2010 -------
In fact, there is at least one other failure mode.  I also have this backtrace,
where it's freeing event_list:

  #0  0x00000000005c5787 in cull_hash_free_descr (descr=0x38003800000034)
      at ../libs/cull/cull_hash.c:642
  #1  0x00000000005c3526 in lFreeList (lp=0x2aaaab61e478)
      at ../libs/cull/cull_list.c:1217
  #2  0x00000000005c33ea in lFreeElem (ep1=0x47ff8af8)
      at ../libs/cull/cull_list.c:1153
  #3  0x00000000005c3f33 in lRemoveElem (lp=0x2aaaab665b00, ep1=0x47ff8af8)
      at ../libs/cull/cull_list.c:1796
  #4  0x00000000005c3547 in lFreeList (lp=0x47ff8dd8)
      at ../libs/cull/cull_list.c:1222
  #5  0x000000000043416e in sge_scheduler_main (arg=0x2aaaaaf23790)
      at ../daemons/qmaster/sge_thread_scheduler.c:663
  #6  0x0000003910406617 in start_thread () from /lib64/libpthread.so.0
  #7  0x000000390f8d3c2d in clone () from /lib64/libc.so.6
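
A quick way to triage pointer values like the descr argument in frame #0 is to check whether they are even canonical x86_64 addresses.  The helper below is a minimal sketch with a hypothetical name, demo_is_canonical48, and it assumes 48-bit virtual addressing, where bits 63..47 must be all zero or all one.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `addr` is canonical under 48-bit virtual addressing,
 * i.e. bits 63..47 are all zero or all one. */
bool demo_is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the 17 bits 63..47 */

    return top == 0 || top == 0x1FFFF;
}
```

Under that assumption, descr=0x38003800000034 is not canonical, so any dereference faults immediately, whereas 0x2aaa00000002 from the first backtrace is canonical but simply not a mapped address.  That at least suggests the descriptor slot was overwritten with non-pointer data, although the backtrace alone does not show by what.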

Change History (1)

comment:1 Changed 7 years ago by dlove

  • Resolution set to duplicate
  • Severity set to minor
  • Status changed from new to closed

Duplicate of IZ3216, fixed by [3511].
