[GE issues] [Issue 3251] New - Repeated qmaster SEGVs

fx d.love at liverpool.ac.uk
Mon Mar 15 17:15:51 GMT 2010


http://gridengine.sunsource.net/issues/show_bug.cgi?id=3251
                 Issue #|3251
                 Summary|Repeated qmaster SEGVs
               Component|gridengine
                 Version|6.2u5
                Platform|All
                     URL|
              OS/Version|All
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P2
            Subcomponent|qmaster
             Assigned to|ernst
             Reported by|fx






------- Additional comments from fx at sunsource.net Mon Mar 15 10:15:49 -0700 2010 -------
Others and I are seeing persistent qmaster SEGVs with no apparent
cause (e.g. see the thread at
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248192).
I'm currently seeing one every 1-2 hours.  They started without any
recent configuration change that could have triggered them and, as far
as I can tell, without any major change in the sort of jobs running.

So far I've managed to trap one instance under gdb and have dumped a
core file, but obviously can't keep that session going.  The following
is from 6.2u5 on x86_64 Red Hat 5 (although with a CentOS+ kernel).
The backtrace is

  #0  0x0000000000587b5d in lCopySwitchPack (sep=0x2aaaaaf1ac40, 
      dep=0x2aaac5461000, src_idx=6, dst_idx=6, isHash=true, ep=0x0, pb=0x0)
      at ../libs/cull/cull_list.c:393
  #1  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaaaf1ac40, isHash=true)
      at ../libs/cull/cull_list.c:180
  #2  0x0000000000587f35 in lCopyListHash (name=<value optimized out>, 
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #3  0x00000000005912ac in lSelectHashPack (name=0x2aaaab36cda0 "pe_tasks", 
      slp=0x2aaaab36e2c0, cp=0x0, enp=0x6, isHash=true, pb=0x0)
      at ../libs/cull/cull_db.c:833
  #4  0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>, 
      dep=0x2aaac5460dc0, src_idx=6, dst_idx=<value optimized out>, 
      isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
  #5  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370d80, isHash=true)
      at ../libs/cull/cull_list.c:180
  #6  0x0000000000587f35 in lCopyListHash (name=<value optimized out>, 
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #7  0x00000000005912ac in lSelectHashPack (name=0x2aaaab57cc10 "", 
      slp=0x2aaaab36e290, cp=0x0, enp=0x6, isHash=true, pb=0x0)
      at ../libs/cull/cull_db.c:833
  #8  0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>, 
      dep=0x2aaac5460940, src_idx=6, dst_idx=<value optimized out>, 
      isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
  #9  0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370b80, isHash=true)
      at ../libs/cull/cull_list.c:180
  #10 0x0000000000587f35 in lCopyListHash (name=<value optimized out>, 
      src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
  #11 0x0000000000433806 in sge_scheduler_main (arg=0x2aaaab2ed2a0)
      at ../daemons/qmaster/sge_thread_scheduler.c:791
  #12 0x0000003910406617 in start_thread () from /lib64/libpthread.so.0
  #13 0x000000390f8d3c2d in clone () from /lib64/libc.so.6
  
and the problem seems to be that glp (which holds the same value as
str, as the dump of the element shows) is bogus:

  (gdb) p sep->cont[src_idx]
  $7 = {fl = 2.80259693e-45, db = 2.3176438028923692e-310, ul = 2, 
    l = 46909632806914, c = 2 '\002', b = 2 '\002', i = 2, 
    str = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>, 
    glp = 0x2aaa00000002, obj = 0x2aaa00000002, ref = 0x2aaa00000002, 
    host = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>, 
    cp = 0x2aaa00000002}

in the else arm of

     case lListT:
        if ((tlp = sep->cont[src_idx].glp) == NULL) 
           dep->cont[dst_idx].glp = NULL;
        else {
           dep->cont[dst_idx].glp = lSelectHashPack(tlp->listname, tlp, NULL, 
                                                    ep, isHash, pb);
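
One possible reading of the dump above, offered as an illustration and
not a diagnosis: the heap pointers in the backtrace all begin with
0x2aaa, and the bogus value 0x2aaa00000002 has those same upper bits
with the lower 32 bits equal to 2 (the dump shows ul = 2 while l holds
the full 64-bit value).  That is the pattern a 32-bit store over a slot
that previously held a pointer would leave behind on little-endian
x86_64.  The sketch below reproduces the pattern.  The union
`demo_value` and the function `clobber_low_half` are hypothetical
stand-ins, not the real CULL definitions, and the result assumes a
little-endian machine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the value union shown in the gdb dump.
   The real CULL type is not reproduced here. */
typedef union {
    uint32_t ul;   /* 32-bit unsigned view of the slot */
    uint64_t raw;  /* stands in for the pointer members (str, glp, ...) */
} demo_value;

/* Store a 64-bit pointer-like bit pattern, then overwrite only the
   first four bytes of the slot with a 32-bit value.  On a little-endian
   machine those four bytes are the low half, so the high half of the
   old pointer survives. */
uint64_t clobber_low_half(uint64_t pointer_bits, uint32_t small_value)
{
    demo_value v;
    v.raw = pointer_bits;
    memcpy(&v, &small_value, sizeof small_value);
    return v.raw;
}
```

Calling clobber_low_half(0x2aaaab36e2c0, 2), with the slp address from
frame #3 used purely as a sample input, yields 0x2aaa00000002 on a
little-endian host.  This shows only that the observed value is
consistent with such a partial overwrite; it does not identify where in
qmaster one would happen.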

I haven't been able to look at it for long.  I'll try to investigate
further and add anything useful here.  Unfortunately this build uses
the default RPM optimization, which leaves various variables optimized
out, so I may need to rebuild it too.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=248772
