Opened 11 years ago
Closed 10 years ago
#789 closed defect (duplicate)
IZ3251: Repeated qmaster SEGVs
Reported by: | fx | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | qmaster |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3251]
Issue #: 3251
Platform: All
Reporter: fx (fx)
Component: gridengine
OS: All
Subcomponent: qmaster
Version: 6.2u5
CC: None defined
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: Repeated qmaster SEGVs
Status whiteboard:
Attachments:
Issue 3251 blocks:
Votes for issue 3251: 38
Opened: Mon Mar 15 10:15:00 -0700 2010
------------------------

I and others are seeing persistent SEGVs of qmaster for no apparent reason (e.g. see the thread at http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248192). I'm currently seeing one every 1-2 hours and they started without any recent changes to the configuration to trigger them, or any major change in the sort of jobs running as far as I can tell.

So far I've managed to trap an instance under gdb and have dumped a core file, but obviously can't keep that session going. The following is with 6.2u5 on x86_64 RedHat 5 (although with a CentOS+ kernel).

The backtrace is

#0 0x0000000000587b5d in lCopySwitchPack (sep=0x2aaaaaf1ac40, dep=0x2aaac5461000, src_idx=6, dst_idx=6, isHash=true, ep=0x0, pb=0x0) at ../libs/cull/cull_list.c:393
#1 0x0000000000587e4c in lCopyElemHash (ep=0x2aaaaaf1ac40, isHash=true) at ../libs/cull/cull_list.c:180
#2 0x0000000000587f35 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
#3 0x00000000005912ac in lSelectHashPack (name=0x2aaaab36cda0 "pe_tasks", slp=0x2aaaab36e2c0, cp=0x0, enp=0x6, isHash=true, pb=0x0) at ../libs/cull/cull_db.c:833
#4 0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>, dep=0x2aaac5460dc0, src_idx=6, dst_idx=<value optimized out>, isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
#5 0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370d80, isHash=true) at ../libs/cull/cull_list.c:180
#6 0x0000000000587f35 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
#7 0x00000000005912ac in lSelectHashPack (name=0x2aaaab57cc10 "", slp=0x2aaaab36e290, cp=0x0, enp=0x6, isHash=true, pb=0x0) at ../libs/cull/cull_db.c:833
#8 0x0000000000587b74 in lCopySwitchPack (sep=<value optimized out>, dep=0x2aaac5460940, src_idx=6, dst_idx=<value optimized out>, isHash=true, ep=0x6, pb=0x0) at ../libs/cull/cull_list.c:393
#9 0x0000000000587e4c in lCopyElemHash (ep=0x2aaaab370b80, isHash=true) at ../libs/cull/cull_list.c:180
#10 0x0000000000587f35 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1580
#11 0x0000000000433806 in sge_scheduler_main (arg=0x2aaaab2ed2a0) at ../daemons/qmaster/sge_thread_scheduler.c:791
#12 0x0000003910406617 in start_thread () from /lib64/libpthread.so.0
#13 0x000000390f8d3c2d in clone () from /lib64/libc.so.6

and the problem seems to be that glp (== str) is bogus:

(gdb) p sep->cont[src_idx]
$7 = {fl = 2.80259693e-45, db = 2.3176438028923692e-310, ul = 2, l = 46909632806914, c = 2 '\002', b = 2 '\002', i = 2, str = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>, glp = 0x2aaa00000002, obj = 0x2aaa00000002, ref = 0x2aaa00000002, host = 0x2aaa00000002 <Address 0x2aaa00000002 out of bounds>, cp = 0x2aaa00000002}

in the else arm of

case lListT:
  if ((tlp = sep->cont[src_idx].glp) == NULL)
    dep->cont[dst_idx].glp = NULL;
  else {
    dep->cont[dst_idx].glp = lSelectHashPack(tlp->listname, tlp, NULL, ep, isHash, pb);

I haven't been able to look at it for long. I'll try to do some more investigation and add information here if I get anything useful. Unfortunately this is a build with default RPM optimization which has various variables optimized out, so I may need to rebuild it too.

------- Additional comments from fx Tue Mar 16 08:18:26 -0700 2010 -------

I haven't been able to get to grips with the code to debug this yet, but the crash is always at that location in ~10 core dumps on our system. I think someone else reported SEGVs which weren't consistent like that.

------- Additional comments from fx Fri Mar 19 09:56:15 -0700 2010 -------

In fact, there is at least one other failure mode. I also have this backtrace, where it's freeing event_list:

#0 0x00000000005c5787 in cull_hash_free_descr (descr=0x38003800000034) at ../libs/cull/cull_hash.c:642
#1 0x00000000005c3526 in lFreeList (lp=0x2aaaab61e478) at ../libs/cull/cull_list.c:1217
#2 0x00000000005c33ea in lFreeElem (ep1=0x47ff8af8) at ../libs/cull/cull_list.c:1153
#3 0x00000000005c3f33 in lRemoveElem (lp=0x2aaaab665b00, ep1=0x47ff8af8) at ../libs/cull/cull_list.c:1796
#4 0x00000000005c3547 in lFreeList (lp=0x47ff8dd8) at ../libs/cull/cull_list.c:1222
#5 0x000000000043416e in sge_scheduler_main (arg=0x2aaaaaf23790) at ../daemons/qmaster/sge_thread_scheduler.c:663
#6 0x0000003910406617 in start_thread () from /lib64/libpthread.so.0
#7 0x000000390f8d3c2d in clone () from /lib64/libc.so.6
Change History (1)
comment:1 Changed 10 years ago by dlove
- Resolution set to duplicate
- Severity set to minor
- Status changed from new to closed
Duplicate of IZ3216, fixed by [3511].