Opened 7 years ago

Closed 7 years ago

#1441 closed defect (fixed)

SoGE 8.1.2 qmaster segfault problem

Reported by: Andreas.Loong@…
Owned by: Dave Love <d.love@…>
Priority: normal
Milestone:
Component: sge
Version: 8.1.2
Severity: minor
Keywords:
Cc:

Description

I couldn't get the debug binaries to run properly, however another user
sent me the below output - I hope it's enough:


From: baf035 baf035@…
Sent: 2 November 2012 14:34
To: Loong, Andreas
Subject: Re: [gridengine users] SoGE 8.1.2 segfault problem


Hi,

I can validate described behaviour:
SoGE compiled with -debug,
qmaster server system:
cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1
~# uname -r
2.6.32.59-0.7-xen

SGE_ND=1 gdb -batch -ex run -ex 'bt full' sge_qmaster | tee sge_master_gdb2.log


Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: zypper install -C
"debuginfo(build-id)=c1807b5762068e6c5f4a6a0ed48d9d4469965351"
Missing separate debuginfo for /usr/lib64/libssl.so.0.9.8
Try: zypper install -C
"debuginfo(build-id)=d18ef9c9ddb90ed79b550ba6399c00874bc86345"
Missing separate debuginfo for /usr/lib64/libcrypto.so.0.9.8
Try: zypper install -C
"debuginfo(build-id)=abcd98fb64029fea0fc96116be5f178a429e63d5"
Missing separate debuginfo for /lib64/libdl.so.2
Try: zypper install -C
"debuginfo(build-id)=f607b21f9a513c99bba9539050c01236d19bf22b"
Missing separate debuginfo for /lib64/libm.so.6
Try: zypper install -C
"debuginfo(build-id)=4e9fa1a2c1141fc0123a142783efd044c40bdaaf"
Missing separate debuginfo for /lib64/libpthread.so.0
Try: zypper install -C
"debuginfo(build-id)=341d7c595fd2db49df98b8a6ae2c319f46b43c5b"
Missing separate debuginfo for /lib64/libc.so.6
Try: zypper install -C
"debuginfo(build-id)=9e0264386fde8570b215fd4c32465fdda3c1c996"
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /lib64/libz.so.1
Try: zypper install -C
"debuginfo(build-id)=4c05d1eb180f9c02b81a0c559c813dada91e0ca4"
[New Thread 0x7ffff69ff700 (LWP 31685)]
[New Thread 0x7ffff61fe700 (LWP 31686)]
[New Thread 0x7ffff59fd700 (LWP 31687)]
[New Thread 0x7ffff51fc700 (LWP 31688)]
Reading in Master_Job_List.

read job database with 0 entries in 0 seconds
error: error opening file
"/jms/spool/i005/sge_spool/qmaster/./sharetree" for reading: No such
file or directory
nr of dynamic event clients exceeds max file descriptor limit, setting
MAX_DYN_EC=979
qmaster hard descriptor limit is set to 8192
qmaster soft descriptor limit is set to 1024
qmaster will use max. 1004 file descriptors for communication
qmaster will accept max. 979 dynamic event clients
starting up SGE 8.1.3pre (lx-amd64)
Q:1, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3,
PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:277, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3,
PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:279, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3,
PR:9, RQS:0, AR:0, S:nd:0/lf:0


..

Q:281, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, CKPT:1, US:3,
PR:9, RQS:0, AR:0, S:nd:0/lf:0


Q:281, AQ:285 J:0(0), H:832(832), C:225, A:13, D:1, P:2, C[New Thread
0x7ffff41ff700 (LWP 31691)]
[New Thread 0x7ffff39fe700 (LWP 31692)]
[New Thread 0x7ffff31fd700 (LWP 31693)]
[New Thread 0x7ffff29fc700 (LWP 31694)]
[New Thread 0x7ffff21fb700 (LWP 31695)]
[New Thread 0x7ffff19fa700 (LWP 31696)]
[New Thread 0x7ffff11f9700 (LWP 31697)]
[New Thread 0x7ffff09f8700 (LWP 31698)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff19fa700 (LWP 31696)]
0x00000000004995c6 in do_gdi_packet (monitor=<optimized out>,
aMsg=<optimized out>, answer_list=<optimized out>, ctx=<optimized out>)
at ../daemons/qmaster/sge_qmaster_process_message.c:195
195 ../daemons/qmaster/sge_qmaster_process_message.c: No such file
or directory.

in ../daemons/qmaster/sge_qmaster_process_message.c

#0 0x00000000004995c6 in do_gdi_packet (monitor=<optimized out>,
aMsg=<optimized out>, answer_list=<optimized out>, ctx=<optimized out>)
at ../daemons/qmaster/sge_qmaster_process_message.c:195

packet = 0x0
local_ret = false
SGE_FUNC = "do_gdi_packet"

#1 sge_qmaster_process_message (ctx=0x7ffff42b6800, monitor=<optimized
out>) at ../daemons/qmaster/sge_qmaster_process_message.c:158

res = <optimized out>
msg = {snd_host = "sget5.hpc.domain.com", '\000' <repeats 40 times>,
snd_name = "qstat", '\000' <repeats 58 times>, snd_id = 637, tag = 2,
request_mid = 2, buf = {head_ptr = 0x7fffefc17800 "", cur_ptr =
0x7fffefc1783e "", mem_size = 1241, bytes_used = 62, just_count = 0,
version = 268566528}}

SGE_FUNC = "sge_qmaster_process_message"

#2 0x000000000042fb74 in sge_listener_main (arg=<optimized out>) at
../daemons/qmaster/sge_thread_listener.c:169

thread_config = 0x7ffff4693de0
monitor = {thread_name = 0x7ffff42a7360 "listener000",
monitor_time = 0, log_monitor_mes = false, output_line1 =
0x7ffff42ab1a0, output_line2 = 0x7ffff42ab1c0, work_line =
0x7ffff42ab1c0, pos = 6, now = {tv_sec = 0, tv_usec = 0}, output =
false, message_in_count = 0, message_out_count = 0, idle = 0, wait = 0,
ext_type = LIS_EXT, ext_data = 0x7ffff42a7370, ext_data_size = 16,
ext_output = 0x5c8ac0 <ext_lis_output>}

ctx = 0x7ffff42b6800
next_prof_output = 0
SGE_FUNC = "sge_listener_main"

#3 0x00007ffff717b6a6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007ffff6eeaf7d in clone () from /lib64/libc.so.6
No symbol table info available.
#5 0x0000000000000000 in ?? ()
No symbol table info available.


sge_qmaster died at random times, without significant load on the server:

[2012-10-31 18:42:03] sge_qmaster[8181]: segfault at 68 ip
00000000004995c6 sp 00007f40acbf9c50 error 6 in
sge_qmaster[400000+267000]
[2012-10-31 20:35:03] sge_qmaster[8895]: segfault at 68 ip
00000000004995c6 sp 00007fb8b8dfac50 error 6 in
sge_qmaster[400000+267000]
[2012-11-01 04:50:03] sge_qmaster[13225]: segfault at 68 ip
00000000004995c6 sp 00007f62cd4f9c50 error 6 in
sge_qmaster[400000+267000]
[2012-11-01 08:56:04] sge_qmaster[15224]: segfault at 68 ip
00000000004995c6 sp 00007f9b727f9c50 error 6 in
sge_qmaster[400000+267000]
[2012-11-01 10:56:02] sge_qmaster[16860]: segfault at 68 ip
00000000004995c6 sp 00007f0c183f9c50 error 6 in
sge_qmaster[400000+267000]

Bye

BaF035
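
A note for readers decoding the kernel lines above: on x86 Linux the "error" value in a "segfault at ..." message is the page-fault error code, which is a bit mask, and the address after "at" is printed in hex. The sketch below decodes the low three bits; the helper name describe_pf_error and the PFERR_* macro names are made up for illustration. For "error 6" it reports a user-mode write to a page that is not present. The ip 00000000004995c6 is the same address gdb shows for frame #0 in do_gdi_packet, and a write at address 0x68 would fit a store to a structure member through a NULL pointer, which agrees with "packet = 0x0" in the backtrace. The member offset is an inference from the log, not something checked against the source.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code as reported by the kernel. */
#define PFERR_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PFERR_WRITE 0x2u  /* 0: read access,      1: write access */
#define PFERR_USER  0x4u  /* 0: kernel mode,      1: user mode */

/* Hypothetical helper: formats a readable description of the code into buf. */
static const char *describe_pf_error(unsigned code, char *buf, size_t len)
{
    snprintf(buf, len, "%s-mode %s, %s",
             (code & PFERR_USER)  ? "user" : "kernel",
             (code & PFERR_WRITE) ? "write" : "read",
             (code & PFERR_PROT)  ? "protection violation" : "page not present");
    return buf;
}
```

Calling describe_pf_error(6, buf, sizeof buf) fills buf with "user-mode write, page not present".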



Confidentiality Notice: This message is private and may contain confidential and proprietary information. If you have received this message in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this message is not permitted and may be unlawful.

Change History (4)

comment:1 Changed 7 years ago by dlove

SGE <sge-bugs@…> writes:

> I couldn't get the debug binaries to run properly,

That sounds worrying in itself. What was the problem? Is it with the
rpms I made or a separate build?

> however another user
> sent me the below output - I hope it's enough:

Unfortunately not, but thanks for that info anyway. I can't immediately
see what the problem might be, at least with things optimized away and
maybe missing some stack. A version compiled with -no-opt might be more
helpful, but I'll see if I can guess anything when I'm more awake.

I just wonder what's different on the original system from all the
others which are running.

> From: baf035 baf035@…
> Sent: 2 November 2012 14:34
> To: Loong, Andreas
> Subject: Re: [gridengine users] SoGE 8.1.2 segfault problem
>
> Hi,
>
> I can validate described behaviour:
> SoGE compiled with -debug,
> qmaster server system:
> cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)

Is this a regression? It's more understandable failing on SuSE than on
Red Hat, which works fine for me and others. It seems to be
communicating with qstat; is the crash just triggered by that?

comment:2 Changed 7 years ago by Andreas.Loong@…

> Unfortunately not, but thanks for that info anyway. I can't immediately
> see what the problem might be, at least with things optimized away and
> maybe missing some stack. A version compiled with -no-opt might be more
> helpful, but I'll see if I can guess anything when I'm more awake.
>
> I just wonder what's different on the original system from all the
> others which are running.
>
> Is this a regression? It's more understandable failing on SuSE than on
> Red Hat, which works fine for me and others. It seems to be
> communicating with qstat; is the crash just triggered by that?

Yes, that seems to be a way to trigger the crash. I got the debug info
from my system here:

warning: no loadable sections found in added symbol-file system-supplied
DSO at 0x2aaaaaaab000
[Thread debugging using libthread_db enabled]
[New Thread 0x40a00940 (LWP 12258)]
[New Thread 0x41401940 (LWP 12259)]
[New Thread 0x41e02940 (LWP 12267)]
[New Thread 0x42803940 (LWP 12268)]
local configuration srvname.cluster not defined - using global
configuration
read job database with 1 entries in 0 seconds
nr of dynamic event clients exceeds max file descriptor limit, setting
MAX_DYN_EC=979
max dynamic event clients is set to 979
qmaster hard descriptor limit is set to 1024
qmaster soft descriptor limit is set to 1024
qmaster will use max. 1004 file descriptors for communication
qmaster will accept max. 979 dynamic event clients
starting up SGE 8.1.2 (lx-amd64)
[New Thread 0x43204940 (LWP 12269)]
[New Thread 0x43c05940 (LWP 12270)]
[New Thread 0x44606940 (LWP 12271)]
2 worker threads are enabled
[New Thread 0x45007940 (LWP 12272)]
[New Thread 0x45a08940 (LWP 12273)]
2 listener threads are enabled
[New Thread 0x46409940 (LWP 12274)]
[New Thread 0x46e0a940 (LWP 12275)]
[New Thread 0x4780b940 (LWP 12276)]
"scheduler" registers as event client with id 1 event delivery interval
10
sge_clab2dev@… added "scheduler" to event client list
using "default" as algorithm
using "0:0:30" for schedule_interval
using "0:0:0" for load_adjustment_decay_time
using "mem_total" for load_formula
using "true" for schedd_job_info
using param: "none"
using "0:0:0" for reprioritize_interval
using "cpu=0.75,mem=0.25,io=0" for usage_weight_list
using "none" for halflife_decay_list
using "OFS" for policy_hierarchy
using "NONE" for job_load_adjustments
using 0 for maxujobs
using 0 for queue_sort_method
using 1 for flush_submit_sec
using 1 for flush_finish_sec
using 144 for halftime
using 5 for compensation_factor
using 0.25 for weight_user
using 0.25 for weight_project
using 0.25 for weight_department
using 0.25 for weight_job
using 10000 for weight_tickets_functional
using 100000 for weight_tickets_share
using 1 for share_override_tickets
using 1 for share_functional_shares
using 200 for max_functional_jobs_to_schedule
using 1 for report_pjob_tickets
using 50 for max_pending_tasks_per_job
using 0.5 for weight_ticket
using 0.075 for weight_waiting_time
using 3.6e+06 for weight_deadline
using 0.5 for weight_urgency
using 1 for weight_priority
using 100 for max_reservation
Q:0, AQ:28 J:1(1), H:17(17), C:84, A:7, D:1, P:17, CKPT:0, US:64, PR:3,
RQS:0, AR:0, S:nd:0/lf:0
this was before utilization_normalize()
this was before utilization_normalize()


scheduler has been started
start of jvm thread is disabled in bootstrap file
qmaster startup took 1 seconds
execd on cla-004.cluster registered
execd on cla-001.cluster registered
execd on cla-009.cluster registered
execd on cla-010.cluster registered
execd on cla-003.cluster registered
execd on cla-014.cluster registered
execd on cla-002.cluster registered
execd on cla-006.cluster registered
execd on cla-008.cluster registered
execd on cla-013.cluster registered
execd on cla-012.cluster registered
execd on cla-005.cluster registered

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x46e0a940 (LWP 12275)]
0x000000000049e49c in do_gdi_packet (ctx=0x2aaaaf31dc00, monitor=<value
optimized out>) at ../daemons/qmaster/sge_qmaster_process_message.c:195
195 packet->host = sge_strdup(NULL, aMsg->snd_host);
#0 0x000000000049e49c in do_gdi_packet (ctx=0x2aaaaf31dc00,
monitor=<value optimized out>) at
../daemons/qmaster/sge_qmaster_process_message.c:195

packet = 0x0
local_ret = false
SGE_FUNC = "do_gdi_packet"

#1 sge_qmaster_process_message (ctx=0x2aaaaf31dc00, monitor=<value
optimized out>) at ../daemons/qmaster/sge_qmaster_process_message.c:158

res = <value optimized out>
msg = {snd_host = "health-monitoring.domain", '\000' <repeats 31 times>,
snd_name = "qstat", '\000' <repeats 58 times>, snd_id = 1, tag = 2,
request_mid = 1, buf = {head_ptr = 0x2aaaaf543000 "", cur_ptr =
0x2aaaaf54308b "", mem_size = 2728, bytes_used = 139, just_count = 0,
version = 268566528}}

SGE_FUNC = "sge_qmaster_process_message"

#2 0x000000000043060e in sge_listener_main (arg=<value optimized out>)
at ../daemons/qmaster/sge_thread_listener.c:169

thread_config = 0x2aaaaab3f9c0
monitor = {thread_name = 0x2aaaaabf7de0 "listener001",
monitor_time = 0, log_monitor_mes = false, output_line1 =
0x2aaaaf30b600, output_line2 = 0x2aaaaf30b620,
work_line = 0x2aaaaf30b620, pos = 7, now = {tv_sec = 0,
tv_usec = 0}, output = false, message_in_count = 0, message_out_count =
0, idle = 0, wait = 0,
ext_type = LIS_EXT, ext_data = 0x2aaaaabf7dc0, ext_data_size =
16, ext_output = 0x5cd280 <ext_lis_output>}

ctx = 0x2aaaaf31dc00
next_prof_output = 0
SGE_FUNC = "sge_listener_main"

#3 0x000000331480677d in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x0000003313cd3c1d in clone () from /lib64/libc.so.6

The system you see running qstat here has an older version of qstat,
6.2u5. Is there anything else I can do?

Andreas



comment:3 Changed 7 years ago by dlove

You wrote:

> The system you see running qstat here has an older version of qstat,
> 6.2u5. Is there anything else I can do?

So is the crash (simply) due to talking to a v8 qmaster with a v6
client? If so, then it should clearly produce an error message, and
obviously not crash, but I thought I'd seen it do exactly that. (You
can't mix v6 and v8 components, like you couldn't generally mix
different v6 releases.)

However, I seem to remember there's an old comment somewhere about
"should detect" that sort of thing, or words to that effect. If that is
the situation, I still don't see why changing the NSS database affects
it, but I can probably make sure it doesn't crash.
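
To illustrate the "should detect" idea, here is a rough sketch of rejecting a request from a mismatched client with a readable error instead of processing it. Everything in it is an assumption for illustration: check_client_version is a hypothetical helper, not an SGE function, and EXAMPLE_SERVER_VERSION is a placeholder that simply reuses the number shown as "version = 268566528" (0x10020000) in the backtraces, without any claim about what that field means in the real protocol.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder only: the real version constants are not shown in this ticket. */
#define EXAMPLE_SERVER_VERSION 0x10020000u

/*
 * Hypothetical check: returns true when the client's version tag matches the
 * server's. On mismatch it writes a diagnostic into err and returns false, so
 * the caller can answer with an error rather than handle the request.
 */
static bool check_client_version(uint32_t client_version, const char *host,
                                 const char *prog, char *err, size_t err_len)
{
    if (client_version == EXAMPLE_SERVER_VERSION) {
        return true;
    }
    snprintf(err, err_len,
             "rejecting request from %s on %s: client version 0x%08x "
             "does not match server version 0x%08x",
             prog, host, (unsigned)client_version,
             (unsigned)EXAMPLE_SERVER_VERSION);
    return false;
}
```

Whether the real protocol exposes a usable version tag before unpacking fails is not something this ticket establishes, so treat the comparison as a stand-in for whatever detection is actually possible.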

comment:4 Changed 7 years ago by Dave Love <d.love@…>

  • Owner set to Dave Love <d.love@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4383/sge:

Fix #1441: avoid qmaster crash after failing to unpack packet
Seen with v6 qstat client.
Still needs improved diagnostic.
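
For readers without the repository to hand, the general shape of such a fix can be sketched as below. This is not the change made in 4383/sge. The types and functions (example_packet_t, example_msg_t, example_unpack, example_do_packet) are hypothetical stand-ins, and the "unexpected first byte" rule is invented purely so the unpack step has a way to fail. The point is only that the packet pointer is checked before the assignment that crashed at sge_qmaster_process_message.c:195 with packet = 0x0.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the qmaster types; not the real SGE definitions. */
typedef struct {
    char *host;
} example_packet_t;

typedef struct {
    const char *snd_host;
    const unsigned char *data;
    size_t len;
} example_msg_t;

/* Pretend unpacker: fails, leaving *packet NULL, on an unexpected first byte. */
static bool example_unpack(const example_msg_t *msg, example_packet_t **packet)
{
    *packet = NULL;
    if (msg->len == 0 || msg->data[0] != 0x08) {
        return false;
    }
    *packet = calloc(1, sizeof **packet);
    return *packet != NULL;
}

/* Portable string copy; returns NULL if allocation fails. */
static char *example_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Returns false instead of dereferencing a NULL packet. */
static bool example_do_packet(const example_msg_t *msg, example_packet_t **out)
{
    example_packet_t *packet = NULL;

    *out = NULL;
    if (!example_unpack(msg, &packet) || packet == NULL) {
        return false;               /* caller can log a diagnostic and drop it */
    }
    packet->host = example_strdup(msg->snd_host);  /* packet is non-NULL here */
    if (packet->host == NULL) {
        free(packet);
        return false;
    }
    *out = packet;
    return true;
}
```

With this structure a message that cannot be unpacked yields a false return and a NULL output pointer, which leaves room for the improved diagnostic mentioned in the commit message.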
