Opened 12 years ago
Last modified 10 years ago
#615 new defect
IZ2842: listener threads get stuck in cl_commlib_receive_message
Reported by: | tholzer | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | sge | Version: | 6.2u2 |
Severity: | | Keywords: | PC Linux qmaster |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2842]
Issue #: | 2842 | Platform: | PC
---|---|---|---
Reporter: | tholzer (tholzer) | Component: | gridengine
OS: | Linux | Subcomponent: | qmaster
Version: | 6.2u2 | CC: | None defined
Status: | NEW | Priority: | P2
Resolution: | | Issue type: | DEFECT
Target milestone: | --- | Assigned to: | crei (crei)
QA Contact: | ernst | URL: |
Summary: | listener threads get stuck in cl_commlib_receive_message | Status whiteboard: |

Attachments:

Date/filename: | Description: | Submitted by:
---|---|---
Sun Dec 21 20:53:00 -0700 2008: deps.tar.gz | hold_jids (application/x-gzip) | tholzer

Issue 2842 blocks:
Votes for issue 2842:

Opened: Wed Dec 17 14:21:00 -0700 2008
------------------------

We have started testing (V62u2beta) on a large cluster (2,500 execution hosts). We are experiencing a problem with the scheduler taking too long in the job dispatching phase when a number of execution hosts have become unreachable. This is due to the synchronous nature of job dispatches: once we have lost the connection to an execution host, a job dispatch to it waits 60 seconds (CL_DEFINE_SYNCHRON_RECEIVE_TIMEOUT) before it fails. Once this happens to all listener threads (2 by default), the qmaster stops responding. In a large cluster like ours there are constantly hosts coming & going.

I've also found this in the comments:

```
/* TODO: do trigger or not? depends on syncrhron
 * TODO: Remove synchron flag from this function, it is only used for
 *       get_event_list call in event client. event client code should be
 *       re-written, not to use this synchron flag set to false
 */
```
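To make the failure mode concrete, here is a small self-contained C sketch of the blocking pattern. This is an illustration only, not the actual commlib source; the 1-second wait slice and the 60-second timeout are taken from the description above and the stack trace below.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Illustration only -- mimics the pattern visible in the stack trace:
 * a synchronous receive sleeping on a condition variable in 1-second
 * slices until a message arrives or the 60-second
 * CL_DEFINE_SYNCHRON_RECEIVE_TIMEOUT expires. A listener thread stuck
 * here for a dead execd serves no other requests, and with only 2
 * listener threads two unreachable hosts freeze the qmaster. */
#define SYNCHRON_RECEIVE_TIMEOUT 60   /* CL_DEFINE_SYNCHRON_RECEIVE_TIMEOUT */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             message_ready;    /* set by the communication thread */
} receive_state_t;

/* Returns 0 once a message is ready, -1 on timeout. */
static int synchron_receive(receive_state_t *st) {
    int elapsed = 0;
    pthread_mutex_lock(&st->mutex);
    while (!st->message_ready && elapsed < SYNCHRON_RECEIVE_TIMEOUT) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;                 /* 1-second wait slice */
        pthread_cond_timedwait(&st->cond, &st->mutex, &deadline);
        elapsed++;                            /* approximate bookkeeping */
    }
    int result = st->message_ready ? 0 : -1;
    pthread_mutex_unlock(&st->mutex);
    return result;
}

int main(void) {
    receive_state_t st = { PTHREAD_MUTEX_INITIALIZER,
                           PTHREAD_COND_INITIALIZER, 0 };
    /* No sender: this blocks for the full 60 seconds, exactly what a
     * listener thread does while an execution host is unreachable. */
    printf("receive returned %d\n", synchron_receive(&st));
    return 0;
}
```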
The following stack trace illustrates the condition:

```
Thread 4 (Thread 1160374592 (LWP 24654)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x4529deb0 "", component_name=0x4529def0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x4529ddd0, sender=0x4529ddc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0xa38100, rhost=0x4529deb0 "", commproc=0x4529def0 "", id=0x4529df30, pb=0x4529df40, tag=0x4529df34, synchron=1, for_request_mid=0, mid=0x4529df38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0xa38100, monitor=0x4529e040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3120) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 3 (Thread 1168767296 (LWP 24655)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x45a9eeb0 "", component_name=0x45a9eef0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x45a9edd0, sender=0x45a9edc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0x7f240c2c53c0, rhost=0x45a9eeb0 "", commproc=0x45a9eef0 "", id=0x45a9ef30, pb=0x45a9ef40, tag=0x45a9ef34, synchron=1, for_request_mid=0, mid=0x45a9ef38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0x7f240c2c53c0, monitor=0x45a9f040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3450) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 2 (Thread 1177160000 (LWP 24656)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000053b5d0 in sge_gdi_packet_wait_till_handled (packet=0x7f22e18407d0) at ../libs/gdi/sge_gdi_packet_internal.c:200
#2  0x000000000053cc8e in sge_gdi_packet_wait_for_result_internal (ctx=0xa392f0, answer_list=0x0, packet=0x4629faa0, malpp=0x4629fae0) at ../libs/gdi/sge_gdi_packet_internal.c:890
#3  0x0000000000532779 in sge_gdi2_wait (ctx=0xa392f0, alpp=0x0, malpp=0x4629fae0, state=0x7f22e1840600) at ../libs/gdi/sge_gdi2.c:497
#4  0x000000000043fac2 in sge_schedd_block_until_orders_processed (ctx=0xa392f0, answer_list=0x0) at ../daemons/qmaster/sge_sched_order.c:160
#5  0x00000000004337d5 in sge_scheduler_main (arg=0x7f240c2c2d40) at ../daemons/qmaster/sge_thread_scheduler.c:913
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6
```

------- Additional comments from tholzer Wed Dec 17 19:17:54 -0700 2008 -------

This seems to happen when submitting jobs with a large number of -hold_jid job ids (>12,000) specified via DRMAA's native specification attribute.

------- Additional comments from crei Thu Dec 18 01:47:41 -0700 2008 -------

tholzer, can you please test whether a

```
qping -info YOUR_MASTER_HOST $SGE_QMASTER_PORT qmaster 1
```

gets an answer or not. If qping is working, the communication layer is working and the problem is in the GDI threads. Additional information is also available when starting the qmaster in debug mode:

```
> setenv SGE_ND 1               -> do not daemonize
> setenv SGE_COMMLIB_DEBUG 2    -> get communication logging (or level 3)
> $SGE_ROOT/bin/$ARCH/sge_qmaster
```

------- Additional comments from andreas Thu Dec 18 04:13:20 -0700 2008 -------

I have failed to reproduce this case so far. What I understand from your description is that the qmaster gets stuck when the synchronous job delivery in the qmaster fails two times after the execd became unreachable. Have you tried making job delivery asynchronous? Doing so would be relatively easy: just pass '1' instead of '0' as the second parameter to gdi2_send_message_pb() in the function send_job().
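For readers following along, a toy C sketch of the behavioural difference that flag flip aims at; the names and signature below are illustrative placeholders, not the real gdi2_send_message_pb() prototype from the GDI sources:

```c
#include <stdbool.h>

/* Toy model only -- the real gdi2_send_message_pb() has a different
 * signature. This just illustrates what the suggested change means:
 * queue the message and return at once instead of blocking until the
 * execution host acknowledges (or the 60-second timeout fires). */
typedef struct message message_t;

static bool enqueue_for_delivery(message_t *m) { (void)m; return true; }

static bool wait_for_ack(message_t *m, int timeout_sec) {
    (void)m; (void)timeout_sec;
    return true;  /* stub: a real implementation would block here */
}

static bool toy_send_message(message_t *m, int asynchronous) {
    if (!enqueue_for_delivery(m))    /* hand off to the comm thread    */
        return false;
    if (!asynchronous)               /* synchronous: block until acked */
        return wait_for_ack(m, 60);  /* or until the 60 s timeout      */
    return true;                     /* asynchronous: return at once   */
}

int main(void) {
    /* With asynchronous != 0 an unreachable execd costs the caller
     * nothing; with 0 the calling thread is pinned for up to 60 s. */
    return toy_send_message(0, 1) ? 0 : 1;
}
```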
------- Additional comments from andreas Thu Dec 18 05:43:17 -0700 2008 -------

Somehow you confused me: actually, job delivery from qmaster to execd is already asynchronous. So I'm curious what you mean by the "synchronous nature of job dispatches"?

------- Additional comments from tholzer Thu Dec 18 14:45:14 -0700 2008 -------

I had qping running as well, and it eventually stopped responding too. I've also put a line into sge_gdi2_get_any_request (sge_gdi2.c) to set synchron = 0 in all cases. This made matters worse, as something then seemed to be trampling over the call stack of the listener thread (?? frames in gdb). I'm trying to prepare a standalone reproducible test case and will attach it here as soon as I have it ready.

------- Additional comments from tholzer Thu Dec 18 19:44:34 -0700 2008 -------

After I see the following in the log, the qmaster stops responding to GDI requests and memory usage climbs steadily until the limit (48 GB) has been reached (usually within 5-10 minutes).

```
12/19/2008 15:35:55|worker|yori|E|unable to find job 56 from the scheduler order package
12/19/2008 15:35:55|worker|yori|W|Skipping remaining 219 orders
12/19/2008 15:35:55|schedu|yori|E|unable to find job 56 from the scheduler order package
12/19/2008 15:35:56|worker|yori|E|scheduler tries to schedule job 1416.1 twice
12/19/2008 15:35:56|worker|yori|W|Skipping remaining 219 orders
12/19/2008 15:35:56|schedu|yori|E|scheduler tries to schedule job 1416.1 twice
12/19/2008 15:37:03|worker|yori|E|unable to find job 22 from the scheduler order package
12/19/2008 15:37:03|worker|yori|W|Skipping remaining 230 orders
12/19/2008 15:37:03|schedu|yori|E|unable to find job 22 from the scheduler order package
12/19/2008 15:37:05|schedu|yori|E|could not find job "56" in master list
12/19/2008 15:37:05|schedu|yori|E|callback function for event "29632. EVENT DEL JOB 56.1" failed
12/19/2008 15:38:12|schedu|yori|E|could not find job "22" in master list
12/19/2008 15:38:12|schedu|yori|E|callback function for event "34758. EVENT DEL JOB 22.1" failed
```

I'm still putting together the dependency tree for the jobs.

------- Additional comments from tholzer Sun Dec 21 20:53:39 -0700 2008 -------

Created an attachment (id=186): hold_jids

------- Additional comments from tholzer Sun Dec 21 20:59:08 -0700 2008 -------

Hi, I've attached two text files, bad.txt & good.txt. They specify the -hold_jid lists for a series of jobs (starting with job id 1 in the first line, which has no dependencies). Submit the jobs via DRMAA:

```
command              = /bin/true
native specification = -o /dev/null -j y -b y -shell n -w n -wd /tmp
                       + hold_jid line from the file
```

The file bad.txt contains an additional 21 jobs with a large dependency list; these send the qmaster into a spin. If these 21 jobs are removed (good.txt), everything is fine.

------- Additional comments from crei Wed Jan 7 08:26:21 -0700 2009 -------

Changed subcomponent. It turns out that this issue is more related to job dependencies (or DRMAA).
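For anyone trying to reconstruct the test without the attachments, a minimal C sketch of the DRMAA submission recipe tholzer describes above; the hold_jid ids shown are placeholders for one line of bad.txt/good.txt:

```c
#include <stdio.h>
#include "drmaa.h"

/* Minimal reconstruction of the reproduction recipe: one job per line
 * of the attached files, each with command /bin/true and the native
 * specification from the comment above plus that line's -hold_jid list
 * (placeholder ids here). */
int main(void) {
    char error[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;

    if (drmaa_init(NULL, error, sizeof(error)) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init failed: %s\n", error);
        return 1;
    }
    drmaa_allocate_job_template(&jt, error, sizeof(error));
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/true",
                        error, sizeof(error));
    drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
                        "-o /dev/null -j y -b y -shell n -w n -wd /tmp"
                        " -hold_jid 1,2,3",   /* placeholder dependency list */
                        error, sizeof(error));
    if (drmaa_run_job(jobid, sizeof(jobid), jt,
                      error, sizeof(error)) == DRMAA_ERRNO_SUCCESS) {
        printf("submitted job %s\n", jobid);
    }
    drmaa_delete_job_template(jt, error, sizeof(error));
    drmaa_exit(error, sizeof(error));
    return 0;
}
```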