[GE issues] [Issue 2842] New - listener threads get stuck in cl_commlib_receive_message

tholzer tholzer at wetafx.co.nz
Wed Dec 17 21:21:29 GMT 2008


http://gridengine.sunsource.net/issues/show_bug.cgi?id=2842
                 Issue #|2842
                 Summary|listener threads get stuck in cl_commlib_receive_message
               Component|gridengine
                 Version|6.2u2
                Platform|PC
                     URL|
              OS/Version|Linux
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P2
            Subcomponent|communication
             Assigned to|crei
             Reported by|tholzer

------- Additional comments from tholzer at sunsource.net Wed Dec 17 13:21:27 -0800 2008 -------
We have started testing Grid Engine V62u2beta on a large cluster (2,500
execution hosts).

The scheduler takes too long in the job dispatch phase when a number of
execution hosts have become unreachable.

This is due to the synchronous nature of job dispatches. Once the connection
to an execution host is lost, a job dispatch to that host blocks for 60 seconds
(CL_DEFINE_SYNCHRON_RECEIVE_TIMEOUT) before it fails.

Once this happens to all listener threads (2 by default), the qmaster stops
responding.
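
A rough back-of-the-envelope estimate of how long the qmaster could stay
unresponsive, as a minimal sketch. The helper name and the model are
assumptions for illustration only (one pending dispatch per unreachable host,
each blocking one listener thread for the full timeout, processed back to
back); this is not Grid Engine code:

```c
#include <assert.h>

/* Hypothetical helper, not part of Grid Engine.
 * Assumes every unreachable host has exactly one pending dispatch and that
 * each dispatch blocks one listener thread for the full timeout. */
static unsigned long worst_case_stall_seconds(unsigned long unreachable_hosts,
                                              unsigned long listener_threads,
                                              unsigned long timeout_seconds)
{
    assert(listener_threads > 0);

    /* Number of rounds in which every listener is blocked at once:
     * ceil(unreachable_hosts / listener_threads). */
    unsigned long rounds =
        (unreachable_hosts + listener_threads - 1) / listener_threads;

    return rounds * timeout_seconds;
}
```

Under these assumptions, 10 unreachable hosts with the default 2 listener
threads and the 60 second timeout would give 5 rounds, i.e. about 300 seconds.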

In a large cluster like ours, hosts are constantly coming and going.

I also found this in the code comments:

   /* TODO: do trigger or not? depends on syncrhron
    * TODO: Remove synchron flag from this function, it is only used for
    *       get_event_list call in event client.
    *       event client code should be re-written, not to use this synchron
    *       flag set to false
    */

The following stack traces illustrate the condition (threads 4 and 3 are the
two listener threads, thread 2 is the scheduler thread):

Thread 4 (Thread 1160374592 (LWP 24654)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x4529deb0 "", component_name=0x4529def0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x4529ddd0, sender=0x4529ddc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0xa38100, rhost=0x4529deb0 "", commproc=0x4529def0 "", id=0x4529df30, pb=0x4529df40, tag=0x4529df34, synchron=1, for_request_mid=0, mid=0x4529df38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0xa38100, monitor=0x4529e040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3120) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 3 (Thread 1168767296 (LWP 24655)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x45a9eeb0 "", component_name=0x45a9eef0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x45a9edd0, sender=0x45a9edc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0x7f240c2c53c0, rhost=0x45a9eeb0 "", commproc=0x45a9eef0 "", id=0x45a9ef30, pb=0x45a9ef40, tag=0x45a9ef34, synchron=1, for_request_mid=0, mid=0x45a9ef38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0x7f240c2c53c0, monitor=0x45a9f040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3450) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 2 (Thread 1177160000 (LWP 24656)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000053b5d0 in sge_gdi_packet_wait_till_handled (packet=0x7f22e18407d0) at ../libs/gdi/sge_gdi_packet_internal.c:200
#2  0x000000000053cc8e in sge_gdi_packet_wait_for_result_internal (ctx=0xa392f0, answer_list=0x0, packet=0x4629faa0, malpp=0x4629fae0) at ../libs/gdi/sge_gdi_packet_internal.c:890
#3  0x0000000000532779 in sge_gdi2_wait (ctx=0xa392f0, alpp=0x0, malpp=0x4629fae0, state=0x7f22e1840600) at ../libs/gdi/sge_gdi2.c:497
#4  0x000000000043fac2 in sge_schedd_block_until_orders_processed (ctx=0xa392f0, answer_list=0x0) at ../daemons/qmaster/sge_sched_order.c:160
#5  0x00000000004337d5 in sge_scheduler_main (arg=0x7f240c2c2d40) at ../daemons/qmaster/sge_thread_scheduler.c:913
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=93040
