Opened 8 years ago

Last modified 7 years ago

#615 new defect

IZ2842: listener threads get stuck in cl_commlib_receive_message

Reported by: tholzer Owned by:
Priority: high Milestone:
Component: sge Version: 6.2u2
Severity: Keywords: PC Linux qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2842]

      Issue #:          2842                 Platform:          PC
      Component:        gridengine           OS:                Linux
      Subcomponent:     qmaster              Version:           6.2u2
      Status:           NEW                  Priority:          P2
      Resolution:                            Issue type:        DEFECT
      Reporter:         tholzer (tholzer)    Target milestone:  ---
      Assigned to:      crei (crei)          CC:                None defined
      QA Contact:       ernst
      URL:
      Summary:          listener threads get stuck in cl_commlib_receive_message
      Status whiteboard:
      Attachments:
                        Date/filename:                               Description:                     Submitted by:
                        Sun Dec 21 20:53:00 -0700 2008: deps.tar.gz  hold_jids (application/x-gzip)   tholzer

     Issue 2842 blocks:
   Votes for issue 2842:


   Opened: Wed Dec 17 14:21:00 -0700 2008 
------------------------


We have started testing (6.2u2 beta) on a large cluster (2,500 execution hosts).

We are experiencing a problem with the scheduler taking too long in the job
dispatching phase when a number of execution hosts have become unreachable.

This is due to the synchronous nature of job dispatches. Once we have lost the
connection to an execution host, a job dispatch to that host blocks for 60
seconds (CL_DEFINE_SYNCHRON_RECEIVE_TIMEOUT) before it fails.

Once this happens to all listener threads (2 by default), the qmaster stops
responding.

In a large cluster like ours, there are constantly hosts coming & going.

I've also found this in the comments:

   /* TODO: do trigger or not? depends on syncrhron
    * TODO: Remove synchron flag from this function, it is only used for
    *       get_event_list call in event client. event client code should
    *       be re-written, not to use this synchron flag set to false
    */

The following stack trace illustrates the condition:

Thread 4 (Thread 1160374592 (LWP 24654)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x4529deb0 "", component_name=0x4529def0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x4529ddd0, sender=0x4529ddc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0xa38100, rhost=0x4529deb0 "", commproc=0x4529def0 "", id=0x4529df30, pb=0x4529df40, tag=0x4529df34, synchron=1, for_request_mid=0, mid=0x4529df38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0xa38100, monitor=0x4529e040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3120) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 3 (Thread 1168767296 (LWP 24655)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005fc640 in cl_thread_wait_for_thread_condition (condition=0x8d0560, sec=1, micro_sec=0) at ../libs/comm/lists/cl_thread.c:259
#2  0x00000000005ecf70 in cl_commlib_receive_message (handle=0x8b86f0, un_resolved_hostname=0x45a9eeb0 "", component_name=0x45a9eef0 "", component_id=0, synchron=CL_TRUE, response_mid=0, message=0x45a9edd0, sender=0x45a9edc8) at ../libs/comm/cl_commlib.c:4991
#3  0x0000000000532f91 in sge_gdi2_get_any_request (ctx=0x7f240c2c53c0, rhost=0x45a9eeb0 "", commproc=0x45a9eef0 "", id=0x45a9ef30, pb=0x45a9ef40, tag=0x45a9ef34, synchron=1, for_request_mid=0, mid=0x45a9ef38) at ../libs/gdi/sge_gdi2.c:681
#4  0x00000000004a78cb in sge_qmaster_process_message (ctx=0x7f240c2c53c0, monitor=0x45a9f040) at ../daemons/qmaster/sge_qmaster_process_message.c:142
#5  0x000000000042cee7 in sge_listener_main (arg=0x7f240c2c3450) at ../daemons/qmaster/sge_thread_listener.c:196
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

Thread 2 (Thread 1177160000 (LWP 24656)):
#0  0x0000003a8760a697 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000053b5d0 in sge_gdi_packet_wait_till_handled (packet=0x7f22e18407d0) at ../libs/gdi/sge_gdi_packet_internal.c:200
#2  0x000000000053cc8e in sge_gdi_packet_wait_for_result_internal (ctx=0xa392f0, answer_list=0x0, packet=0x4629faa0, malpp=0x4629fae0) at ../libs/gdi/sge_gdi_packet_internal.c:890
#3  0x0000000000532779 in sge_gdi2_wait (ctx=0xa392f0, alpp=0x0, malpp=0x4629fae0, state=0x7f22e1840600) at ../libs/gdi/sge_gdi2.c:497
#4  0x000000000043fac2 in sge_schedd_block_until_orders_processed (ctx=0xa392f0, answer_list=0x0) at ../daemons/qmaster/sge_sched_order.c:160
#5  0x00000000004337d5 in sge_scheduler_main (arg=0x7f240c2c2d40) at ../daemons/qmaster/sge_thread_scheduler.c:913
#6  0x0000003a876062f7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000030cbece85d in clone () from /lib64/libc.so.6

   ------- Additional comments from tholzer Wed Dec 17 19:17:54 -0700 2008 -------
This seems to happen when submitting jobs with a large number of -hold_jid job
ids (>12,000) specified via DRMAA's native specification attribute.


   ------- Additional comments from crei Thu Dec 18 01:47:41 -0700 2008 -------
tholzer,

can you please test whether

   qping -info YOUR_MASTER_HOST $SGE_QMASTER_PORT qmaster 1

gets an answer or not? If qping is working, the communication layer is working
and the problem is in the GDI threads.

Additional information is also available when starting qmaster in debug mode:

   setenv SGE_ND 1               -> do not daemonize
   setenv SGE_COMMLIB_DEBUG 2    -> get communication logging (level 3 for more detail)
   $SGE_ROOT/bin/$ARCH/sge_qmaster

   ------- Additional comments from andreas Thu Dec 18 04:13:20 -0700 2008 -------
I have failed to reproduce this case so far.

What I understand from your description is that qmaster gets stuck when the
synchronous job delivery in qmaster fails twice after the execd became
unreachable.

Have you tried making job delivery asynchronous?

Doing so would be relatively easy: just pass '1' instead of '0' as the second
parameter to gdi2_send_message_pb() in function send_job().
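(For illustration only: the suggested change in send_job() would look roughly
like the sketch below. Everything except the function name and the flipped
second parameter is a placeholder, not the actual SGE call site.)

   /* Sketch of the suggested change in send_job(): flip the second
    * argument of gdi2_send_message_pb() from 0 to 1.  All other
    * arguments shown here are placeholders, not the real call site. */
   ret = gdi2_send_message_pb(ctx, 1 /* was 0 */, commproc, id, rhost,
                              tag, &pb, &mid);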




   ------- Additional comments from andreas Thu Dec 18 05:43:17 -0700 2008 -------
Somehow you confused me:

Actually, job delivery from qmaster to execd is already asynchronous.

So I'm curious what you mean by the "synchronous nature of job dispatches"?

   ------- Additional comments from tholzer Thu Dec 18 14:45:14 -0700 2008 -------
I had qping running as well and that stopped responding eventually. I've also
put a line into sge_gdi2_get_any_request (sge_gdi2.c) to set synchron = 0 in all
cases.
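(Roughly, the change described here amounts to forcing the flag off inside
sge_gdi2_get_any_request() in libs/gdi/sge_gdi2.c; sketch only, the exact
placement within the function is not shown in this report.)

   /* Experiment from the comment above: never block in the receive,
    * regardless of what the caller asked for. */
   synchron = 0;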

This made matters worse, as something seemed to be trampling over the call
stack of the listener thread afterwards (frames showed up as ?? in gdb).

I'm trying to prepare a standalone reproducible test case and will attach it
here as soon as I have it ready.

   ------- Additional comments from tholzer Thu Dec 18 19:44:34 -0700 2008 -------
After I see the following in the log, the qmaster stops responding to GDI
requests and memory usage climbs steadily until the limit (48GB) has been
reached (usually within 5-10 minutes).

12/19/2008 15:35:55|worker|yori|E|unable to find job 56 from the scheduler order package
12/19/2008 15:35:55|worker|yori|W|Skipping remaining 219 orders
12/19/2008 15:35:55|schedu|yori|E|unable to find job 56 from the scheduler order package
12/19/2008 15:35:56|worker|yori|E|scheduler tries to schedule job 1416.1 twice
12/19/2008 15:35:56|worker|yori|W|Skipping remaining 219 orders
12/19/2008 15:35:56|schedu|yori|E|scheduler tries to schedule job 1416.1 twice
12/19/2008 15:37:03|worker|yori|E|unable to find job 22 from the scheduler order package
12/19/2008 15:37:03|worker|yori|W|Skipping remaining 230 orders
12/19/2008 15:37:03|schedu|yori|E|unable to find job 22 from the scheduler order package
12/19/2008 15:37:05|schedu|yori|E|could not find job "56" in master list
12/19/2008 15:37:05|schedu|yori|E|callback function for event "29632. EVENT DEL JOB 56.1" failed
12/19/2008 15:38:12|schedu|yori|E|could not find job "22" in master list
12/19/2008 15:38:12|schedu|yori|E|callback function for event "34758. EVENT DEL JOB 22.1" failed

I'm still putting together the dependency tree for the jobs.

   ------- Additional comments from tholzer Sun Dec 21 20:53:39 -0700 2008 -------
Created an attachment (id=186)
hold_jids

   ------- Additional comments from tholzer Sun Dec 21 20:59:08 -0700 2008 -------
Hi,

I've attached two text files, bad.txt & good.txt. They specify the -hold_jid
list for each job in a sequence (starting with job id 1 in the first line,
which has no dependencies).

Submit the jobs via DRMAA with:

command = /bin/true
native specification =
   -o /dev/null -j y -b y -shell n -w n -wd /tmp  plus the -hold_jid line from the file

The file bad.txt contains an additional 21 jobs with a large dependency list.
This will send the qmaster into a spin. If these 21 jobs are removed (good.txt),
everything is fine.
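For reference, a minimal sketch of such a submission using the DRMAA C
binding; the literal -hold_jid value below is a placeholder for one line read
from bad.txt/good.txt, and error handling is reduced to the bare minimum:

    #include <stdio.h>
    #include "drmaa.h"

    int main(void)
    {
        char error[DRMAA_ERROR_STRING_BUFFER];
        char jobid[DRMAA_JOBNAME_BUFFER];
        drmaa_job_template_t *jt = NULL;

        if (drmaa_init(NULL, error, sizeof(error) - 1) != DRMAA_ERRNO_SUCCESS) {
            fprintf(stderr, "drmaa_init failed: %s\n", error);
            return 1;
        }

        drmaa_allocate_job_template(&jt, error, sizeof(error) - 1);

        /* command = /bin/true, as described above */
        drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/true",
                            error, sizeof(error) - 1);

        /* native specification from the report; the -hold_jid list is a
         * placeholder for one line taken from bad.txt or good.txt */
        drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
                            "-o /dev/null -j y -b y -shell n -w n -wd /tmp "
                            "-hold_jid 1,2,3",
                            error, sizeof(error) - 1);

        if (drmaa_run_job(jobid, sizeof(jobid) - 1, jt,
                          error, sizeof(error) - 1) == DRMAA_ERRNO_SUCCESS) {
            printf("submitted job %s\n", jobid);
        } else {
            fprintf(stderr, "drmaa_run_job failed: %s\n", error);
        }

        drmaa_delete_job_template(jt, error, sizeof(error) - 1);
        drmaa_exit(error, sizeof(error) - 1);
        return 0;
    }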

   ------- Additional comments from crei Wed Jan 7 08:26:21 -0700 2009 -------
Changed subcomponent. It turns out that this issue is more related to job
dependencies (or DRMAA).

Change History (0)
