[GE issues] [Issue 2795] New - qmaster segmentation fault in cl_commlib_receive_message

tholzer tholzer at wetafx.co.nz
Thu Nov 20 02:58:51 GMT 2008


http://gridengine.sunsource.net/issues/show_bug.cgi?id=2795
                 Issue #|2795
                 Summary|qmaster segmentation fault in cl_commlib_receive_message
               Component|gridengine
                 Version|6.1u4
                Platform|PC
                     URL|
              OS/Version|Linux
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P2
            Subcomponent|communication
             Assigned to|crei
             Reported by|tholzer

------- Additional comments from tholzer at sunsource.net Wed Nov 19 18:58:48 -0800 2008 -------
Hi,

We occasionally get a qmaster core dump on a heavily loaded system with a large
number of connected clients (>1000):

# gdb $SGE_ROOT/bin/lx26-amd64/sge_qmaster core
...
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000554de2 in cl_commlib_receive_message (handle=0x702a10,
un_resolved_hostname=0x44d96fa0 "", component_name=0x44d96fe0 "",
component_id=0, synchron=CL_TRUE, response_mid=0, message=0x44d96e98,
    sender=0x44d96e90) at ../libs/comm/cl_commlib.c:4845
4845                            *sender = cl_com_create_endpoint(connection->receiver->comp_host,
(gdb) bt
#0  0x0000000000554de2 in cl_commlib_receive_message (handle=0x702a10,
un_resolved_hostname=0x44d96fa0 "", component_name=0x44d96fe0 "",
component_id=0, synchron=CL_TRUE, response_mid=0, message=0x44d96e98,
    sender=0x44d96e90) at ../libs/comm/cl_commlib.c:4845
#1  0x00000000004c4845 in sge_gdi2_get_any_request (ctx=0x2da95c0,
rhost=0x44d96fa0 "", commproc=0x44d96fe0 "", id=0x44d97020, pb=0x44d97030,
tag=0x44d97024, synchron=1, for_request_mid=0, mid=0x44d97028)
    at ../libs/gdi/sge_gdi2.c:984
#2  0x000000000047afc3 in sge_qmaster_process_message (ctx=0x2da95c0,
anArg=0x58b891, monitor=0x44d970a0) at
../daemons/qmaster/sge_qmaster_process_message.c:419
#3  0x0000000000428fa1 in message_thread (anArg=0x58b891) at
../daemons/qmaster/sge_qmaster_threads.c:926
#4  0x0000003f3d2062f7 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb) print connection->receiver->comp_host
Cannot access memory at address 0x0
(gdb) print connection->receiver
$1 = (cl_com_endpoint_t *) 0x0
(gdb) print connection
$2 = (cl_com_connection_t *) 0x7f3ed54967a0

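The gdb output shows that connection itself is a valid pointer, but
connection->receiver is NULL when cl_commlib_receive_message dereferences it at
cl_commlib.c:4845. It seems that connection->receiver gets set to NULL somewhere.
This only happens every few weeks and we don't have a reliable way to reproduce it.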
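
To illustrate the failing pattern, here is a minimal sketch of a defensive check before the dereference. It is not the actual commlib code: the types endpoint_sketch_t and connection_sketch_t and the helper copy_sender_checked are hypothetical, simplified stand-ins for cl_com_endpoint_t, cl_com_connection_t and the assignment at line 4845. A check like this would only avoid the crash; it does not explain why receiver is NULL in the first place.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the commlib types.
 * The real cl_com_endpoint_t and cl_com_connection_t have more fields. */
typedef struct {
    char *comp_host;
} endpoint_sketch_t;

typedef struct {
    endpoint_sketch_t *receiver;
} connection_sketch_t;

/* Hypothetical helper: build a copy of the connection's receiver endpoint
 * in *sender. Returns 0 on success, -1 if the connection has no usable
 * receiver or if memory allocation fails. *sender is NULL on failure. */
static int copy_sender_checked(const connection_sketch_t *connection,
                               endpoint_sketch_t **sender)
{
    assert(sender != NULL);
    *sender = NULL;

    /* This is the check that is missing in the crashing code path:
     * receiver may be NULL even though connection is valid. */
    if (connection == NULL || connection->receiver == NULL ||
        connection->receiver->comp_host == NULL) {
        return -1;
    }

    endpoint_sketch_t *copy = malloc(sizeof *copy);
    if (copy == NULL) {
        return -1;
    }

    size_t len = strlen(connection->receiver->comp_host) + 1;
    copy->comp_host = malloc(len);
    if (copy->comp_host == NULL) {
        free(copy);
        return -1;
    }
    memcpy(copy->comp_host, connection->receiver->comp_host, len);

    *sender = copy;
    return 0;
}
```

With a connection whose receiver is NULL, as in the core dump above, the helper returns -1 instead of dereferencing the NULL pointer, so the caller could skip or log the message rather than crash.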

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=89159
