[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Wed Jan 18 16:41:35 GMT 2006


Shannon,

Actually, there is indeed a bug in the ics_sdp driver.  Disabling SDP sorted out 
the problem.
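
For reference, the oops quoted further down already names the module. Frames printed as {:module:symbol+offset} appear to come from a loadable module, while frames such as {inet_shutdown+206} are core kernel code. Below is a minimal Python sketch that pulls this out of a /var/log/messages excerpt. The function names and the regular expression are my own illustrative assumptions, based only on the log format seen in this thread; they are not part of any SGE or kernel tool.

```python
import re

# Matches frames such as "{:ics_sdp:Sdp_StopListen+1}" (module-qualified)
# and "{inet_shutdown+206}" (no module prefix, i.e. core kernel symbol).
FRAME = re.compile(r"\{(?::(\w+):)?(\w+)\+(\d+)\}")


def frames(log_text):
    """Return (module, symbol, offset) tuples in order of appearance.

    module is None when the frame carries no module prefix.
    """
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in FRAME.finditer(log_text)]


def modules_in_trace(log_text):
    """Return the distinct module names seen in the trace, in first-seen order."""
    seen = []
    for module, _symbol, _offset in frames(log_text):
        if module is not None and module not in seen:
            seen.append(module)
    return seen


# Excerpt taken from the oops quoted in this thread.
SAMPLE = (
    "<ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}\n"
    "Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} "
    "<ffffffff8030dede>{inet_shutdown+206}\n"
)

if __name__ == "__main__":
    print(modules_in_trace(SAMPLE))
```

Running it on the excerpt lists only ics_sdp, which matches the diagnosis: the shutdown() call issued by sge_shepherd ended up in SDP code.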

FYI (I hadn't understood this before): when loaded, SDP diverts all application 
TCP calls away from the normal TCP/IP stack and handles them directly through 
RDMA.  So all applications are affected, whether or not they use IPoIB, which 
explains why sge_shepherd ended up running SDP module code.
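
In case it helps someone hitting the same symptom, here is a rough Python sketch of the check Shannon suggests in the quoted message below: look for an SDP library in LD_PRELOAD or /etc/ld.so.preload, and see whether an SDP kernel module is loaded. Matching on the substring "sdp" is an assumption on my part (the actual library name depends on the InfiniBand vendor stack), and all function names are hypothetical.

```python
import os
from pathlib import Path


def preload_entries(env_value, preload_file_text):
    """Collect library names from an LD_PRELOAD value and ld.so.preload text."""
    # LD_PRELOAD items may be separated by spaces or colons.
    entries = env_value.replace(":", " ").split()
    # ld.so.preload is whitespace-separated; '#' starts a comment.
    for line in preload_file_text.splitlines():
        entries += line.split("#", 1)[0].replace(":", " ").split()
    return entries


def sdp_preloads(entries, needle="sdp"):
    """Return entries whose file name contains the needle (assumed marker)."""
    return [e for e in entries if needle in os.path.basename(e).lower()]


def loaded_modules(proc_modules_text):
    """Return module names from /proc/modules-style text (first column)."""
    return [line.split()[0]
            for line in proc_modules_text.splitlines() if line.strip()]


def read_text(path):
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    entries = preload_entries(os.environ.get("LD_PRELOAD", ""),
                              read_text("/etc/ld.so.preload"))
    print("SDP-looking preloads:", sdp_preloads(entries))
    print("SDP-looking modules:",
          [m for m in loaded_modules(read_text("/proc/modules"))
           if "sdp" in m.lower()])
```

Note that this only sees the LD_PRELOAD of the shell it is run from. The environment of the running execution daemon may differ, so it is worth checking that as well.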

Cheers and thanks again for your help

Jean-Paul


Shannon V. Davidson wrote:
> Jean-Paul Minet wrote:
> 
>> Ooops, sorry, I messed up my reply....  we are not talking about user 
>> programs using mpich but a simple qrsh command.
>>
>> I am puzzled by the fact that qrsh interacts somewhere with InfiniBand 
>> code.  How can that be? The command issued on the submit host is:
>>
>> qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>
>> Now lmexec-92 is a DNS hostname, with a specific IP address reachable 
>> by Ethernet, not IPoIB.  What would make sge_shepherd communicate 
>> across IPoIB?  Did I miss a config/parameter somewhere?
> 
> 
> 
> I can think of 2 reasons why sge_shepherd might be using IPoIB or SDP:
> 
> 1. One of the hostnames being used by the sge_shepherd is resolving to 
> an IPoIB address.
> 2. The SDP library is set in the LD_PRELOAD environment variable or in 
> /etc/ld.so.preload.
> 
> Shannon
> 
>>
>> Jean-paul
>>
>> Shannon V. Davidson wrote:
>>
>>> Jean-Paul,
>>>
>>> It appears that you're blowing up in the ics_sdp module, which is 
>>> the InfiniBand Sockets Direct Protocol driver.  If the sge_shepherd 
>>> is communicating with the master across IPoIB, you might try turning 
>>> off Sockets Direct or try running over Ethernet.  You might also want 
>>> to report this problem to whoever supplied you with your InfiniBand 
>>> software.
>>>
>>> Shannon
>>>
>>>
>>> Jean-Paul Minet wrote:
>>>
>>>> Reuti,
>>>>
>>>>>> Just tried with a few hosts, and the behavior is the same...
>>>>>>
>>>>>
>>>>> Okay, so it's not a hardware problem. Can you please check 
>>>>> /var/log/messages on the nodes (not the messages file from SGE)? 
>>>>> What  
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Here is the section which, I believe, is linked to the error observed:
>>>>
>>>> Jan 16 08:59:04 lmexec-92 kernel: Unable to handle kernel NULL 
>>>> pointer dereference at 0000000000000108 RIP:
>>>> Jan 16 08:59:04 lmexec-92 kernel: 
>>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>>>> Jan 16 08:59:04 lmexec-92 kernel: PML4 c95df067 PGD 0
>>>> Jan 16 08:59:04 lmexec-92 kernel: Oops: 0000 [7] SMP
>>>> Jan 16 08:59:04 lmexec-92 kernel: CPU 0
>>>> Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd 
>>>> Tainted: GF U 2.6.5-7.97-smp
>>>> Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] 
>>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>>>> Jan 16 08:59:04 lmexec-92 kernel: RSP: 0018:00000100e55cfeb8  
>>>> EFLAGS: 00010246
>>>> Jan 16 08:59:04 lmexec-92 kernel: RAX: 0000000000000007 RBX: 
>>>> 0000010038ca6d80 RCX: 0000000000000000
>>>> Jan 16 08:59:04 lmexec-92 kernel: RDX: 00000000ffffffea RSI: 
>>>> 0000000000000800 RDI: 0000000000000000
>>>> Jan 16 08:59:04 lmexec-92 kernel: RBP: 000000000000000a R08: 
>>>> 00000000ffffffff R09: 00000000ffffffff
>>>> Jan 16 08:59:04 lmexec-92 kernel: R10: 0000000000000000 R11: 
>>>> 0000000000000206 R12: 000001006f9c4800
>>>> Jan 16 08:59:04 lmexec-92 kernel: R13: 0000010038ca6d80 R14: 
>>>> 0000000000000000 R15: 0000010038ca7088
>>>> Jan 16 08:59:04 lmexec-92 kernel: FS:  0000002a95bfd8a0(0000) 
>>>> GS:ffffffff804e7e00(0000) knlGS:000000005556c9a0
>>>> Jan 16 08:59:04 lmexec-92 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
>>>> 000000008005003b
>>>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108 CR3: 
>>>> 0000000000101000 CR4: 00000000000006e0
>>>> Jan 16 08:59:04 lmexec-92 kernel: Process sge_shepherd (pid: 25640, 
>>>> threadinfo 00000100e55ce000, task 000001007939e490)
>>>> Jan 16 08:59:04 lmexec-92 kernel: Stack: 0000010038ca6d80 
>>>> ffffffffa02638bb 000000000000003b 0000010038ca6d80
>>>> Jan 16 08:59:04 lmexec-92 kernel:        000000000000000a 
>>>> 000001006f9c4800 0000010038ca7088 0000000000000800
>>>> Jan 16 08:59:04 lmexec-92 kernel:        00000000401c67b0 
>>>> ffffffffa0263af5
>>>> Jan 16 08:59:04 lmexec-92 kernel: Call 
>>>> Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} 
>>>> <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}
>>>> Jan 16 08:59:04 lmexec-92 kernel:        
>>>> <ffffffff8030dede>{inet_shutdown+206} 
>>>> <ffffffff802c0fcc>{sys_shutdown+76}
>>>> Jan 16 08:59:04 lmexec-92 kernel:        
>>>> <ffffffff801106f4>{system_call+124}
>>>> Jan 16 08:59:04 lmexec-92 kernel:
>>>> Jan 16 08:59:04 lmexec-92 kernel: Code: 48 83 bf 08 01 00 00 00 48 
>>>> 89 fb 75 1a 31 c9 ba c3 02 00 00
>>>> Jan 16 08:59:04 lmexec-92 kernel: RIP 
>>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1} RSP <00000100e55cfeb8>
>>>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108
>>>>
>>>> it seems it involves sge_shepherd...
>>>>
>>>>> type of network card is installed, and which module is loaded for it?
>>>>>
>>>>> lsmod
>>>>> lspci
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Newisys NetXtreme BCM5704 Gigabit Ethernet (the standard one on Sun 
>>>> V20z) with "tg3" module loaded
>>>>
>>>> thanks for your help
>>>>
>>>> Jean-paul
>>>>
>>>>> might give you some hints. - Reuti
>>>>>
>>>>>>> - Is this new and worked before? As 9.0 isn't the latest of 9.x,  
>>>>>>> I'd  assume that your cluster is already in operation for some  
>>>>>>> time now.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> It never worked before.  The install is new; SGE is configured and 
>>>>>> more or less working, except for bits and pieces here and there, 
>>>>>> among which is tight integration for the mpich/Ethernet 
>>>>>> interconnect.  I also have trouble with the InfiniBand interconnect 
>>>>>> integration: the patch for mpich/InfiniBand and SGE tight 
>>>>>> integration, available on the HowTo site, doesn't match the version 
>>>>>> of mpich supplied and customized by the InfiniBand vendor.  I am 
>>>>>> awaiting support from the InfiniBand vendor to get the latest 
>>>>>> mpich/mvapich version installed/customized.
>>>>>>
>>>>>> thnks & rgds
>>>>>>
>>>>>> Jean-Paul
>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> Would someone have an idea on how to further debug the problem  
>>>>>>>> (I  have tried using tcpdump between the submit host and the  
>>>>>>>> target  host, as well as the qmaster host and the target host, 
>>>>>>>> to  dig into  communication bits, but it's getting complicated...)?
>>>>>>>>
>>>>>>>> Thks for any help
>>>>>>>>
>>>>>>>> Jean-paul
>>>>>>>>
>>>>>>>> ---- qrsh command and output ----
>>>>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l 
>>>>>>>> mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - 
>>>>>>>> using   global configuration
>>>>>>>> your job 1788 ("date") has been submitted
>>>>>>>> waiting for interactive job to be scheduled ...
>>>>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to  
>>>>>>>> host  lmexec-92 ...
>>>>>>>> poll: protocol failure in circuit setup
>>>>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>>>>> reading exit code from shepherd ... 129
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Jean-Paul Minet
>>>>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  
>>>>>>>> de  Masse
>>>>>>>> Université Catholique de Louvain
>>>>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>> For additional commands, e-mail: 
>>>>>>>> users-help at gridengine.sunsource.net

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



