[GE users] qrsh fails

Shannon V. Davidson svdavidson at charter.net
Mon Jan 16 15:46:50 GMT 2006



Jean-Paul Minet wrote:

> Ooops, sorry, I messed up my reply....  we are not talking about user 
> programs using mpich but a simple qrsh command.
>
> I am puzzled by the fact that qrsh interacts somewhere with infiniband
> code.  How can that be? The command issued on the submit host is:
>
> qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>
> Now lmexec-92 is a DNS hostname, with a specific IP address reachable
> by ethernet, not IPoIB.  What would make sge_shepherd communicate
> across IPoIB?  Did I miss a config/parameter somewhere?


I can think of 2 reasons why sge_shepherd might be using IPoIB or SDP:

1. One of the hostnames being used by the sge_shepherd is resolving to 
an IPoIB address.
2. The SDP library is set in the LD_PRELOAD environment variable or in 
/etc/ld.so.preload.
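
Both possibilities can be checked on the execution host. Below is a rough
Python sketch, not a definitive diagnostic. The IPoIB subnet
(192.168.100.0/24) and the "sdp" substring used to spot an SDP preload
library are assumptions for illustration only; substitute whatever subnet
and library name your Infiniband stack actually uses.

```python
import ipaddress
import os
import socket

# Assumption: the IPoIB interfaces live in this subnet; replace with your own.
IPOIB_NETWORKS = [ipaddress.ip_network("192.168.100.0/24")]


def is_ipoib_address(addr, networks=IPOIB_NETWORKS):
    """Return True if addr falls inside one of the (assumed) IPoIB subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)


def resolved_addresses(hostname):
    """Return the sorted IPv4 addresses the resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def preload_entries(env=None, preload_file="/etc/ld.so.preload"):
    """Collect library paths from LD_PRELOAD and the ld.so preload file."""
    env = os.environ if env is None else env
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    try:
        with open(preload_file) as fh:
            for line in fh:
                line = line.split("#", 1)[0]  # drop trailing comments
                entries.extend(line.replace(":", " ").split())
    except OSError:
        pass  # a missing or unreadable file just means nothing is preloaded there
    return entries


def sdp_preloads(entries):
    """Keep entries whose file name contains 'sdp' (hypothetical naming)."""
    return [e for e in entries if "sdp" in os.path.basename(e).lower()]


if __name__ == "__main__":
    # Reason 1: does any hostname in play resolve to an IPoIB address?
    for host in ("lmexec-92", socket.gethostname()):
        for addr in resolved_addresses(host):
            tag = "IPoIB?" if is_ipoib_address(addr) else "ok"
            print(f"{host} -> {addr} [{tag}]")
    # Reason 2: is an SDP library being preloaded?
    print("SDP preloads:", sdp_preloads(preload_entries()))
```

Run it on the execution host and also on the qmaster and submit hosts, since
each may resolve the same name differently (for example through a local
/etc/hosts entry).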

Shannon

>
> Jean-paul
>
> Shannon V. Davidson wrote:
>
>> Jean-Paul,
>>
>> It appears that you're blowing up in the ics_sdp module, which is
>> the Infiniband Sockets Direct Protocol driver.  If the sge_shepherd
>> is communicating with the master across IPoIB, you might try turning
>> off Sockets Direct or try running over ethernet.  You might also want
>> to report this problem to whoever supplied you with your Infiniband
>> software.
>>
>> Shannon
>>
>>
>> Jean-Paul Minet wrote:
>>
>>> Reuti,
>>>
>>>>> Just tried with a few hosts, and the behavior is the same...
>>>>>
>>>>
>>>> Okay, so it's not a hardware problem. Can you please check
>>>> /var/log/messages on the nodes (not the messages file from SGE)?
>>>> What
>>>
>>>
>>>
>>>
>>> Here is the section which, I believe, is linked to the error observed:
>>>
>>> Jan 16 08:59:04 lmexec-92 kernel: Unable to handle kernel NULL 
>>> pointer dereference at 0000000000000108 RIP:
>>> Jan 16 08:59:04 lmexec-92 kernel: 
>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>>> Jan 16 08:59:04 lmexec-92 kernel: PML4 c95df067 PGD 0
>>> Jan 16 08:59:04 lmexec-92 kernel: Oops: 0000 [7] SMP
>>> Jan 16 08:59:04 lmexec-92 kernel: CPU 0
>>> Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd 
>>> Tainted: GF U 2.6.5-7.97-smp
>>> Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] 
>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>>> Jan 16 08:59:04 lmexec-92 kernel: RSP: 0018:00000100e55cfeb8  
>>> EFLAGS: 00010246
>>> Jan 16 08:59:04 lmexec-92 kernel: RAX: 0000000000000007 RBX: 
>>> 0000010038ca6d80 RCX: 0000000000000000
>>> Jan 16 08:59:04 lmexec-92 kernel: RDX: 00000000ffffffea RSI: 
>>> 0000000000000800 RDI: 0000000000000000
>>> Jan 16 08:59:04 lmexec-92 kernel: RBP: 000000000000000a R08: 
>>> 00000000ffffffff R09: 00000000ffffffff
>>> Jan 16 08:59:04 lmexec-92 kernel: R10: 0000000000000000 R11: 
>>> 0000000000000206 R12: 000001006f9c4800
>>> Jan 16 08:59:04 lmexec-92 kernel: R13: 0000010038ca6d80 R14: 
>>> 0000000000000000 R15: 0000010038ca7088
>>> Jan 16 08:59:04 lmexec-92 kernel: FS:  0000002a95bfd8a0(0000) 
>>> GS:ffffffff804e7e00(0000) knlGS:000000005556c9a0
>>> Jan 16 08:59:04 lmexec-92 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
>>> 000000008005003b
>>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108 CR3: 
>>> 0000000000101000 CR4: 00000000000006e0
>>> Jan 16 08:59:04 lmexec-92 kernel: Process sge_shepherd (pid: 25640, 
>>> threadinfo 00000100e55ce000, task 000001007939e490)
>>> Jan 16 08:59:04 lmexec-92 kernel: Stack: 0000010038ca6d80 
>>> ffffffffa02638bb 000000000000003b 0000010038ca6d80
>>> Jan 16 08:59:04 lmexec-92 kernel:        000000000000000a 
>>> 000001006f9c4800 0000010038ca7088 0000000000000800
>>> Jan 16 08:59:04 lmexec-92 kernel:        00000000401c67b0 
>>> ffffffffa0263af5
>>> Jan 16 08:59:04 lmexec-92 kernel: Call 
>>> Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} 
>>> <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}
>>> Jan 16 08:59:04 lmexec-92 kernel:        
>>> <ffffffff8030dede>{inet_shutdown+206} 
>>> <ffffffff802c0fcc>{sys_shutdown+76}
>>> Jan 16 08:59:04 lmexec-92 kernel:        
>>> <ffffffff801106f4>{system_call+124}
>>> Jan 16 08:59:04 lmexec-92 kernel:
>>> Jan 16 08:59:04 lmexec-92 kernel: Code: 48 83 bf 08 01 00 00 00 48 
>>> 89 fb 75 1a 31 c9 ba c3 02 00 00
>>> Jan 16 08:59:04 lmexec-92 kernel: RIP 
>>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1} RSP <00000100e55cfeb8>
>>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108
>>>
>>> it seems it involves sge_shepherd...
>>>
>>>> type of network card is installed, and which module is loaded for it?
>>>>
>>>> lsmod
>>>> lspci
>>>
>>>
>>>
>>>
>>> Newisys NetXtreme BCM5704 Gigabit Ethernet (the standard one on Sun 
>>> V20z) with "tg3" module loaded
>>>
>>> thanks for your help
>>>
>>> Jean-paul
>>>
>>>> might give you some hints. - Reuti
>>>>
>>>>>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x,
>>>>>> I'd assume that your cluster has already been in operation for some
>>>>>> time now.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> It never worked before.  The install is new; SGE is configured and
>>>>> more or less working, except for bits and pieces here and there,
>>>>> among which is tight integration for the mpich/ethernet interconnect.
>>>>> I also have trouble with the infiniband interconnect integration: the
>>>>> patch for mpich/infiniband and SGE tight integration, available
>>>>> on the HowTo site, doesn't match the version of mpich supplied
>>>>> and customized by the Infiniband vendor.  I am awaiting support
>>>>> from the Infiniband vendor to get the latest mpich/mvapich version
>>>>> installed/customized.
>>>>>
>>>>> thnks & rgds
>>>>>
>>>>> Jean-Paul
>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> Would someone have an idea on how to further debug the problem
>>>>>>> (I have tried using tcpdump between the submit host and the
>>>>>>> target host, as well as the qmaster host and the target host,
>>>>>>> to dig into communication bits, but it's getting complicated...)?
>>>>>>>
>>>>>>> Thks for any help
>>>>>>>
>>>>>>> Jean-paul
>>>>>>>
>>>>>>> ---- qrsh command and output ----
>>>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>>>>>>> your job 1788 ("date") has been submitted
>>>>>>> waiting for interactive job to be scheduled ...
>>>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>>>>> poll: protocol failure in circuit setup
>>>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>>>> reading exit code from shepherd ... 129
>>>>>>>
>>>>>>> -- 
>>>>>>> Jean-Paul Minet
>>>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  
>>>>>>> de  Masse
>>>>>>> Université Catholique de Louvain
>>>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>>>
>>>>>>> -------------------------------------------------------------------- 
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: 
>>>>>>> users-help at gridengine.sunsource.net


-- 
____________________________________________

Shannon V. Davidson <svdavidson at charter.net>
Senior Software Engineer            Raytheon
636-479-7465 office         443-383-0331 fax
____________________________________________





