[GE users] qrsh fails

Shannon V. Davidson svdavidson at charter.net
Mon Jan 16 15:12:57 GMT 2006



Jean-Paul,

It appears that you're blowing up in the ics_sdp module, which is the
Infiniband Sockets Direct Protocol driver.  If the sge_shepherd is
communicating with the master across IPoIB, you might try turning off
Sockets Direct or try running over ethernet.  You might also want to
report this problem to whoever supplied you with your Infiniband software.
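
For anyone reading a similar oops: the module name sits between the colons in the symbol annotation on the RIP line, and the call trace uses the same notation. The following is only a minimal sketch of pulling those names out of the log. It is not part of SGE or any kernel tooling; the function names are hypothetical, the sample lines are the wrapped log lines from the quoted message rejoined by hand, and the pattern assumes the {:module:symbol+offset} layout seen in this particular log.

```python
import re

# Symbol annotations as they appear in the quoted log:
#   {:module:symbol+offset}  code inside a loadable module
#   {symbol+offset}          code in the core kernel
# Assumption: other kernel versions may format these differently.
SYMBOL_RE = re.compile(
    r"\{(?::(?P<module>[^:}]+):)?(?P<symbol>[^+}]+)\+(?P<offset>\w+)\}"
)


def symbols_in(line):
    """Return (module, symbol, offset) tuples found in one log line.

    module is None for core-kernel symbols; offset is kept as text.
    """
    return [
        (m.group("module"), m.group("symbol"), m.group("offset"))
        for m in SYMBOL_RE.finditer(line)
    ]


def faulting_module(oops_lines):
    """Return the module named on the first RIP line that carries a symbol."""
    for line in oops_lines:
        if "RIP" in line:
            found = symbols_in(line)
            if found:
                return found[0][0]
    return None


# Sample lines rejoined from the wrapped log in the quoted message.
SAMPLE = [
    "kernel: RIP: 0010:[<ffffffffa0269581>] "
    "<ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}",
    "kernel: Call Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} "
    "<ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}",
    "kernel:        <ffffffff8030dede>{inet_shutdown+206} "
    "<ffffffff802c0fcc>{sys_shutdown+76}",
]

if __name__ == "__main__":
    print("faulting module:", faulting_module(SAMPLE))
    for line in SAMPLE[1:]:
        for module, symbol, offset in symbols_in(line):
            print(module or "(kernel)", symbol, offset)
```

Read bottom-up, the trace shows sge_shepherd calling shutdown() on a socket, the call passing through inet_shutdown into ics_sdp, and the fault occurring there.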

Shannon


Jean-Paul Minet wrote:

> Reuti,
>
>>> Just tried with a few hosts, and the behavior is the same...
>>>
>>
>> Okay, so it's not a hardware problem. Can you please check the
>> /var/log/messages on the nodes (not the messages file from SGE). What
>
>
> Here is the section which, I believe, is linked to the error observed:
>
> Jan 16 08:59:04 lmexec-92 kernel: Unable to handle kernel NULL pointer 
> dereference at 0000000000000108 RIP:
> Jan 16 08:59:04 lmexec-92 kernel: 
> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
> Jan 16 08:59:04 lmexec-92 kernel: PML4 c95df067 PGD 0
> Jan 16 08:59:04 lmexec-92 kernel: Oops: 0000 [7] SMP
> Jan 16 08:59:04 lmexec-92 kernel: CPU 0
> Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd 
> Tainted: GF U 2.6.5-7.97-smp
> Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] 
> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
> Jan 16 08:59:04 lmexec-92 kernel: RSP: 0018:00000100e55cfeb8  EFLAGS: 
> 00010246
> Jan 16 08:59:04 lmexec-92 kernel: RAX: 0000000000000007 RBX: 
> 0000010038ca6d80 RCX: 0000000000000000
> Jan 16 08:59:04 lmexec-92 kernel: RDX: 00000000ffffffea RSI: 
> 0000000000000800 RDI: 0000000000000000
> Jan 16 08:59:04 lmexec-92 kernel: RBP: 000000000000000a R08: 
> 00000000ffffffff R09: 00000000ffffffff
> Jan 16 08:59:04 lmexec-92 kernel: R10: 0000000000000000 R11: 
> 0000000000000206 R12: 000001006f9c4800
> Jan 16 08:59:04 lmexec-92 kernel: R13: 0000010038ca6d80 R14: 
> 0000000000000000 R15: 0000010038ca7088
> Jan 16 08:59:04 lmexec-92 kernel: FS:  0000002a95bfd8a0(0000) 
> GS:ffffffff804e7e00(0000) knlGS:000000005556c9a0
> Jan 16 08:59:04 lmexec-92 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 000000008005003b
> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108 CR3: 
> 0000000000101000 CR4: 00000000000006e0
> Jan 16 08:59:04 lmexec-92 kernel: Process sge_shepherd (pid: 25640, 
> threadinfo 00000100e55ce000, task 000001007939e490)
> Jan 16 08:59:04 lmexec-92 kernel: Stack: 0000010038ca6d80 
> ffffffffa02638bb 000000000000003b 0000010038ca6d80
> Jan 16 08:59:04 lmexec-92 kernel:        000000000000000a 
> 000001006f9c4800 0000010038ca7088 0000000000000800
> Jan 16 08:59:04 lmexec-92 kernel:        00000000401c67b0 
> ffffffffa0263af5
> Jan 16 08:59:04 lmexec-92 kernel: Call 
> Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} 
> <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}
> Jan 16 08:59:04 lmexec-92 kernel:        
> <ffffffff8030dede>{inet_shutdown+206} <ffffffff802c0fcc>{sys_shutdown+76}
> Jan 16 08:59:04 lmexec-92 kernel:        
> <ffffffff801106f4>{system_call+124}
> Jan 16 08:59:04 lmexec-92 kernel:
> Jan 16 08:59:04 lmexec-92 kernel: Code: 48 83 bf 08 01 00 00 00 48 89 
> fb 75 1a 31 c9 ba c3 02 00 00
> Jan 16 08:59:04 lmexec-92 kernel: RIP 
> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1} RSP <00000100e55cfeb8>
> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108
>
> it seems it involves sge_shepherd...
>
>> type of network card is installed, and which module is loaded for it?
>>
>> lsmod
>> lspci
>
>
> Newisys NetXtreme BCM5704 Gigabit Ethernet (the standard one on Sun 
> V20z) with "tg3" module loaded
>
> thanks for your help
>
> Jean-paul
>
>> might give you some hints. - Reuti
>>
>>>> - Is this new and worked before? As 9.0 isn't the latest of 9.x,  
>>>> I'd  assume that your cluster is already in operation for some  
>>>> time now.
>>>
>>>
>>>
>>> It never worked before.  Install is new; SGE configured and more or  
>>> less working, except bits and pieces here and there, among which  
>>> tight integration for mpich/ethernet interconnect; I have also  
>>> trouble with the infiniband interconnect integration: the patch for  
>>> mpich/infiniband and SGE tight integration, available on the HowTo  
>>> site, doesn't match the version of mpich supplied and customized by  
>>> the Infiniband vendor.  I am awaiting support from the Infiniband
>>> vendor to get the latest mpich/mvapich version installed/customized.
>>>
>>> thnks & rgds
>>>
>>> Jean-Paul
>>>
>>>> -- Reuti
>>>>
>>>>> Would someone have an idea on how to further debug the problem  
>>>>> (I  have tried using tcpdump between the submit host and the  
>>>>> target  host, as well as the qmaster host and the target host, to  
>>>>> dig into  communication bits, but it's getting complicated...)?
>>>>>
>>>>> Thks for any help
>>>>>
>>>>> Jean-paul
>>>>>
>>>>> ---- qrsh command and output ----
>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l
>>>>> mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using   
>>>>> global configuration
>>>>> your job 1788 ("date") has been submitted
>>>>> waiting for interactive job to be scheduled ...
>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to  
>>>>> host  lmexec-92 ...
>>>>> poll: protocol failure in circuit setup
>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>> reading exit code from shepherd ... 129
>>>>>
>>>>> -- 
>>>>> Jean-Paul Minet
>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  
>>>>> de  Masse
>>>>> Université Catholique de Louvain
>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>


-- 
____________________________________________

Shannon V. Davidson <svdavidson at charter.net>
Senior Software Engineer            Raytheon
636-479-7465 office         443-383-0331 fax
____________________________________________





