[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Mon Jan 16 15:33:19 GMT 2006


Oops, sorry, I messed up my reply... We are not talking about user programs
using mpich, but about a simple qrsh command.

I am puzzled by the fact that qrsh interacts somewhere with Infiniband code.
How can that be? The command issued on the submit host is:

qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date

Now lmexec-92 is a DNS hostname, with a specific IP address reachable by
ethernet, not IPoIB.  What would make sge_shepherd communicate across IPoIB?
Did I miss a config/parameter somewhere?
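
A minimal sketch of one way to check the name-resolution part of this, assuming
Python is available on the submit host: resolve the name and see which subnet
each returned address falls in. The subnet ranges and labels below are
hypothetical placeholders, not values from this cluster; substitute the real
ethernet and IPoIB ranges. This only shows what the resolver returns. It does
not show whether a Sockets Direct layer takes over the socket afterwards.

```python
import ipaddress
import socket

# Hypothetical subnets for illustration only; replace with the real ranges.
NETWORKS = {
    "ethernet": "192.168.1.0/24",
    "ipoib": "10.10.0.0/16",
}


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the resolver gives."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def classify(address, networks):
    """Return the label of the first network containing address, else None."""
    ip = ipaddress.ip_address(address)
    for label, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return label
    return None


if __name__ == "__main__":
    # lmexec-92 is the execution host named in the qrsh command above.
    for addr in resolve_ipv4("lmexec-92"):
        print(addr, classify(addr, NETWORKS))
```

If every address printed is labelled as ethernet, the hostname itself is not
what sends the traffic over IPoIB, and the Infiniband software stack on the
node is the next place to look.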

Jean-paul

Shannon V. Davidson wrote:
> Jean-Paul,
> 
> It appears that you're blowing up in the ics_sdp module, which is the
> Infiniband Sockets Direct Protocol driver.  If the sge_shepherd is
> communicating with the master across IPoIB, you might try turning off
> Sockets Direct or try running over ethernet.  You might also want to
> report this problem to whoever supplied you with your Infiniband software.
> 
> Shannon
> 
> 
> Jean-Paul Minet wrote:
> 
>> Reuti,
>>
>>>> Just tried with a few hosts, and the behavior is the same...
>>>>
>>>
>>> Okay, so it's not a hardware problem. Can you please check
>>> /var/log/messages on the nodes (not the messages file from SGE). What
>>
>>
>>
>> Here is the section which, I believe, is linked to the error observed:
>>
>> Jan 16 08:59:04 lmexec-92 kernel: Unable to handle kernel NULL pointer 
>> dereference at 0000000000000108 RIP:
>> Jan 16 08:59:04 lmexec-92 kernel: 
>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>> Jan 16 08:59:04 lmexec-92 kernel: PML4 c95df067 PGD 0
>> Jan 16 08:59:04 lmexec-92 kernel: Oops: 0000 [7] SMP
>> Jan 16 08:59:04 lmexec-92 kernel: CPU 0
>> Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd 
>> Tainted: GF U 2.6.5-7.97-smp
>> Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] 
>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
>> Jan 16 08:59:04 lmexec-92 kernel: RSP: 0018:00000100e55cfeb8  EFLAGS: 
>> 00010246
>> Jan 16 08:59:04 lmexec-92 kernel: RAX: 0000000000000007 RBX: 
>> 0000010038ca6d80 RCX: 0000000000000000
>> Jan 16 08:59:04 lmexec-92 kernel: RDX: 00000000ffffffea RSI: 
>> 0000000000000800 RDI: 0000000000000000
>> Jan 16 08:59:04 lmexec-92 kernel: RBP: 000000000000000a R08: 
>> 00000000ffffffff R09: 00000000ffffffff
>> Jan 16 08:59:04 lmexec-92 kernel: R10: 0000000000000000 R11: 
>> 0000000000000206 R12: 000001006f9c4800
>> Jan 16 08:59:04 lmexec-92 kernel: R13: 0000010038ca6d80 R14: 
>> 0000000000000000 R15: 0000010038ca7088
>> Jan 16 08:59:04 lmexec-92 kernel: FS:  0000002a95bfd8a0(0000) 
>> GS:ffffffff804e7e00(0000) knlGS:000000005556c9a0
>> Jan 16 08:59:04 lmexec-92 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
>> 000000008005003b
>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108 CR3: 
>> 0000000000101000 CR4: 00000000000006e0
>> Jan 16 08:59:04 lmexec-92 kernel: Process sge_shepherd (pid: 25640, 
>> threadinfo 00000100e55ce000, task 000001007939e490)
>> Jan 16 08:59:04 lmexec-92 kernel: Stack: 0000010038ca6d80 
>> ffffffffa02638bb 000000000000003b 0000010038ca6d80
>> Jan 16 08:59:04 lmexec-92 kernel:        000000000000000a 
>> 000001006f9c4800 0000010038ca7088 0000000000000800
>> Jan 16 08:59:04 lmexec-92 kernel:        00000000401c67b0 
>> ffffffffa0263af5
>> Jan 16 08:59:04 lmexec-92 kernel: Call 
>> Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} 
>> <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}
>> Jan 16 08:59:04 lmexec-92 kernel:        
>> <ffffffff8030dede>{inet_shutdown+206} <ffffffff802c0fcc>{sys_shutdown+76}
>> Jan 16 08:59:04 lmexec-92 kernel:        
>> <ffffffff801106f4>{system_call+124}
>> Jan 16 08:59:04 lmexec-92 kernel:
>> Jan 16 08:59:04 lmexec-92 kernel: Code: 48 83 bf 08 01 00 00 00 48 89 
>> fb 75 1a 31 c9 ba c3 02 00 00
>> Jan 16 08:59:04 lmexec-92 kernel: RIP 
>> <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1} RSP <00000100e55cfeb8>
>> Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108
>>
>> It seems to involve sge_shepherd...
>>
>>> type of network card is installed, and which module is loaded for it?
>>>
>>> lsmod
>>> lspci
>>
>>
>>
>> Newisys NetXtreme BCM5704 Gigabit Ethernet (the standard one on Sun 
>> V20z) with "tg3" module loaded
>>
>> thanks for your help
>>
>> Jean-paul
>>
>>> might give you some hints. - Reuti
>>>
>>>>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x,
>>>>> I'd assume that your cluster has already been in operation for some
>>>>> time now.
>>>>
>>>>
>>>>
>>>>
>>>> It never worked before.  The install is new; SGE is configured and more or
>>>> less working, except for bits and pieces here and there, among them
>>>> tight integration for the mpich/ethernet interconnect.  I also have
>>>> trouble with the Infiniband interconnect integration: the patch for
>>>> mpich/infiniband and SGE tight integration, available on the HowTo
>>>> site, doesn't match the version of mpich supplied and customized by
>>>> the Infiniband vendor.  I am awaiting support from the Infiniband
>>>> vendor to get the latest mpich/mvapich version installed/customized.
>>>>
>>>> thnks & rgds
>>>>
>>>> Jean-Paul
>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Would someone have an idea on how to further debug the problem
>>>>>> (I have tried using tcpdump between the submit host and the
>>>>>> target host, as well as between the qmaster host and the target host,
>>>>>> to dig into the communication, but it's getting complicated...)?
>>>>>>
>>>>>> Thks for any help
>>>>>>
>>>>>> Jean-paul
>>>>>>
>>>>>> ---- qrsh command and output ----
>>>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>>>>>> your job 1788 ("date") has been submitted
>>>>>> waiting for interactive job to be scheduled ...
>>>>>> Your interactive job 1788 has been successfully scheduled.
>>>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>>>> poll: protocol failure in circuit setup
>>>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>>>> reading exit code from shepherd ... 129
>>>>>>
>>>>>> -- 
>>>>>> Jean-Paul Minet
>>>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  
>>>>>> de  Masse
>>>>>> Université Catholique de Louvain
>>>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
