[GE users] qrsh fails

Jean-Paul Minet minet at cism.ucl.ac.be
Mon Jan 16 14:48:28 GMT 2006


Reuti,

>> Just tried with a few hosts, and the behavior is the same...
>>
> 
> Okay, so it's not a hardware problem. Can you please check the
> /var/log/messages on the nodes (not the messages file from SGE). What

Here is the section which, I believe, is linked to the observed error:

Jan 16 08:59:04 lmexec-92 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000108 RIP:
Jan 16 08:59:04 lmexec-92 kernel: <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
Jan 16 08:59:04 lmexec-92 kernel: PML4 c95df067 PGD 0
Jan 16 08:59:04 lmexec-92 kernel: Oops: 0000 [7] SMP
Jan 16 08:59:04 lmexec-92 kernel: CPU 0
Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd Tainted: GF U 2.6.5-7.97-smp
Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}
Jan 16 08:59:04 lmexec-92 kernel: RSP: 0018:00000100e55cfeb8  EFLAGS: 00010246
Jan 16 08:59:04 lmexec-92 kernel: RAX: 0000000000000007 RBX: 0000010038ca6d80 RCX: 0000000000000000
Jan 16 08:59:04 lmexec-92 kernel: RDX: 00000000ffffffea RSI: 0000000000000800 RDI: 0000000000000000
Jan 16 08:59:04 lmexec-92 kernel: RBP: 000000000000000a R08: 00000000ffffffff R09: 00000000ffffffff
Jan 16 08:59:04 lmexec-92 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 000001006f9c4800
Jan 16 08:59:04 lmexec-92 kernel: R13: 0000010038ca6d80 R14: 0000000000000000 R15: 0000010038ca7088
Jan 16 08:59:04 lmexec-92 kernel: FS:  0000002a95bfd8a0(0000) GS:ffffffff804e7e00(0000) knlGS:000000005556c9a0
Jan 16 08:59:04 lmexec-92 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108 CR3: 0000000000101000 CR4: 00000000000006e0
Jan 16 08:59:04 lmexec-92 kernel: Process sge_shepherd (pid: 25640, threadinfo 00000100e55ce000, task 000001007939e490)
Jan 16 08:59:04 lmexec-92 kernel: Stack: 0000010038ca6d80 ffffffffa02638bb 000000000000003b 0000010038ca6d80
Jan 16 08:59:04 lmexec-92 kernel:        000000000000000a 000001006f9c4800 0000010038ca7088 0000000000000800
Jan 16 08:59:04 lmexec-92 kernel:        00000000401c67b0 ffffffffa0263af5
Jan 16 08:59:04 lmexec-92 kernel: Call Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}
Jan 16 08:59:04 lmexec-92 kernel:        <ffffffff8030dede>{inet_shutdown+206} <ffffffff802c0fcc>{sys_shutdown+76}
Jan 16 08:59:04 lmexec-92 kernel:        <ffffffff801106f4>{system_call+124}
Jan 16 08:59:04 lmexec-92 kernel:
Jan 16 08:59:04 lmexec-92 kernel: Code: 48 83 bf 08 01 00 00 00 48 89 fb 75 1a 31 c9 ba c3 02 00 00
Jan 16 08:59:04 lmexec-92 kernel: RIP <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1} RSP <00000100e55cfeb8>
Jan 16 08:59:04 lmexec-92 kernel: CR2: 0000000000000108

It seems to involve sge_shepherd...
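
To make it easier to see which process and which loadable module the oops points at, here is a minimal Python sketch that pulls those fields out of the log. It assumes the oops format shown above, with the process name after "comm:" and module symbols printed as {:module:symbol+offset}. The function name summarize_oops and the two regular expressions are my own choices for illustration, not part of any SGE or kernel tool.

```python
import re

# Process name as printed on the "Pid: ..., comm: <name>" line.
COMM_RE = re.compile(r"comm:\s+(\S+)")
# Symbols inside a loadable module are printed as {:module:symbol+offset};
# symbols in the core kernel (e.g. {inet_shutdown+206}) have no leading colon.
MODSYM_RE = re.compile(r"\{:(\w+):(\w+)\+(\d+)\}")


def summarize_oops(lines):
    """Return (process name, unique (module, symbol) pairs in order seen)."""
    comm = None
    frames = []
    for line in lines:
        if comm is None:
            match = COMM_RE.search(line)
            if match:
                comm = match.group(1)
        for module, symbol, _offset in MODSYM_RE.findall(line):
            if (module, symbol) not in frames:
                frames.append((module, symbol))
    return comm, frames


# A few lines taken from the log above (wrapped lines joined).
sample = [
    "Jan 16 08:59:04 lmexec-92 kernel: Pid: 25640, comm: sge_shepherd Tainted: GF U 2.6.5-7.97-smp",
    "Jan 16 08:59:04 lmexec-92 kernel: RIP: 0010:[<ffffffffa0269581>] <ffffffffa0269581>{:ics_sdp:Sdp_StopListen+1}",
    "Jan 16 08:59:04 lmexec-92 kernel: Call Trace:<ffffffffa02638bb>{:ics_sdp:sdp_stop_listen+59} <ffffffffa0263af5>{:ics_sdp:sdp_disconnect+149}",
    "Jan 16 08:59:04 lmexec-92 kernel:        <ffffffff8030dede>{inet_shutdown+206} <ffffffff802c0fcc>{sys_shutdown+76}",
]

comm, frames = summarize_oops(sample)
print("process:", comm)
for module, symbol in frames:
    print("module frame:", module, symbol)
```

On these lines the sketch reports sge_shepherd as the process and ics_sdp as the only loadable module in the trace, which matches what can be read directly from the log.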

> type of network card is installed, and which module is loaded for it?
> 
> lsmod
> lspci

Newisys NetXtreme BCM5704 Gigabit Ethernet (the standard one on the Sun V20z),
with the "tg3" module loaded.
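
To cross-check the lsmod output against the module names that appear in the kernel log, something like the following Python sketch could be used. It assumes the usual lsmod layout (a "Module  Size  Used by" header followed by one module per line). The function name loaded_modules is hypothetical, and the sample text, including its sizes, is placeholder data rather than output from this cluster.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names found in lsmod-style text."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header line.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


# Placeholder input for illustration only; sizes and use counts are made up.
sample = """Module                  Size  Used by
ics_sdp               100000  0
tg3                   100000  0
"""

mods = loaded_modules(sample)
for name in ("tg3", "ics_sdp"):
    print(name, "loaded" if name in mods else "not loaded")
```

On a real node the text would come from running lsmod (or reading /proc/modules, whose first column is also the module name).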

Thanks for your help,

Jean-Paul

> might give you some hints. - Reuti
> 
>>> - Is this new, or did it work before? As 9.0 isn't the latest of 9.x,
>>> I'd assume that your cluster has already been in operation for some
>>> time now.
>>
>>
>> It never worked before. The install is new; SGE is configured and more or
>> less working, except for bits and pieces here and there, among them the
>> tight integration for the mpich/ethernet interconnect. I also have
>> trouble with the InfiniBand interconnect integration: the patch for
>> mpich/infiniband and SGE tight integration, available on the HowTo
>> site, doesn't match the version of mpich supplied and customized by
>> the InfiniBand vendor. I am awaiting support from the InfiniBand vendor
>> to get the latest mpich/mvapich version installed and customized.
>>
>> Thanks & regards
>>
>> Jean-Paul
>>
>>> -- Reuti
>>>
>>>> Would someone have an idea on how to further debug the problem? (I
>>>> have tried using tcpdump between the submit host and the target
>>>> host, as well as between the qmaster host and the target host, to dig
>>>> into the communication, but it's getting complicated...)
>>>>
>>>> Thanks for any help
>>>>
>>>> Jean-paul
>>>>
>>>> ---- qrsh command and output ----
>>>> lemaitre /gridware/sge/bin/lx24-amd64 # qrsh -verbose -l mem_free=10M -l num_proc=2 -q all.q@lmexec-92 date
>>>> local configuration lemaitre.cism.ucl.ac.be not defined - using global configuration
>>>> your job 1788 ("date") has been submitted
>>>> waiting for interactive job to be scheduled ...
>>>> Your interactive job 1788 has been successfully scheduled.
>>>> Establishing /gridware/sge/utilbin/lx24-amd64/rsh session to host lmexec-92 ...
>>>> poll: protocol failure in circuit setup
>>>> /gridware/sge/utilbin/lx24-amd64/rsh exited with exit code 1
>>>> reading exit code from shepherd ... 129
>>>>
>>>> -- 
>>>> Jean-Paul Minet
>>>> Gestionnaire CISM - Institut de Calcul Intensif et de Stockage  de  
>>>> Masse
>>>> Université Catholique de Louvain
>>>> Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
>>>>
>>>> -------------------------------------------------------------------- -
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Jean-Paul Minet
Gestionnaire CISM - Institut de Calcul Intensif et de Stockage de Masse
Université Catholique de Louvain
Tel: (32) (0)10.47.35.67 - Fax: (32) (0)10.47.34.52
