[GE users] Nodes dying with "Kernel panic" after upgrade to 6.1u2

Reuti reuti at staff.uni-marburg.de
Mon Sep 3 10:18:39 BST 2007


Hi,

On 03.09.2007 at 10:37, Schenker, Martin wrote:

> We've tried the BIOS upgrade already, no luck. Nodes died again after
> being loaded with the latest BIOS version. Kernel update is currently
> not possible, we're bound to this kernel due to the SFS patch. Next SFS
> release might help, but this will be a while before it's out AND we
> implement it. In the meantime, we'll stick to 6.1.

Okay, that seems to be the only option in your case.

> Is 6.1u1 showing the same behaviour?

6.1u1 was never released.
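
The STATUS words in the panic message and the mcelog excerpt quoted below can also be checked by hand. A minimal sketch in Python, assuming the architectural MCi_STATUS layout for bits 63..57; the name decode_status and the flag table are illustrative only, and the model-specific lower bits (e.g. the Chipkill syndrome) are not decoded:

```python
# Architectural flag bits in the top byte of a machine-check MCi_STATUS value.
# Bit positions follow the common x86 layout; the short names are the usual mnemonics.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status_hex):
    """Return the names of the flag bits set in a hex STATUS string."""
    status = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Values copied from the panic message and the mcelog excerpt in this thread.
    for value in ("b200000000070f0f", "a40000000005001b", "9474400120080813"):
        print(value, decode_status(value))
```

For the panic value b200000000070f0f this reports VAL, UC, EN and PCC, i.e. an uncorrected error, which fits the panic. The Chipkill entry 9474400120080813 has UC clear, in line with mcelog's "corrected ecc error".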

-- Reuti


> Cheers, Martin
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 01 September 2007 09:55
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Nodes dying with "Kernel panic" after
>> upgrade to 6.1u2
>>
>> On 31.08.2007 at 16:59, Schenker, Martin wrote:
>>
>>> That's what we thought as well, but 7 nodes out of 20 with EXACTLY the
>>> same error message?
>>
>> We had just ordered 4 PCs and started with one to perform some
>> local computations using the quantum chemistry application
>> Molpro - after around 2 days the PC froze completely. Okay,
>> we suspected a hardware problem and used the next one - same
>> problem after the same timeframe. And you can guess what
>> happened with the third PC... But it happened only with this
>> application - the others ran fine.
>>
>> So just this one application was triggering something special
>> inside the kernel.
>>
>> This was with a Debian installation and after looking for
>> some updates, we finally upgraded the kernel and then it was
>> working fine on all of the machines. In conclusion: it
>> might really be the case that SGE 6.1u2 is triggering
>> something special inside the kernel which the previous
>> version isn't using at all.
>>
>> Also in your case it's IMO a problem to be fixed in the OS,
>> or maybe even in the BIOS. Do you have any chance to upgrade
>> one or two machines to a newer kernel/OS version?
>>
>>> We've been running the kernel for the last 8 months, it's a Lustre
>>> client kernel (CentOS4.4; 2.6.9-42.0.10.EL_SFS2.2_1smp on an x86_64).
>>>
>>> The only thing we changed yesterday was the upgrade to 6.1u2. Now after
>>> the rollback to 6.1, the system seems to behave again...
>>>
>>> From /var/log/mcelog (only one node, the others are displaying the same
>>> message as well):
>>>
>>>
>>> MCE 5
>>> CPU 0 4 northbridge TSC 31dabe2461ea
>>> ADDR bfc10000
>>>   Northbridge GART error
>>>        bit61 = error uncorrected
>>>   TLB error 'generic transaction, level generic'
>>> STATUS a40000000005001b MCGSTATUS 0
>>>
>>> MCE 0
>>> CPU 2 4 northbridge TSC 285e6a6fa94d2
>>> ADDR a5755780
>>> Northbridge Chipkill ECC error
>>> Chipkill ECC syndrome = 20e8
>>>        bit32 = err cpu0
>>>        bit46 = corrected ecc error
>>>        bus error 'local node origin, request didn't time out
>>>        generic read mem transaction
>>>        memory access, level generic'
>>>  STATUS 9474400120080813 MCGSTATUS 0
>>>
>>> These seem to be the non-lethal errors. With a kernel panic I doubt
>>> that there would be anything except a core file (and there's none).
>>
>> Core files are what you usually get with a segmentation fault of
>> an application. With a kernel panic, the kernel core needs to
>> be dumped to another machine via the network, as you can't write
>> anything locally anymore.
>>
>> -- Reuti
>>
>>
>>>  Best, Martin
>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>>> Sent: 31 August 2007 12:24
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] Nodes dying with "Kernel panic" after upgrade
>>>> to 6.1u2
>>>>
>>>> Hi,
>>>>
>>>> On 31.08.2007 at 12:24, Schenker, Martin wrote:
>>>>
>>>>> Yesterday we upgraded to 6.1u2 from 6.1. Shortly thereafter nodes
>>>>> started to die with "Kernel panic" messages on the screen:
>>>>>
>>>>>
>>>>> CPU 0: Machine Check Exception:	4 Bank 4: b200000000070f0f
>>>>> TSC 4ab5ba5354b
>>>>> Kernel panic - not syncing: Machine check
>>>>
>>>> This looks more like a) a kernel problem or b) a hardware problem.
>>>> Were there any upgrades to the kernel besides the SGE upgrade? Which
>>>> Linux and kernel version are you using?
>>>>
>>>> Is there anything in /var/log/mcelog?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>>
>>>>> We've now rolled back to 6.1 and are still testing. No hang-ups so
>>>>> far... Has anyone seen similar behaviour?
>>>>> We're running AMD64 Opteron (HP DL145) nodes with the
>>>>> sge-6.1-bin-lx24-amd64 code.
>>>>>
>>>>> Best, Martin
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



