[GE users] Nodes dying with "Kernel panic" after upgrade to 6.1u2

Reuti reuti at staff.uni-marburg.de
Sat Sep 1 09:55:04 BST 2007


On 31.08.2007 at 16:59, Schenker, Martin wrote:

> That's what we thought as well, but 7 nodes out of 20 with EXACTLY the
> same error message?

We just ordered 4 PCs and started with one to perform some local
computations using the quantum chemistry application Molpro - after
around 2 days the PC froze completely. Okay, we suspected a hardware
problem and used the next one - same problem after the same
timeframe. And you can guess what happened with the third PC... But
it happened only with this application - the others ran fine.

So just this one application was triggering something special inside  
the kernel.

This was with a Debian installation. After looking for updates, we
finally upgraded the kernel, and then it worked fine on all of the
machines. So it might really be the case that SGE 6.1u2 is triggering
something special inside the kernel which the previous version
doesn't use at all.

In your case too, IMO it's a problem to be fixed in the OS, or maybe
even in the BIOS. Do you have any chance to upgrade one or two
machines to a newer kernel/OS version?

> We've been running the kernel for the last 8 months, it's a Lustre
> client kernel
> (CentOS4.4; 2.6.9-42.0.10.EL_SFS2.2_1smp on an x86_64).
>
> The only thing we changed yesterday was the upgrade to 6.1u2. Now after
> the rollback to 6.1, the system seems to behave again...
>
> From /var/log/mcelog (only one node, the others are displaying the
> same message as well):
>
>
> MCE 5
> CPU 0 4 northbridge TSC 31dabe2461ea
> ADDR bfc10000
>   Northbridge GART error
>        bit61 = error uncorrected
>   TLB error 'generic transaction, level generic'
> STATUS a40000000005001b MCGSTATUS 0
>
> MCE 0
> CPU 2 4 northbridge TSC 285e6a6fa94d2
> ADDR a5755780
> Northbridge Chipkill ECC error
> Chipkill ECC syndrome = 20e8
>        bit32 = err cpu0
>        bit46 = corrected ecc error
>        bus error 'local node origin, request didn't time out
>        generic read mem transaction
>        memory access, level generic'
>  STATUS 9474400120080813 MCGSTATUS 0
>
> These seem to be the non-lethal errors. With a kernel panic I doubt
> that there would be anything except a core file (and there's none).
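
As a side note, the STATUS words quoted above can be decoded by hand, which makes the difference from the panic value easier to see. The following is only a minimal sketch: it looks at the seven flag bits in the top of a machine-check STATUS word (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) and ignores the model-specific fields that mcelog also prints. The function names and the "fatal" rule are my own assumptions for illustration, not what the kernel actually checks.

```python
# Flag bits in the upper part of a machine-check STATUS word.
FLAG_BITS = {
    "VAL": 63,    # entry is valid
    "OVER": 62,   # overflow, an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # MISC register contains valid data
    "ADDRV": 58,  # ADDR register contains a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_status(hex_word):
    """Return the names of the flag bits set in a hex STATUS string."""
    status = int(hex_word, 16)
    return [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]


def looks_fatal(hex_word):
    """Rough heuristic (an assumption, not the kernel's real logic):
    uncorrected, reporting enabled, and processor context corrupt."""
    flags = decode_status(hex_word)
    return all(name in flags for name in ("UC", "EN", "PCC"))


if __name__ == "__main__":
    # The two mcelog entries, then the bank 4 value from the panic message.
    for word in ("a40000000005001b", "9474400120080813", "b200000000070f0f"):
        print(word, decode_status(word), "fatal" if looks_fatal(word) else "non-fatal")
```

Read this way, the GART entry has UC set but EN clear, the Chipkill entry is a corrected error with UC clear, and only the value from the panic screen has UC, EN and PCC set together.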

Core files are what you usually get when an application hits a
segmentation fault. With a kernel panic, the kernel core needs to be
dumped to another machine via the network, as you can't write anything
locally anymore.
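
A full kernel core needs a proper dump facility on both ends, but even just capturing the last console messages on a second machine can help. Below is a minimal sketch of such a collector, assuming the nodes are set up to send their kernel messages as plain UDP datagrams (as the Linux netconsole module does). The port number, the log file name and the function names are assumptions chosen for illustration; this collects text messages only, not a core dump.

```python
import socket
from datetime import datetime, timezone

LISTEN_PORT = 6666  # assumption: must match the port the nodes send to


def format_record(sender_ip, payload, now=None):
    """Turn one UDP datagram into a timestamped, host-tagged log line."""
    now = now or datetime.now(timezone.utc)
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{now.isoformat()} {sender_ip} {text}"


def serve(port=LISTEN_PORT, logfile="node-console.log"):
    """Append every datagram received on `port` to `logfile` until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a", encoding="utf-8") as log:
        sock.bind(("", port))
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            log.write(format_record(sender_ip, payload) + "\n")
            log.flush()  # the sending node may die at any moment
```

Calling `serve()` on the collecting machine keeps it running until interrupted. Each line is tagged with the sender's IP, so messages from several dying nodes can be told apart afterwards.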

-- Reuti


>  Best, Martin
>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 31 August 2007 12:24
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Nodes dying with "Kernel panic" after
>> upgrade to 6.1u2
>>
>> Hi,
>>
>> On 31.08.2007 at 12:24, Schenker, Martin wrote:
>>
>>> Yesterday we upgraded to 6.1u2 from 6.1. Shortly thereafter nodes
>>> started to die with "Kernel panic" messages on the screen:
>>>
>>>
>>> CPU 0: Machine Check Exception:	4 Bank 4: b200000000070f0f
>>> TSC 4ab5ba5354b
>>> Kernel panic - not syncing: Machine check
>>
>> this looks more like a) a kernel problem or b) a hardware problem.
>> Were there any upgrades to the kernel besides the SGE upgrade? Which
>> Linux and kernel version are you using?
>>
>> Is there anything in /var/log/mcelog?
>>
>> -- Reuti
>>
>>
>>> _
>>>
>>> We've now rolled back to 6.1 and are still testing. No hang-ups so
>>> far... Has anyone seen a similar behaviour?
>>> We're running AMD64 Opterons (HP DL145) nodes with the
>>> sge-6.1-bin-lx24-amd64 code.
>>>
>>> Best, Martin
>>>
>>>
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
