[GE users] Nodes dying with "Kernel panic" after upgrade to 6.1u2

Schenker, Martin MSchenker at illumina.com
Mon Sep 3 09:37:09 BST 2007


Hi Reuti!

We've tried the BIOS upgrade already, no luck. The nodes died again after
being loaded with the latest BIOS version. A kernel update is currently
not possible; we're bound to this kernel due to the SFS patch. The next SFS
release might help, but it will be a while before it's out AND we've
implemented it. In the meantime, we'll stick to 6.1.

Is 6.1u1 showing the same behaviour?

Cheers, Martin

> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 01 September 2007 09:55
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Nodes dying with "Kernel panic" after 
> upgrade to 6.1u2
> 
> On 31.08.2007 at 16:59, Schenker, Martin wrote:
> 
> > That's what we thought as well, but 7 nodes out of 20 with EXACTLY the
> > same error message?
> 
> We just ordered 4 PCs and started with one to perform some
> local computations using the quantum chemistry application
> Molpro - after around 2 days the PC froze completely. Okay,
> we suspected a hardware problem and used the next one - same
> problem after the same timeframe. And you can guess what
> happened with the third PC... But it happened only with this
> application - the others ran fine.
> 
> So just this one application was triggering something special 
> inside the kernel.
> 
> This was with a Debian installation and after looking for
> some updates, we finally upgraded the kernel and then it was
> working fine on all of the machines. In conclusion: it
> might really be the case that SGE 6.1u2 is triggering
> something special inside the kernel which the previous
> version doesn't use at all.
> 
> Also in your case it's IMO a problem to be fixed in the OS, 
> or maybe even in the BIOS. Do you have any chance to upgrade 
> one or two machines to a newer kernel/OS version?
> 
> > We've been running the kernel for the last 8 months, it's a Lustre
> > client kernel (CentOS4.4; 2.6.9-42.0.10.EL_SFS2.2_1smp on an x86_64).
> >
> > The only thing we changed yesterday was the upgrade to 6.1u2. Now after
> > the rollback to 6.1, the system seems to behave again...
> >
> > From /var/log/mcelog (only one node, the others are displaying the same
> > message as well):
> >
> >
> > MCE 5
> > CPU 0 4 northbridge TSC 31dabe2461ea
> > ADDR bfc10000
> >   Northbridge GART error
> >        bit61 = error uncorrected
> >   TLB error 'generic transaction, level generic'
> > STATUS a40000000005001b MCGSTATUS 0
> >
> > MCE 0
> > CPU 2 4 northbridge TSC 285e6a6fa94d2
> > ADDR a5755780
> > Northbridge Chipkill ECC error
> > Chipkill ECC syndrome = 20e8
> >        bit32 = err cpu0
> >        bit46 = corrected ecc error
> >        bus error 'local node origin, request didn't time out
> >        generic read mem transaction
> >        memory access, level generic'
> >  STATUS 9474400120080813 MCGSTATUS 0
> >
> > These seem to be the non-lethal errors. With a kernel panic I doubt 
> > that there would be anything except a core file (and there's none).
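
Note on the STATUS words quoted in this thread: a minimal sketch, in Python, of how the top flag bits of such a 64-bit machine-check status value can be read. The bit positions (63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) follow the x86 machine-check architecture as I understand it; the names FLAGS and decode_status are made up for this illustration, and the vendor-specific middle bits (Chipkill syndrome etc.) are deliberately not decoded.

```python
# Architectural flag bits in the upper byte of an x86 machine-check
# status word (assumed layout; vendor-specific bits are ignored here).
FLAGS = {
    63: "VAL",    # record is valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(hex_word):
    """Return the set flags and the low 16-bit error code of a status word."""
    value = int(hex_word, 16)
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    return {"flags": flags, "error_code": value & 0xFFFF}


if __name__ == "__main__":
    # The three status values that appear in this thread.
    for word in ("a40000000005001b", "9474400120080813", "b200000000070f0f"):
        info = decode_status(word)
        print(word, info["flags"], hex(info["error_code"]))
```

Read this way, the two mcelog records carry no PCC flag, while the value from the panic screen (b200000000070f0f) has both UC and PCC set, which would fit the guess that the logged ones are the non-lethal errors.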
> 
> Core files are what you usually get from a segmentation fault
> in an application. With a kernel panic, the kernel core needs to
> be dumped to another machine over the network, as you can't write
> anything locally anymore.
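
Note: a rough sketch of the receiving side of such a setup, assuming the nodes can be made to send their kernel messages as plain text over UDP (for example with the Linux netconsole module). This only captures console text such as the panic line, not a full kernel core dump. Port 6666 and the names open_listener and collect are assumptions for illustration, not something taken from this thread.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on the log host (port number is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect(sock, count, timeout=5.0):
    """Receive up to `count` datagrams; return (sender_ip, text) pairs.

    Stops early if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    records = []
    try:
        while len(records) < count:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace").rstrip("\n")
            records.append((sender_ip, text))
    except socket.timeout:
        pass
    return records


if __name__ == "__main__":
    listener = open_listener()
    # Print whatever arrives in the next minute, tagged with the sender.
    for sender_ip, text in collect(listener, count=1000, timeout=60.0):
        print(sender_ip, text)
```

Because the sender address is kept with each message, the last lines from a node before it dies can be matched to that node afterwards.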
> 
> -- Reuti
> 
> 
> >  Best, Martin
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: 31 August 2007 12:24
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] Nodes dying with "Kernel panic" after upgrade
> >> to 6.1u2
> >>
> >> Hi,
> >>
> >> On 31.08.2007 at 12:24, Schenker, Martin wrote:
> >>
> >>> Yesterday we upgraded to 6.1u2 from 6.1. Shortly thereafter nodes 
> >>> started to die with "Kernel panic" messages on the screen:
> >>>
> >>>
> >>> CPU 0: Machine Check Exception:	4 Bank 4: b200000000070f0f
> >>> TSC 4ab5ba5354b
> >>> Kernel panic - not syncing: Machine check
> >>
> >> This looks more like a) a kernel problem or b) a hardware problem.
> >> Any upgrades to the kernel besides the SGE upgrade - which Linux and
> >> kernel version are you using?
> >>
> >> Is there anything in /var/log/mcelog?
> >>
> >> -- Reuti
> >>
> >>
> >>> _
> >>>
> >>> We've now rolled back to 6.1 and are still testing. No hang-ups so
> >>> far... Has anyone seen similar behaviour?
> >>> We're running AMD64 Opteron (HP DL145) nodes with the
> >>> sge-6.1-bin-lx24-amd64 code.
> >>>
> >>> Best, Martin
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>
> >>
> >>
> >>
> >
> >
> 
> 
> 




