Help me understand ECC errors on 6,1 nMP

Question

We're having graphics glitches, stalls, and restart issues with one of our 6,1 nMPs. In short: can a bad logic board cause faulty RAM / ECC error reports?

To troubleshoot, I clean-installed a few different versions of OS X, and the problems persisted. I then ran the Apple Diagnostics test, which came back clean.

AppleCare (via phone, so they haven't touched the machine yet to run their own tests) has diagnosed it as a likely logic board issue, but I later noticed that 3 of the 4 DIMM slots showed ECC errors.

I re-seated all 4 DIMMs and ran 2 memtests (one in OS X and one in Windows via Boot Camp); both reported no issues.

But I've read two things that confuse me: these memtests don't work well for ECC RAM, and once you get ECC errors, that indicates a RAM hardware issue. So while re-seating may solve the issue temporarily, the errors will return, i.e. the RAM is still defective.

Or could my memory truly be OK, and it is a bad logic board that's spitting out inaccurate error reports?

I'm trying to understand whether we should be replacing the logic board or the RAM.

Mac Pro (Late 2013), OS X El Capitan (10.11.2)

Posted on Jan 15, 2016 3:25 AM

Answer 1

Jan 15, 2016 6:57 AM in response to turbostar

Most likely it's either getting too hot inside, or you have bad DIMMs. These DIMMs are much larger, and therefore have many more transistors in them than, say, PowerPC-era DIMMs, and that makes them more failure-prone.

Answer 2

Jan 15, 2016 1:46 PM in response to turbostar

A Mac Pro with a Xeon processor includes hardware support for Error Correcting Code (ECC) memory modules. These modules carry eight additional bits used as check (syndrome) bits. On write, the Xeon processor computes parity over carefully selected subsets of the data bits and stores the results in those extra bits. On read, single-bit errors are detected, and the syndrome determines exactly which bit is in error. Such errors are corrected on the fly in one stretched memory cycle, and they also set a status bit, which is later read by a background process and tabulated.

Most double-bit errors cannot be corrected, so they cause the processor to halt with a distinctive kernel panic (a machine check), often detected by multiple processors. This is by design, to keep the error from poisoning your data.
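To make the single-bit-correct / double-bit-detect behavior concrete, here is a minimal Python sketch of an extended Hamming code. It is only an illustration: it protects 4 data bits with 4 check bits, whereas the ECC DIMMs discussed here use eight check bits over a much wider word, and the exact code the Xeon memory controller uses is not shown. The function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8                      # index 0 holds the overall parity bit
    data = [(nibble >> i) & 1 for i in range(4)]
    code[3], code[5], code[6], code[7] = data
    # Parity bits live at positions 1, 2 and 4; each covers the positions
    # whose index has that bit set.
    for p in (1, 2, 4):
        for i in range(1, 8):
            if i & p and i != p:
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2         # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i               # XOR of the positions of all set bits
    overall = sum(code) % 2             # 0 when overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error. The syndrome is
        # the position of the bad bit (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                            # simulate one flipped bit
print(decode(word))                     # status 'corrected', data restored
word[2] ^= 1                            # simulate a second flipped bit
print(decode(word))                     # status 'uncorrectable'
```

The "corrected" branch corresponds to the silent fix-and-count case, and the "uncorrectable" branch is the case where real hardware halts with a machine check rather than hand back bad data.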

If you are getting kernel panics that report a machine check detected by multiple processors, along with a few other indicators, you have some DIMMs that are bad in their current state. Re-seating the modules is meant to clear corrosion on the contacts. This is seldom effective with these DIMMs, because corrosion is rarely the problem.

The System Report, accessible through About This Mac, can show you the current conditions, including a snapshot of the error counters for each DIMM:

This graphic from anandtech.com is from an older Mac, but yours will be similar except for the references to risers and FB-DIMMs, which are not used in your Late 2013 model. Its information is STATIC, and you need to open it again to get fresh data.
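Because each view of the report is only a snapshot, the useful signal is whether the counters grow between two readings. As a rough sketch (the slot names and counts below are invented, and `new_ecc_errors` is just a hypothetical helper, not anything Apple provides), you could note the numbers down twice and compare them:

```python
def new_ecc_errors(before, after):
    """Return {slot: increase} for slots whose error count grew between snapshots."""
    growth = {}
    for slot, count in after.items():
        delta = count - before.get(slot, 0)
        if delta > 0:
            growth[slot] = delta
    return growth


# Hypothetical counts copied by hand from two separate looks at System Report.
morning = {"DIMM 1": 0, "DIMM 2": 3, "DIMM 3": 1, "DIMM 4": 0}
evening = {"DIMM 1": 0, "DIMM 2": 9, "DIMM 3": 1, "DIMM 4": 2}

print(new_ecc_errors(morning, evening))  # {'DIMM 2': 6, 'DIMM 4': 2}
```

A slot whose count keeps climbing under your normal workload is the one to suspect first.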

There is no better memory test available than having your Xeon processor check each and every read from memory. There is no more important pattern than your normal workflow. Artificial memory tests may or may not understand that error correction hardware is being used, so their result is typically "No Fault Found", because single-bit errors are corrected before the test ever sees them.

The error correction hardware is used aggressively at startup. ANY errors (correctable or not) that occur during those few seconds of the Power-On Self Test cause the slot to be declared "empty". The operating system will not use that DIMM. At the next power-on, the test may pass, but that does not mean the module has healed and is fine now. It is still bad.

Answer 3

turbostar Author

Jan 15, 2016 8:28 PM in response to Grant Bennet-Alder

Very informative, thank you.

Apple has the machine now. It looks like they are going to replace the logic board and the GPU, depending on their testing. We'll see.

Answer 4

Mar 30, 2017 10:44 PM in response to turbostar

Hi,

I'm experiencing the same exact issue you're describing with the same exact computer. Were you able to solve it? Was it the logic board after all causing this issue?

Thanks!
