Help me understand ECC errors on 6,1 nMP

We're having graphics glitches, stalls, and restart issues with one of our 6,1 nMPs. In short, can a bad logic board cause faulty RAM / ECC error reports?


To troubleshoot, I clean-installed a few different versions of OS X, and the problems persisted. I then ran the Apple diagnostic test, which came back clean.


AppleCare (via phone, so they haven't touched the machine yet to run their own tests) has diagnosed it as a likely logic board issue, but I later noticed that 3 of the 4 DIMM slots showed ECC errors.


I re-seated all 4 DIMMs and ran 2 memtests (one in OS X, one in Windows via Boot Camp); both reported no issues.


But I've read 2 things that confuse me: these memtests don't work well for ECC RAM, and once you get ECC errors, that's indicative of a RAM hardware issue. So while re-seating may solve the issue temporarily, the errors will return, i.e. the RAM is still defective.


Or could my memory truly be OK, and it's a bad logic board that's spitting out inaccurate error reports?


I'm trying to understand whether we should be replacing the logic board or the RAM.

Mac Pro (Late 2013), OS X El Capitan (10.11.2)

Posted on Jan 15, 2016 3:25 AM

4 replies

Jan 15, 2016 1:46 PM in response to turbostar

A Mac Pro with a Xeon processor has hardware support for Error Correcting Code (ECC) memory modules. These modules store eight additional bits alongside every 64 bits of data, used as check bits. On a write, the Xeon processor computes parity over carefully selected subsets of the bits and stores the results in those extra bits. On a read, the parity is recomputed; the resulting syndrome detects a single-bit error and identifies exactly which bit is wrong. The error is corrected on the fly in one stretched memory cycle, and a status bit is set, which a background process later reads and tabulates.


Most double-bit errors cannot be corrected, so they halt the processor with a distinctive kernel panic (a machine check), often detected by multiple processors. This is by design, to keep the error from poisoning your data.
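
To make the single-bit versus double-bit behaviour concrete, here is a minimal sketch of a textbook extended Hamming (SECDED) code with 64 data bits and 8 check bits. It is only an illustration: the function names `encode` and `decode` are made up for this example, and the actual check-bit layout used by the Xeon memory controller is assumed to differ.

```python
# Textbook extended Hamming SECDED code: 64 data bits + 8 check bits = 72 bits.
# Assumption: a real Xeon memory controller uses its own check matrix; this
# sketch only shows the principle of syndrome-based correction.

CODE_BITS = 71                               # Hamming positions 1..71
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]  # power-of-two positions hold parity


def encode(data):
    """Encode a 64-bit integer into a 72-entry bit list (index 0 = overall parity)."""
    word = [0] * (CODE_BITS + 1)
    bit = 0
    for pos in range(1, CODE_BITS + 1):
        if pos & (pos - 1):                  # not a power of two: data position
            word[pos] = (data >> bit) & 1
            bit += 1
    for p in PARITY_POSITIONS:               # parity over positions with bit p set
        parity = 0
        for pos in range(1, CODE_BITS + 1):
            if pos & p:
                parity ^= word[pos]
        word[p] = parity
    for pos in range(1, CODE_BITS + 1):      # overall parity over the whole word
        word[0] ^= word[pos]
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)                        # work on a copy
    syndrome = 0
    overall = 0
    for pos, b in enumerate(word):
        if b:
            syndrome ^= pos                  # position 0 contributes nothing
            overall ^= 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= CODE_BITS:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None
    data, bit = 0, 0
    for pos in range(1, CODE_BITS + 1):
        if pos & (pos - 1):
            data |= word[pos] << bit
            bit += 1
    return status, data


stored = encode(0xDEADBEEFCAFEF00D)
one_flip = list(stored)
one_flip[37] ^= 1                            # simulate a single-bit error
two_flips = list(stored)
two_flips[5] ^= 1                            # simulate a double-bit error
two_flips[60] ^= 1
print(decode(stored)[0], decode(one_flip)[0], decode(two_flips)[0])
```

In the single-flip case `decode` hands back the original value, so software reading that word never sees anything wrong; only the status changes. In the double-flip case the code can tell the word is damaged but not where, which corresponds to the machine check described above.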


If you are getting kernel panics that report a machine check detected by multiple processors, along with a few other indicators, you have some DIMMs that are bad in their current state. Re-seating the modules is meant to clear corrosion on the contacts. It is seldom effective with these DIMMs, because corrosion is rarely the problem.


The System Report, accessible through About This Mac, can show you the current conditions, including a snapshot of the error counters for each DIMM:

User uploaded file


This graphic from anandtech.com is from an older Mac, but yours will be similar except for the references to risers and FBDIMMs, which are not used in your Late 2013 model. The System Report is STATIC: it is a snapshot, and you need to open it again to get fresh data.


There is no better memory test available than having your Xeon processor check each and every read from memory. There is no more important pattern than your normal workflow. Artificial memory tests may or may not understand that error correction hardware is being used, so their result is typically "No Fault Found", because single-bit errors are corrected before the test ever sees them.


The error correction hardware is used aggressively at startup. ANY errors (correctable or not) that occur during those few seconds of the Power-On Self Test cause the slot to be declared "empty", and the operating system will not use that DIMM. At the next power-on the test may pass, but that does not mean the module has healed and is fine now -- it is still bad.
