bithead2

Q: Mac Pro 2009 hangs intermittently

I have a Mac Pro (early 2009) that has lived a very cushy life (because I don't use it a lot). On 10.8.5 (Mountain Lion) it started hanging intermittently. So I installed an SSD, did a clean install of Mountain Lion on it, and ran updates. It still hangs intermittently, so the OS installation and disk drives aren't the problem.

 

Here is the trace. What is the likely issue?

 

Thanks,

CJ

 

Interval Since Last Panic Report:  53 sec

Panics Since Last Report:          1

Anonymous UUID:                    4D4C7BB1-ECDA-A646-91B0-47B042CFF6AD

 

Sun Oct  4 03:22:26 2015

Machine-check capabilities 0x0000000000001c09:

family: 6 model: 26 stepping: 5 microcode: 17

Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz

9 error-reporting banks

threshold-based error status present

extended corrected memory error handling present

Processor 0: no valid machine-check state

Processor 1: no valid machine-check state

Processor 2: no valid machine-check state

Processor 3: no valid machine-check state

Processor 4: no valid machine-check state

Processor 5: no valid machine-check state

Processor 6: no valid machine-check state

Processor 7: no valid machine-check state

Processor 8: machine-check status 0x0000000000000004:

machine-check in progress

MCA error-reporting registers:

IA32_MC0_STATUS(0x401): 0x0000000000000800 invalid

IA32_MC1_STATUS(0x405): 0xbe00000000400e0f valid

  MCA error code:            0x0e0f

  Model specific error code: 0x0040

  Other information:         0x00000000

  Threshold-based status:    Undefined

  Status bits:

   Processor context corrupt

   ADDR register valid

   MISC register valid

   Error enabled

   Uncorrected error

IA32_MC1_ADDR(0x406): 0x000000e1e7e1e000

IA32_MC1_MISC(0x407): 0x0000000001000000

IA32_MC2_STATUS(0x409): 0x0000000000000000 invalid

IA32_MC3_STATUS(0x40d): 0x0000000000000000 invalid

IA32_MC4_STATUS(0x411): 0x0000000000000000 invalid

IA32_MC5_STATUS(0x415): 0x0000000000000000 invalid

IA32_MC6_STATUS(0x419): 0x0000000000000000 invalid

IA32_MC7_STATUS(0x41d): 0x0000000000000000 invalid

IA32_MC8_STATUS(0x421): 0x0000000000000000 invalid

Processor 9: machine-check status 0x0000000000000004:

machine-check in progress

MCA error-reporting registers:

IA32_MC0_STATUS(0x401): 0x0000000000000800 invalid

IA32_MC1_STATUS(0x405): 0xbe00000000400e0f valid

  MCA error code:            0x0e0f

  Model specific error code: 0x0040

  Other information:         0x00000000

  Threshold-based status:    Undefined

  Status bits:

   Processor context corrupt

   ADDR register valid

   MISC register valid

   Error enabled

   Uncorrected error

IA32_MC1_ADDR(0x406): 0x000000e1e7e1e000

IA32_MC1_MISC(0x407): 0x0000000001000000

IA32_MC2_STATUS(0x409): 0x0000000000000000 invalid

IA32_MC3_STATUS(0x40d): 0x0000000000000000 invalid

IA32_MC4_STATUS(0x411): 0x0000000000000000 invalid

IA32_MC5_STATUS(0x415): 0x0000000000000000 invalid

IA32_MC6_STATUS(0x419): 0x0000000000000000 invalid

IA32_MC7_STATUS(0x41d): 0x0000000000000000 invalid

IA32_MC8_STATUS(0x421): 0x0000000000000000 invalid

Processor 10: machine-check status 0x0000000000000004:

machine-check in progress

MCA error-reporting registers:

IA32_MC0_STATUS(0x401): 0x0000000000000800 invalid

IA32_MC1_STATUS(0x405): 0xbe00000000400e0f valid

  MCA error code:            0x0e0f

  Model specific error code: 0x0040

  Other information:         0x00000000

  Threshold-based status:    Undefined

  Status bits:

   Processor context corrupt

   ADDR register valid

   MISC register valid

   Error enabled

   Uncorrected error

IA32_MC1_ADDR(0x406): 0x000000e1e7e1e000

IA32_MC1_MISC(0x407): 0x0000000001000000

IA32_MC2_STATUS(0x409): 0x0000000000000000 invalid

IA32_MC3_STATUS(0x40d): 0x0000000000000000 invalid

IA32_MC4_STATUS(0x411): 0x0000000000000000 invalid

IA32_MC5_STATUS(0x415): 0x0000000000000000 invalid

IA32_MC6_STATUS(0x419): 0x0000000000000000 invalid

IA32_MC7_STATUS(0x41d): 0x0000000000000000 invalid

IA32_MC8_STATUS(0x421): 0x0000000000000000 invalid

Processor 11: machine-check status 0x0000000000000004:

machine-check in progress

MCA error-reporting registers:

IA32_MC0_STATUS(0x401): 0x0000000000000800 invalid

IA32_MC1_STATUS(0x405): 0xbe00000000400e0f valid

  MCA error code:            0x0e0f

  Model specific error code: 0x0040

  Other information:         0x00000000

  Threshold-based status:    Undefined

  Status bits:

   Processor context corrupt

   ADDR register valid

   MISC register valid

   Error enabled

   Uncorrected error

IA32_MC1_ADDR(0x406): 0x000000e1e7e1e000

IA32_MC1_MISC(0x407): 0x0000000001000000

IA32_MC2_STATUS(0x409): 0x0000000000000000 invalid

IA32_MC3_STATUS(0x40d): 0x0000000000000000 invalid

IA32_MC4_STATUS(0x411): 0x0000000000000000 invalid

IA32_MC5_STATUS(0x415): 0x0000000000000000 invalid

IA32_MC6_STATUS(0x419): 0x0000000000000000 invalid

IA32_MC7_STATUS(0x41d): 0x0000000000000000 invalid

IA32_MC8_STATUS(0x421): 0x0000000000000000 invalid

Processor 12: machine-check st

Model: MacPro4,1, BootROM MP41.0081.B07, 8 processors, Quad-Core Intel Xeon, 2.26 GHz, 32 GB, SMC 1.39f5

Graphics: NVIDIA GeForce GT 120, NVIDIA GeForce GT 120, PCIe, 512 MB

Memory Module: DIMM 1, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A445A533531323732505A3147344631

Memory Module: DIMM 2, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A425A53353132373250593147344431

Memory Module: DIMM 3, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A425A53353132373250593147344431

Memory Module: DIMM 4, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A425A53353132373250593147344431

Memory Module: DIMM 5, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A425A53353132373250593147344431

Memory Module: DIMM 6, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A425A53353132373250593147344431

Memory Module: DIMM 7, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A445A533531323732505A3147344631

Memory Module: DIMM 8, 4 GB, DDR3 ECC, 1066 MHz, 0x802C, 0x33364A445A533531323732505A3147344631

AirPort: spairport_wireless_card_type_airport_extreme (0x14E4, 0x8E), Broadcom BCM43xx 1.0 (5.106.98.100.17)

Bluetooth: Version 6.1.7f5 15859, 3 service, 21 devices, 3 incoming serial ports

Network Service: Wi-Fi, AirPort, en2

PCI Card: NVIDIA GeForce GT 120, sppci_displaycontroller, Slot-1

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-2@8,4,0

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-2@8,5,0

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-2@8,6,0

PCI Card: pci1a00,1, sppci_othermultimedia, Slot-4

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-3@4,4,0

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-3@4,5,0

PCI Card: pci1057,3410, sppci_othermultimedia, Slot-3@4,6,0

Serial ATA Device: HL-DT-ST DVD-RW GH41N

Serial ATA Device: Samsung SSD 850 EVO 500GB, 500.11 GB

Serial ATA Device: Hitachi HDS722020ALA330, 2 TB

Serial ATA Device: Hitachi HDS723030ALA640, 3 TB

Serial ATA Device: WDC WD740GD-00FLC0, 74.36 GB

USB Device: Keyboard Hub, apple_vendor_id, 0x1006, 0xfd500000 / 3

USB Device: Kensington Expert Mouse, 0x047d  (Kensington), 0x1020, 0xfd530000 / 7

USB Device: Apple Keyboard, apple_vendor_id, 0x0220, 0xfd520000 / 6

USB Device: hub_device, 0x0409  (NEC Corporation), 0x005a, 0xfd300000 / 2

USB Device: v125w, 0x03f0  (Hewlett Packard), 0x3307, 0xfd310000 / 5

USB Device: hub_device, 0x0409  (NEC Corporation), 0x005a, 0xfd340000 / 4

USB Device: hub_device, apple_vendor_id, 0x9102, 0x1a200000 / 2

USB Device: hub_device, apple_vendor_id, 0x9118, 0x1a210000 / 3

USB Device: iLok, 0x088e, 0x5036, 0x1a211000 / 5

USB Device: Studio Display, apple_vendor_id, 0x9218, 0x1a213000 / 4

USB Device: BRCM2046 Hub, 0x0a5c  (Broadcom Corp.), 0x4500, 0x5a100000 / 2

USB Device: Bluetooth USB Host Controller, apple_vendor_id, 0x8215, 0x5a110000 / 3

FireWire Device: built-in_hub, 800mbit_speed

Mac Pro, OS X Mountain Lion (10.8.5)

Posted on Oct 4, 2015 3:45 AM

  • by lllaass,Helpful

    lllaass lllaass Oct 4, 2015 11:02 AM in response to bithead2
    Level 10 (188,811 points)
    Desktops
    Oct 4, 2015 11:02 AM in response to bithead2

    A machine check typically means a hardware problem. Since multiple CPUs are reporting problems, that typically means a memory problem. Open System Profiler:

    OS X: About System Information and System Profiler - Apple Support

    Look under Memory. Do all the sticks say OK?

    You will have to check again periodically, since the display only updates when you open System Profiler.
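
    For readers who want to check the trace by hand, here is a minimal sketch that decodes the IA32_MC1_STATUS value from the panic report. It assumes the architectural bit layout documented in Intel's Software Developer's Manual (flag bits 57 to 63, MCA error code in bits 0 to 15, model-specific code in bits 16 to 31). The function and dictionary names are made up for illustration.

```python
# Illustrative decoder for an IA32_MCi_STATUS value.
# Bit positions follow the architectural layout in Intel's SDM;
# the names below are hypothetical, chosen for this example only.

FLAG_BITS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into its main fields."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "flags": flags,
    }


if __name__ == "__main__":
    # Value reported for processors 8 through 11 in the trace above.
    info = decode_mc_status(0xBE00000000400E0F)
    print("MCA error code:            0x%04x" % info["mca_error_code"])
    print("Model specific error code: 0x%04x" % info["model_specific_code"])
    for flag in info["flags"]:
        print("  " + flag)
```

    The decoded fields agree with what the panic report already prints (error code 0x0e0f, model-specific code 0x0040, uncorrected error, processor context corrupt), so this only confirms the report; it does not identify a DIMM.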

  • by Grant Bennet-Alder,Helpful

    Grant Bennet-Alder Grant Bennet-Alder Oct 4, 2015 11:02 AM in response to bithead2
    Level 9 (60,931 points)
    Desktops
    Oct 4, 2015 11:02 AM in response to bithead2

    lllaass has it right. This is a RAM error. As is typical, the trace does not conclusively identify which module is at fault.

     

    The error-correction hardware built into the Xeon processor can correct single-bit errors on the fly. Double-bit errors cause a halt with a kernel panic (like the one you posted) to avoid poisoning your data. This is what it might look like if you are accumulating errors on a DIMM:

     

    eccerrors2.jpg

     

    -- graphic from anandtech.com

     

    DIMMs that are accumulating errors are bad and must be replaced, because one additional error in the same word will halt the machine with a kernel panic again.
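
    To make the single-bit versus double-bit distinction concrete, here is a toy sketch of a SECDED (single error correct, double error detect) code. It assumes a Hamming(7,4) code plus one overall parity bit over 4 data bits. Real ECC memory protects much wider words, and this is not a description of the Xeon's actual circuitry, only the same principle at small scale.

```python
# Toy SECDED code: Hamming(7,4) plus one overall parity bit.
# Index 0 holds the overall parity; indices 1..7 are the Hamming positions.


def encode(data4):
    """Encode 4 data bits into an 8-bit SECDED word."""
    d1, d2, d3, d4 = data4
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes the total parity even
    return word


def decode(word):
    """Return (status, data4); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [word[3], word[5], word[6], word[7]]


if __name__ == "__main__":
    good = encode([1, 0, 1, 1])

    one_flip = list(good)
    one_flip[5] ^= 1
    print(decode(one_flip))   # single-bit error: data is recovered

    two_flips = list(good)
    two_flips[5] ^= 1
    two_flips[6] ^= 1
    print(decode(two_flips))  # double-bit error: detected, not recoverable
```

    The second case is the analogue of the panic above: the hardware knows the word is wrong but cannot tell what it should have been, so stopping is the only safe response.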

  • by bithead2,

    bithead2 bithead2 Oct 4, 2015 11:03 AM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 4, 2015 11:03 AM in response to Grant Bennet-Alder

    Profiler says OK for all DIMMs. I'll proceed on the assumption that this is a memory error and remove half of it or something to try and narrow down which group it is. Thanks very much for the tip! The machine has been up now for 6+ hours, which is longer than I've ever seen after this problem occurred.

     

    CJ

  • by Grant Bennet-Alder,

    Grant Bennet-Alder Grant Bennet-Alder Oct 4, 2015 11:29 AM in response to bithead2
    Level 9 (60,931 points)
    Desktops
    Oct 4, 2015 11:29 AM in response to bithead2

    That report is STATIC, so it must be invoked again to get new data.

     

    Remember that the Hardware is watching each and every Read from RAM Memory, and noting any corrections that need to be made. Eventually, it will spot the problem.

     

    At startup, the error-correction hardware is used very aggressively. ANY problem, correctable or not, causes the module slot to be declared "Empty". (Such modules are bad, but the next time one is tested it may not fail and could be used again.)

     

    Half-interval search (dividing the suspect group in half again and again) is a good method for finding which group contains the fault, and successively refining until you find the module.

     

    One downside is that removing half the memory makes a very subtle change to memory timing.
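
    The half-interval search can be sketched as follows. This is only an outline of the procedure: it assumes exactly one bad module and a reliable test, and `group_fails` is a hypothetical stand-in for "install only these modules, use the machine, and see whether it panics".

```python
# Half-interval search over a list of memory modules.
# Assumptions (not from the thread): exactly one module is bad, and
# group_fails(subset) reliably reports whether that subset misbehaves.


def find_bad_module(modules, group_fails):
    """Narrow a list of suspects down to one by halving it repeatedly."""
    suspects = list(modules)
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half = suspects[:middle]
        if group_fails(first_half):
            suspects = first_half         # the fault is in the tested half
        else:
            suspects = suspects[middle:]  # otherwise it is in the other half
    return suspects[0]


if __name__ == "__main__":
    dimms = ["DIMM %d" % n for n in range(1, 9)]
    pretend_bad = "DIMM 6"                # made-up value for the demo only
    found = find_bad_module(dimms, lambda group: pretend_bad in group)
    print(found)
```

    With 8 modules this takes three rounds. In practice the failure is intermittent, so a round that "passes" is weak evidence and each round may need to run for a long time before you trust it.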

  • by bithead2,

    bithead2 bithead2 Oct 4, 2015 12:30 PM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 4, 2015 12:30 PM in response to Grant Bennet-Alder

    Is there some memory test software that could be used to accelerate the discovery of failures?  The machine has now been up 7 hours without the error. This morning, it crashed twice, once in 50 minutes and once in 10 minutes. So it's highly variable.

    CJ

  • by Grant Bennet-Alder,

    Grant Bennet-Alder Grant Bennet-Alder Oct 4, 2015 12:55 PM in response to bithead2
    Level 9 (60,931 points)
    Desktops
    Oct 4, 2015 12:55 PM in response to bithead2

    Since the Hardware error correction is watching every read from RAM, running an artificial memory test is not any more productive than running the things you usually run.

     

    In addition, most memory tests will pass with flying colors (because they don't understand that memory error correction is fixing any problems that may be occurring).

     

    It certainly would be nice if, once a memory error occurred, the module would just fall over dead and be immediately detectable. But that is not the nature of these problems. Sometimes you can run detection software for days and find nothing. As you said, it is highly variable.

  • by bithead2,

    bithead2 bithead2 Oct 4, 2015 1:00 PM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 4, 2015 1:00 PM in response to Grant Bennet-Alder

    Thanks. It's really hard to keep 32GB of RAM busy, which is why I was hoping for the memory test. But thanks again.

  • by bithead2,

    bithead2 bithead2 Oct 4, 2015 1:48 PM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 4, 2015 1:48 PM in response to Grant Bennet-Alder

    OK here is what I did. I found this:  OS X Mountain Lion: Use Apple Diagnostics or Apple Hardware Test

     

    Then I ran those tests using the CD that came with my Mac. After 20 minutes it said:

     

    Apple Hardware Test has detected an error.

     

    4MEM/9/40000006: 0x712cf298

     

    It seems to have quit testing right there.

     

    Does this mean that "DIMM 4", as identified on the Memory page of the System Profiler utility, is the bad one? There are 8 4GB DIMMs installed.

     

    Thanks, Chris

  • by Grant Bennet-Alder,

    Grant Bennet-Alder Grant Bennet-Alder Oct 4, 2015 2:01 PM in response to bithead2
    Level 9 (60,931 points)
    Desktops
    Oct 4, 2015 2:01 PM in response to bithead2

    All of those messages start with 4. We have seen 4SNS fairly often.

     

    Apple has never made the interpretation of those numbers available. It is possible, but unlikely, that the Genius Bar knows what those numbers mean. (When I have inquired about diagnostic information, the Genius knew less about the numbers than I did.)

  • by bithead2,

    bithead2 bithead2 Oct 4, 2015 2:07 PM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 4, 2015 2:07 PM in response to Grant Bennet-Alder

    Ugh that *****.

  • by lllaass,

    lllaass lllaass Oct 4, 2015 2:14 PM in response to bithead2
    Level 10 (188,811 points)
    Desktops
    Oct 4, 2015 2:14 PM in response to bithead2
  • by bithead2,

    bithead2 bithead2 Oct 14, 2015 6:30 PM in response to lllaass
    Level 1 (0 points)
    Oct 14, 2015 6:30 PM in response to lllaass

    OK here is the deal. I swapped all the RAM out and put in the original RAM. Still locked up. So it isn't the memory. Installed a new SSD and new OS. Still locks up. So I took it to the Genius Bar and they ran some tests that said MCP Die sensor was a problem. They took the machine in and ran more tests. What they say is that there is a broken temperature cable and that the processor tray has to be replaced for $299. For a broken cable?  Anyway, is it possible that the fix is as simple as finding the broken cable and fixing it?

     

    Thanks,

    CJ

  • by Grant Bennet-Alder,

    Grant Bennet-Alder Grant Bennet-Alder Oct 14, 2015 7:20 PM in response to bithead2
    Level 9 (60,931 points)
    Desktops
    Oct 14, 2015 7:20 PM in response to bithead2

    "put in the original RAM. Still locked up. So it isn't the memory."

    That does not follow from the evidence you presented. That just says that there may be problems in the memory you put back, or you may have additional problems.

     

    A bad sensor cable would fail the diagnostic, and the fans would run at high speed all the time. If the cable is bad, it is just one black wire, so yes, the fix should be easy. Apple does repair-by-replacement; they don't solder anything.

  • by bithead2,

    bithead2 bithead2 Oct 15, 2015 1:40 PM in response to Grant Bennet-Alder
    Level 1 (0 points)
    Oct 15, 2015 1:40 PM in response to Grant Bennet-Alder

    I would tend to agree with you. Except that if the machine locks up, I don't see how the fans would run at high speed. Since they are under processor control, it seems that if the processors freeze, nothing at all would happen to fan speed. Will report back what I find.
