Mac Pro 5,1 - High Sierra - 970 EVO Plus NVMe on 4x PCIe adapter - getting periodic kernel panic “NVMe: Command timed-out and request found in the completion queue…”

Hi. I’ve been trying to upgrade the storage and RAM in my Mac Pro 5,1 (Mid 2010).


After installing a 970 EVO Plus 1TB NVMe blade (https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/970evoplus/) in a Micro Connectors Low-Profile M.2 NVMe PCIe x4 adapter (http://microconnectors.com/micro-connectors-low-profile-m-2-nvme-ssd-to-pcie-x4-adapter-with-heat-sink-for-1u )


I am periodically getting the following kernel panic:


panic(cpu 0 caller 0xffffff7f9d993e2b): nvme: " NVMe: Command timed-out and request found in the completion queue \n"@/BuildRoot/Library/Caches/com.apple.xbs/Sources/IONVMeFamily/IONVMeFamily-356.71.1/IONVMeController.cpp:5184


I can trigger this 100% of the time by running Blackmagic Disk Speed Test on the 970 NVMe APFS Volume - during which the Write test will start to ramp up the speedometer (to approx 600-750) and then hang and 15-30 seconds later the Mac Pro reboots.


Periodically the panic will just occur spontaneously without any user interaction with the NVMe volume.


I’ve also copied 100GB of directories/files to/from/within the APFS Volume on the 970 NVMe and sometimes that works ok and sometimes it hits the same stall/timeout/panic.


When running AJA System Test Lite, the write test will sometimes (but not always) stall for 1-2 seconds and then proceed to complete - leaving the write rate in the 700 range rather than 1500 - but it has never done a stall/panic as of yet.
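For anyone who wants to reproduce the write stall without Blackmagic or AJA, here is a minimal sketch that writes a file in chunks and flags chunks that take far longer than the median. The names `timed_write` and `find_stalls`, the 4 MB chunk size, and the 10x threshold are all assumptions for illustration, and the script has not been run against the setup described above. Note that on macOS `os.fsync` does not guarantee the drive itself has flushed its cache, so the timings are only a rough indicator.

```python
import os
import statistics
import tempfile
import time


def timed_write(path, total_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write total_bytes to path in chunks; return seconds taken per chunk."""
    chunk = os.urandom(chunk_bytes)
    durations = []
    with open(path, "wb") as f:
        written = 0
        while written < total_bytes:
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to flush this chunk before the timer stops
            durations.append(time.perf_counter() - start)
            written += chunk_bytes
    return durations


def find_stalls(durations, factor=10.0):
    """Return indexes of chunks that took more than factor x the median time."""
    if not durations:
        return []
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]


if __name__ == "__main__":
    # Hypothetical target: a temp directory. Point the path at the NVMe volume instead.
    with tempfile.TemporaryDirectory() as tmp:
        times = timed_write(os.path.join(tmp, "probe.bin"), 64 * 1024 * 1024)
        print(len(times), "chunks, stalls at", find_stalls(times))
```

A 1-2 second stall like the one seen in AJA should show up as one or two flagged chunk indexes. A full timeout would presumably still panic the machine, so treat this as a way to characterize the stall, not to avoid it.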


Notes:

  • the cMP is quite bare-bones right now (HD 5780 / 1x SATA SSD in HDD tray / 970 NVMe blade in PCIe 4x adapter card / 48GB RAM)
  • ROM version is currently 140.0.0.0.0 - I started to update to Mojave and did the corresponding ROM update - but then decided I’d rather have access to the boot screen/options since the HD 5780 gives me sufficient display resolution - so I’m booting High Sierra 10.13.6 from a SSD in a HDD tray right now (eventually will put it in a PCIe/SSD-SATA card)
  • There is NO heatsink installed on the 970 NVMe blade - but it is barely warm anyways.
  • Panic issue occurs with either the HD 5780 card installed or a GT120 card installed
  • Panic issue occurs with either the original 4x 4GB RAM sticks or with the upgraded 3x 16GB RAM sticks in slots 1/2/3 (and I ran memtest overnight with the 16GB sticks without any errors)
  • I’ve done a reinstall of High Sierra and also reset NVRAM/SMC multiple times
  • I moved the 970 NVMe stick/PCIe-card over to a Win7 box - deleted the partitions (diskpart) - created an NTFS volume - ran Samsung Magician which says the 970 firmware is the Latest - ran the Samsung Magician performance test with no issues - ran the “winsat disk -drive <drive-letter> ” disk performance test a few times with no issues - copied 50GB-100GB of data to/from/within the NVMe volume with no issues - also tried to run the Win version of Blackmagic, but without any actual Blackmagic cards installed I was unable to select a target drive
  • I also tried another Micro Connectors NVME PCIe x4 adapter and the issue still occurs (http://microconnectors.com/m-2-nvme-80mm-ssd-pcie-x4-adapter-with-covered-heat-sink/)
  • I do have a Lycom DT-120 M.2 PCIe card arriving tomorrow and will test with that (though I doubt that will resolve things given that it looks to be just a pass-through between the NVMe blade and the MB PCIe x4 slot)


Does anyone have any suggestions on next steps?


Thanks,

Jim


(posting full panic report in the next post - seems to be hitting a 5000 char posting limit)



Posted on Mar 8, 2019 4:30 PM

25 replies

Mar 8, 2019 4:36 PM in response to jdub9

(and here's the kernel panic dump)


Anonymous UUID:       0F0C8485-1791-E425-4AEA-25C524345D04

Fri Mar  8 09:06:49 2019

*** Panic Report ***

panic(cpu 0 caller 0xffffff7f9d993e2b): nvme: " NVMe: Command timed-out and request found in the completion queue \n"@/BuildRoot/Library/Caches/com.apple.xbs/Sources/IONVMeFamily/IONVMeFamily-356.71.1/IONVMeController.cpp:5184


Backtrace (CPU 0), Frame : Return Address


0xffffff85820cbb60 : 0xffffff801b06e1c6 

0xffffff85820cbbb0 : 0xffffff801b196a74 

0xffffff85820cbbf0 : 0xffffff801b188d44 

0xffffff85820cbc60 : 0xffffff801b0201e0 

0xffffff85820cbc80 : 0xffffff801b06dc3c 

0xffffff85820cbdb0 : 0xffffff801b06d9fc 

0xffffff85820cbe10 : 0xffffff7f9d993e2b 

0xffffff85820cbe30 : 0xffffff801b69f7ac 

0xffffff85820cbea0 : 0xffffff801b69f6d6 

0xffffff85820cbed0 : 0xffffff801b0a7624 

0xffffff85820cbf40 : 0xffffff801b0a7185 

0xffffff85820cbfa0 : 0xffffff801b01f557 

      Kernel Extensions in backtrace:

         com.apple.iokit.IONVMeFamily(2.1)[1170C79B-9E09-3CD3-970B-C419EBF9037F]@0xffffff7f9d97f000->0xffffff7f9d9befff

            dependency: com.apple.driver.AppleMobileFileIntegrity(1.0.5)[F314E6BA-45C5-3D8B-A3BF-C4CAE13DDADC]@0xffffff7f9bfaf000

            dependency: com.apple.iokit.IOPCIFamily(2.9)[D91E9813-9717-31B8-BFE5-2F3A00F375F3]@0xffffff7f9b894000

            dependency: com.apple.driver.AppleEFINVRAM(2.1)[F35A52E2-CF80-3BA9-92B5-25EFE216094F]@0xffffff7f9be32000

            dependency: com.apple.iokit.IOStorageFamily(2.1)[F27A8A2A-6662-3608-83BD-415037509E01]@0xffffff7f9bb7c000

            dependency: com.apple.iokit.IOReportFamily(31)[D2F2FBDF-4EE4-38BA-99F5-B699F886F413]@0xffffff7f9c377000

BSD process name corresponding to current thread: kernel_task

Mac OS version:

17G5019
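As a cross-check on the report above, the panic "caller" address can be matched against the address range listed under "Kernel Extensions in backtrace". The sketch below does that arithmetic on an abbreviated excerpt of the report. The function name `locate_caller` and the regular expressions are assumptions made for illustration, not part of any Apple tool.

```python
import re

# Abbreviated excerpt of the panic report posted above.
PANIC_TEXT = """\
panic(cpu 0 caller 0xffffff7f9d993e2b): nvme: ...
      Kernel Extensions in backtrace:
         com.apple.iokit.IONVMeFamily(2.1)[1170C79B-9E09-3CD3-970B-C419EBF9037F]@0xffffff7f9d97f000->0xffffff7f9d9befff
            dependency: com.apple.iokit.IOPCIFamily(2.9)[D91E9813-9717-31B8-BFE5-2F3A00F375F3]@0xffffff7f9b894000
"""

CALLER_RE = re.compile(r"panic\(cpu \d+ caller (0x[0-9a-f]+)\)")
# Only lines with a start->end range match; "dependency:" lines list a single address.
KEXT_RE = re.compile(
    r"^\s*([\w.]+)\([^)]*\)\[[0-9A-F-]+\]@(0x[0-9a-f]+)->(0x[0-9a-f]+)",
    re.MULTILINE,
)


def locate_caller(report):
    """Return (kext_name, offset_into_kext) for the panic caller, or None."""
    match = CALLER_RE.search(report)
    if match is None:
        return None
    caller = int(match.group(1), 16)
    for name, start, end in KEXT_RE.findall(report):
        low, high = int(start, 16), int(end, 16)
        if low <= caller <= high:
            return name, caller - low
    return None


if __name__ == "__main__":
    name, offset = locate_caller(PANIC_TEXT)
    print(name, hex(offset))  # com.apple.iokit.IONVMeFamily 0x14e2b
```

The caller address falls inside the IONVMeFamily range, which fits the panic string coming from IONVMeController.cpp.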


Mar 8, 2019 6:02 PM in response to jdub9

17G5019 is only High Sierra. There is another set of firmware updates required to get to Mojave, and it is not clear which contain the updates required for NVMe drives as a boot drive. What is your firmware version now? MP510084.B00 or better?


Have you activated TRIM by using


trimforce enable


... reading the riot act and following the prompts? It may not allow TRIM on that device anyway, but it is another thing to consider.

Mar 8, 2019 6:14 PM in response to Grant Bennet-Alder

First - what is the best way to post the entire Panic Report and Etrechk here to this thread? It's well beyond the "5000 character" post limit I keep hitting.


Firmware version is actually at 140.0.0.0.0 - since I did install Mojave at one point and did its firmware update - but I have now re-installed High Sierra. I decided to stay on High Sierra and keep my boot screens (with HD 5780 installed) but haven't yet tried to downgrade the ROM back to 0089.


Yes, I had done "trimforce enable" already and verified it for both the boot SSD (SATA 960 EVO in a HDD tray) and for the 970 NVMe drive (the problem SSD).


Mar 8, 2019 6:31 PM in response to Grant Bennet-Alder

DON'T downgrade the firmware. It appears from everything we have seen here to be fully backward-compatible.


No need to test memory any more on a Mac Pro with Xeon Processor. It has error correcting memory and will fix on the fly and keep going, until it halts with a double-bit error, which is a very distinctive machine-check report.

Mar 8, 2019 6:55 PM in response to Grant Bennet-Alder

Thx. Yeah, def not going to do a firmware downgrade wrt drilling in on characterizing the failure. I did read that 140.0.0.0.0 was backward compatible with High Sierra. And will gladly avoid any chance at bricking the tower and having to recover firmware.


It was 3x 16GB sticks off of eBay - thought I'd at least run memtest overnight on them for yucks - not sure how stressful memtest actually is tho.


Mar 8, 2019 6:57 PM in response to jdub9

Nothing in your reports jumps out at me as an obvious problem, but give it some time and others will read and maybe they will see something I don't.


I don't like to install a 4x card in slot 2, because it takes the heat of the graphics card from slot 1 and is a 16x slot. I would rather leave that slot empty and let some more air move through. That may just be me. Slots 3 and 4 are 4x slots, but have full length connectors to support larger cards if needed.


If you can change Etrecheck permissions to allow "Full disk access", it can provide a summary of all panic and similar diagnostic reports from the last 7 days -- much easier than reading them individually, and sometimes can be helpful.

Mar 8, 2019 9:01 PM in response to jdub9

I don't know if those NVMe drives support any SMART or health reporting, but you might want to see if DriveDX is able to view the SMART Attributes (or "Health Indicators") for the drives. If it cannot, you could try using the "smartctl" app directly on the command line. The SSDs and even the PCIe card, may not allow or even support those commands. I know "smartctl" does support the reading of an NVMe drive's attributes if an SSD supplies them. I would check the SATA SSD as well just in case. Feel free to post screenshots of the "Health Indicators" as DriveDX doesn't always interpret an SSD's health. Besides a drive's health, they sometimes report on power & communication errors.
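If "smartctl" can read the drive, its JSON output is easier to compare before and after a panic than the text output. The sketch below assumes smartmontools 7.0 or later (which added JSON output via `smartctl -j -a <device>`) and assumes the NVMe health data appears under the `nvme_smart_health_information_log` key. The sample values are made up for illustration and are NOT from the drive in this thread. `nonzero_counters` is a hypothetical helper name.

```python
import json

# Illustrative values only, NOT taken from the drive in this thread.
SAMPLE_JSON = """
{"nvme_smart_health_information_log": {
  "critical_warning": 0, "temperature": 41, "percentage_used": 0,
  "unsafe_shutdowns": 12, "media_errors": 0, "num_err_log_entries": 3}}
"""

# Counters worth comparing before and after a timeout/panic.
WATCHED = ("critical_warning", "media_errors",
           "num_err_log_entries", "unsafe_shutdowns")


def nonzero_counters(smartctl_json):
    """Return the watched NVMe health counters that are present and nonzero."""
    log = json.loads(smartctl_json).get("nvme_smart_health_information_log", {})
    return {key: log[key] for key in WATCHED if log.get(key, 0) != 0}


if __name__ == "__main__":
    print(nonzero_counters(SAMPLE_JSON))
```

Saving one snapshot before a test run and one after a panic, then comparing the two dictionaries, would show whether the drive itself logged an error or an unsafe shutdown.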


Have you checked the System logs at the time of the incidents to see if anything shows up? It is a long shot trying to get anything useful from those logs these days, but may be worth looking.


I forget how those PCIe slots are linked, but I agree with Grant. Besides the heat factor which is not good for an SSD, I would not want it to be sharing any resources with the GPU which could cause conflicts. IIRC Apple had their RAID cards installed in the bottom most slot, so might be a good choice for your card as well.


Which drive is the OS installed on? Have you tried using HFS+ on the NVMe? APFS is new and perhaps it has issues with third party drives, especially through a third party PCIe card.


You could be experiencing issues with the power supply. You could try running "mprime" in Torture Test mode to stress the CPU, memory & power supply. You may have to use the command line version as I haven't had any luck using the GUI version.


I would try to determine if you have a hardware compatibility issue or whether it is a software issue. The best way to determine if you have a hardware or software issue is to boot Linux and try your write tests on the NVMe drive. You know the card works in another system and this will allow you to test everything in place. The quick way is to create a Knoppix Linux USB boot disk using Etcher. Or you could install Ubuntu Linux on a drive and run the tests. The SSD would have to be reformatted to standard GPT partitions without CoreStorage and use a supported file system. EXT4 is best, but HFS (no journal, not sure about HFS+), NTFS, FAT32, and exFAT are supported as well.

Mar 12, 2019 7:05 AM in response to lllaass

Thx. I did see that and a few similar panic reports (i.e. "NVMe: Command timed-out") wrt 970+ in the hackintosh forums (and also a few in MBPs).

I still have a number of experiments to run sometime this week (Lycom DT-120 card, different PCIe slot, catch something in system log, test with a Linux build, etc).


I also thought about buying a previous generation 970 EVO (not Plus) to see if it worked ok - but that wouldn't really indicate whether my current 970+ issue is due to that specific device or is a general product hw or sw incompatibility.

And the channel seems to be pretty flushed of the 970 (not Plus) 500GB and 1TB units.


The whole NVMe APST area seems a bit interesting, but could be totally unrelated (and I'm not yet familiar with the NVMe protocols in general).


I am wondering if com.apple.iokit.IONVMeFamily might have an internal debug/transaction logging mechanism that could be enabled somehow.


Thx,

Jim




Mar 12, 2019 7:29 AM in response to jdub9

catch something in system log, test with a Linux build, 

You may want to perform a Secure Erase on the drive if Samsung supports this functionality and the adapter allows the commands as it will factory reset the SSD. You will need to install the "nvme-cli" package which includes a format command to Secure Erase the SSD and it can also check the SMART Attributes of the drive if supported. "smartctl" in the smartmontools package also supports reading of NVMe SMART information as well. This can be done using a Knoppix USB drive.

Mar 12, 2019 8:15 AM in response to HWTech

Secure erase was on my list of experiments to try - basically a factory reset of the NVMe SSD (unless there is another mechanism for that). I did find the info on nvme-cli and smartctl (and DriveDx) but was first going to just move the NVMe SSD in the PCIe adapter over to a Win7 box and use Samsung Magician to do the secure erase. I'll also put smartctl and DriveDx on the cMP and get the SMART info (before and after and also after an NVMe Command Timeout failure)

Mar 12, 2019 9:46 AM in response to Grant Bennet-Alder

Caution: What is called 'secure erase' for rotating magnetic drives makes no sense for the completely digital technology used in SSD drives. It may increase wear levels with no benefit whatsoever.

A Secure Erase can sometimes correct internal issues within an SSD because it touches every NAND block and it also resets the SSD to factory defaults. I've resolved multiple SSD issues with a Secure Erase. An SSD's controller is like another computer inside the SSD and sometimes the SSD just gets stuck/confused and needs a reset like any other system.


Secure Erase is implemented in SSDs precisely to make sure all NAND blocks are reset (even the failed blocks that are no longer accessible) with minimal wear to the NAND so that a person can be sure there is no trace of their data left if encryption was not being used. An SSD swaps blocks in & out during normal use so there can be blocks hidden from user access still containing data until a TRIM or the internal Garbage Collection routines wipes the block clean. Now Apple doesn't mention any of this because Apple expects everyone to be using FileVault hence negating some of the need for a Secure Erase which cannot be done using macOS anyway.


See this OWC blog which talks about an SSD Secure Erase. It only touches on the security aspect, but a Secure Erase also has the other benefits I mentioned. Here is a Micron article providing details on Secure Erase and what effect it has on the SSD.


Since lllaass links to a possible issue with the SSD, a Secure Erase is something useful to try.

Mar 12, 2019 9:55 AM in response to jdub9

Is there another mechanism besides Secure Erase that will essentially do a "factory reset" on the device?

No, a Secure Erase (sometimes called Sanitize) is the only option. The nvme-cli utility uses the "format" command to Secure Erase the SSD. It has two options, one for just the Secure Erase aspect, and a second one for the Crypto Scramble if the drive is self-encrypting (SED). See this Micron article for some information on Secure Erase & Sanitize. It is very hard to find good information on the topic. Of course these things can vary by manufacturer as well.
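To make the two "format" options concrete: in nvme-cli they are selected with the `--ses` (secure erase setting) option, where 1 requests a user data erase and 2 a cryptographic erase. The sketch below only builds the command line and does not run it, since the command destroys all data on the target. The device path `/dev/nvme0n1` is a hypothetical example and `build_format_cmd` is an illustrative helper name. Check the device name on the actual Linux boot before running anything.

```python
# Secure erase settings accepted by "nvme format --ses".
SES_VALUES = {"user-data": 1, "crypto": 2}


def build_format_cmd(device, mode):
    """Build (but do not run) the nvme-cli argv for a secure-erase format."""
    if mode not in SES_VALUES:
        raise ValueError(f"mode must be one of {sorted(SES_VALUES)}")
    return ["nvme", "format", device, f"--ses={SES_VALUES[mode]}"]


if __name__ == "__main__":
    # Hypothetical device path; this prints the command instead of executing it.
    print(" ".join(build_format_cmd("/dev/nvme0n1", "user-data")))
```

Printing the command first and pasting it by hand is a deliberate choice here, because picking the wrong device would wipe the wrong drive.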


You may want to use both "smartctl" and "nvme-cli" to view the SMART status as I believe they present it differently. There may even be an enhanced SMART output as well. My only experience with NVMe drives is with Apple's SSDs and they barely implement basic functions depending on the brand.
