Issue: YOLO Model Training Extremely Slow on M3 Max

System

  • MacBook Pro with M3 Max (40-core GPU, 40GB GPU RAM)
  • 128GB unified memory
  • macOS Sequoia 15.3.2

Problem


Despite high-end hardware, YOLO model training is unexpectedly slow with poor GPU utilization. The same models train significantly faster on NVIDIA hardware.


Attempted Solutions


  • Latest Apple Silicon-optimized ML frameworks
  • Various batch sizes (4-32)
  • Both MPS and CPU backends
  • Native and Rosetta environments
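
Before comparing the backends and environments listed above, it can help to confirm which one the interpreter is actually running in. The following is a minimal sketch, assuming Python with PyTorch optionally installed; the helper names `is_rosetta` and `describe_environment` are illustrative and not part of any library.

```python
import platform
import subprocess


def is_rosetta() -> bool:
    """Return True if this process is being translated by Rosetta 2.

    Reads the macOS sysctl key `sysctl.proc_translated`; on other systems,
    or if the key is missing, this returns False.
    """
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return False
    return out.stdout.strip() == "1"


def describe_environment() -> dict:
    """Collect the facts that decide whether the MPS backend can be used."""
    info = {
        "machine": platform.machine(),  # 'arm64' when native on Apple silicon
        "rosetta": is_rosetta(),
        "torch": None,
        "mps_built": False,
        "mps_available": False,
    }
    try:
        import torch  # optional; assumed here to be the framework in use
    except ImportError:
        return info
    info["torch"] = torch.__version__
    mps = getattr(torch.backends, "mps", None)
    if mps is not None:
        info["mps_built"] = mps.is_built()
        info["mps_available"] = mps.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A `machine` value of `x86_64` or `rosetta: True` would suggest a translated interpreter, which is worth ruling out before tuning anything else.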


Request


Please advise on optimizing YOLO training for Apple Silicon. Are there specific configurations or known limitations with MPS implementation for object detection models?


This high-end machine was purchased specifically for ML development, but current performance makes it impractical for YOLO training and for other deep learning frameworks used for image processing.


Thanks



Posted on May 9, 2025 7:56 PM

11 replies

May 10, 2025 12:38 PM in response to DrSaadLa

<< After all that, am I missing something out? I asked myself.   >>


Yes, you are missing something.


Models ported from another system may need optimization for the exact hardware you are running on. That is often not as simple as flipping a switch.


If you just recompiled for a different processor, it would run, but you would call it very slow, because it cannot immediately take advantage of the array transform processors offered by the GPU and Neural Engine unless the code (or the runtime environment) is modified to support those co-processors.

May 10, 2025 3:58 PM in response to DrSaadLa

"Apple Silicon MPS Training

With the support for Apple silicon chips integrated in the Ultralytics YOLO models, it's now possible to train your models on devices utilizing the powerful Metal Performance Shaders (MPS) framework. The MPS offers a high-performance way of executing computation and image processing tasks on Apple's custom silicon.

To enable training on Apple silicon chips, you should specify 'mps' as your device when initiating the training process. Below is an example [...]"


https://docs.ultralytics.com/modes/train/#idle-gpu-training
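
The device setting described in the quoted documentation can be sketched roughly as follows. This is an illustration under assumptions, not the documentation's own example: the helper names are hypothetical, `yolo11n.pt` and `coco8.yaml` are placeholder weights and dataset names, and the epoch, batch and worker values are arbitrary starting points.

```python
def pick_device() -> str:
    """Return 'mps' if PyTorch reports the MPS backend as available, else 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def build_train_args(data: str, epochs: int, imgsz: int,
                     batch: int, workers: int) -> dict:
    """Assemble keyword arguments for the Ultralytics `model.train(...)` call."""
    return {
        "data": data,
        "epochs": epochs,
        "imgsz": imgsz,
        "batch": batch,
        "workers": workers,
        "device": pick_device(),
    }


def train(weights: str = "yolo11n.pt", data: str = "coco8.yaml"):
    """Run a short training job; requires the `ultralytics` package."""
    # Imported lazily so the helpers above work without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    args = build_train_args(data, epochs=3, imgsz=640, batch=16, workers=4)
    return model.train(**args)
```

Calling `train()` starts a short run on whichever device `pick_device()` selects, so the same script can be compared across machines.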



May 10, 2025 8:06 AM in response to DrSaadLa

The excellent speed of AI tasks on Apple-silicon Macs depends on using not only the CPU and GPU but also the Neural Engine, a specialized, fast, short-floating-point array transform processor.


https://en.wikipedia.org/wiki/Neural_Engine


Vast improvements are possible. Here is an example from a quick search, of a user who made a slight change that dramatically improved YOLO performance:


https://pysource.com/2023/03/28/object-detection-with-yolo-v8-on-mac-m1/



May 10, 2025 8:02 AM in response to DrSaadLa

I would contact the authors of YOLO and advise them of the problem. They need to start receiving bug reports from users, and they can ask through their Apple Developer accounts why this is happening.


A couple of things I would also keep in mind:

iCloud, Time Machine, and other auto-syncing software can affect performance. Unless you have a hard-wired fiber-optic internet connection, the asynchronous network communication happening in the background will cause lag.


"Optimizing" software does nothing of the sort. It is a placebo, because it dumps temporary cache files for the system that can get corrupted. Left running overnight and restarted the next morning, the system will do its own cleanup better. EtreCheck can identify other memory-resident programs you may want to remove if they are unwanted.


Don't let your hard drive get more than 85% full. This benchmark has been tested and found true of all computers, as the point of diminishing returns. Archive any data you don't need instant access to on external media or in the cloud.
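
The 85% figure above is easy to check from Python. This is a small sketch using only the standard library; the function names and the default path are illustrative, and the path should point at the volume that holds the training data.

```python
import shutil


def percent_used(path: str = "/") -> float:
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str = "/", threshold: float = 85.0) -> bool:
    """Return True if the volume is fuller than `threshold` percent."""
    return percent_used(path) > threshold


if __name__ == "__main__":
    print(f"{percent_used('/'):.1f}% used, over 85%: {over_threshold('/')}")
```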

May 10, 2025 10:05 AM in response to Grant Bennet-Alder

I have bought the ultimate M3 Max. I would say I was very satisfied until recently, when I had to use pre-trained models for object detection (the YOLO series).


The machine is blazingly fast with deep learning models for time series, image classification and other tasks.


But when it comes to YOLO, it is significantly slower. I did thorough research and tested many solutions (different versions of python, torch, torch-nightly, and different parameters such as batch size, n-workers ...), but nothing worked.
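
When sweeping settings such as batch size and worker count, a per-step timing makes the comparison concrete. Below is a minimal sketch of such a timing helper, not a benchmark from this thread. It assumes `step` is any callable that performs one training step and `sync` is an optional callable that waits for the device to finish its queued work (with PyTorch, `torch.mps.synchronize` or `torch.cuda.synchronize` would be natural candidates). The names `time_steps` and `images_per_second` are illustrative.

```python
import time
from statistics import median
from typing import Callable, Optional


def time_steps(step: Callable[[], None],
               sync: Optional[Callable[[], None]] = None,
               warmup: int = 3,
               iters: int = 10) -> float:
    """Return the median wall-clock seconds per call of `step`."""
    for _ in range(warmup):  # warm-up calls are not timed
        step()
    if sync:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync:
            sync()  # wait for queued device work before stopping the clock
        samples.append(time.perf_counter() - start)
    return median(samples)


def images_per_second(batch_size: int, seconds_per_step: float) -> float:
    """Convert a per-step time into throughput for a given batch size."""
    return batch_size / seconds_per_step


if __name__ == "__main__":
    # Stand-in workload; replace with one real forward/backward/optimizer step.
    def fake_step() -> None:
        sum(i * i for i in range(50_000))

    per_step = time_steps(fake_step)
    rate = images_per_second(16, per_step)
    print(f"{per_step * 1e3:.2f} ms/step, {rate:.0f} img/s at batch 16")
```

Running the same helper with the device set to CPU and then to MPS, at a few batch sizes, gives numbers that can be put side by side with the Kaggle run.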


What disappointed me is that a single model took 24 hours, while it took one hour on an NVIDIA GPU (I tested that on Kaggle). Since I need more memory, I have to train that model on my local machine.


The issue is that using ensemble models or different model architectures would then take weeks or even months. Is that logical?


After all that, am I missing something out? I asked myself.


With all due respect, some of the solutions available on other platforms, such as the video on YouTube, are naively simple.

May 10, 2025 1:42 PM in response to DrSaadLa

DrSaadLa wrote:

Please advise on optimizing YOLO training for Apple Silicon. Are there specific configurations or known limitations with MPS implementation for object detection models?

No one here knows anything about YOLO. It is the developer's responsibility to ensure their product works on Apple platforms. Maybe they do that, maybe they don't.

