New M1 Chipset, SIMD

Question

ramin-raeisi Author

Level 1

6 points


Hello *,

I wonder whether the new M1 chipset supports SIMD intrinsics or provides something similar?

Best Regards,

Ramin

MacBook

Posted on Nov 21, 2020 5:25 AM


Answer 1

Top-ranking reply

rorden

Level 1

8 points

Nov 25, 2020 5:10 AM in response to ramin-raeisi

The M1 supports Neon (128-bit) SIMD instructions. It does not support SVE SIMD instructions. Here is a benchmark where scalar C code is compared with explicitly-vectorized Neon code. No difference is observed, either reflecting that the test is constrained by the memory wall or that the Clang/LLVM compiler is automatically vectorizing scalar code. Regardless, the floating-point performance and memory bandwidth of the M1 are not to be denied.
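To make the scalar versus explicitly-vectorized comparison concrete, here is a minimal sketch (not the benchmark code itself). The function names `add_scalar` and `add_simd` are made up for illustration. The Neon path is only compiled when the compiler defines `__ARM_NEON`, as it does when targeting the M1; elsewhere the function falls back to the plain loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* Neon intrinsics, available when targeting arm64 */
#endif

/* Scalar reference: out[i] = a[i] + b[i]. */
static void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Same sum, but four floats per step (one 128-bit register) when Neon is present. */
static void add_simd(const float *a, const float *b, float *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));  /* add and store 4 floats */
    }
#endif
    /* Remaining elements (or all of them when Neon is unavailable). */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions should produce identical results. Whether the explicit version is actually faster depends on the compiler's auto-vectorization and on memory bandwidth, which is consistent with the benchmark showing no difference.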


Answer 2

leroydouglas

Level 10

199,416 points

Nov 21, 2020 7:05 AM in response to ramin-raeisi

ramin-raeisi wrote:

Hello *,

I wonder whether the new M1 chipset supports SIMD intrinsics or provides something similar?

Best Regards,

Ramin

What replaces x86 intrinsics for C when Apple ditches Intel ...


Answer 3

rorden

Level 1

8 points

Nov 25, 2020 6:58 AM in response to leroydouglas

For the example I provided, I used sse2neon, which implements the x86-64 SIMD intrinsics (MMX, SSE, AES) in terms of their Neon counterparts. Therefore, the only change needed for the C code to compile on the M1 was this conditional:

#ifdef __x86_64__
#include <immintrin.h>
#else
#include "sse2neon.h"
#endif

This allows you to use the same intrinsics for both architectures. Intel provides a great guide for using the x86-64 intrinsics.
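As a rough illustration of what "the same intrinsics for both architectures" looks like in practice, here is a small sketch. The function `scale_in_place` and the `USE_SSE_API` macro are hypothetical names, not part of my benchmark. It assumes `sse2neon.h` is on the include path when building for the M1. The `__has_include` check is an extra guard I added so the file still compiles, with a plain loop, if the header is missing.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <immintrin.h>          /* native SSE intrinsics */
#define USE_SSE_API 1
#elif defined(__has_include)
#if __has_include("sse2neon.h")
#include "sse2neon.h"           /* SSE intrinsics implemented with Neon */
#define USE_SSE_API 1
#endif
#endif

#ifndef USE_SSE_API
#define USE_SSE_API 0           /* no SSE-style API found: scalar loop only */
#endif

/* Multiply n floats in place by factor. */
static void scale_in_place(float *data, size_t n, float factor) {
    assert(data != NULL);
    size_t i = 0;
#if USE_SSE_API
    __m128 f = _mm_set1_ps(factor);             /* broadcast factor to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);      /* unaligned load of 4 floats */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));
    }
#endif
    /* Tail elements, or the whole array when no SSE-style API is available. */
    for (; i < n; i++) {
        data[i] *= factor;
    }
}
```

The body of `scale_in_place` contains only SSE intrinsics and does not change between x86-64 and the M1. Only the include at the top differs.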


Answer 4

leroydouglas

Level 10

199,416 points

Nov 25, 2020 6:21 AM in response to rorden

rorden wrote:

The M1 supports Neon (128-bit) SIMD instructions. It does not support SVE SIMD instructions. Here is a benchmark where scalar C code is compared with explicitly-vectorized Neon code. No difference is observed, either reflecting that the test is constrained by the memory wall or that the Clang/LLVM compiler is automatically vectorizing scalar code. Regardless, the floating-point performance and memory bandwidth of the M1 are not to be denied.

Nice.

Thanks for your post rorden.
