Thursday, March 17, 2016

Rest in peace, SIMD intrinsics. Long live auto-vectorization!

A coworker pointed me to the article "How to Write a Math Library in 2016".

I couldn't help but think that perhaps it should have been titled "How to Write a Math Library in 2004"...

SSE came out with the Pentium III in 1999, and SSE2 came out with the Pentium 4 in 2000. Heck, SSE2 is part of the base x86_64 instruction set! You don't even need to turn on any compiler flags to get SSE2 instructions on x86_64. We should really be thinking about making our code scale to arbitrary-width vectors these days -- AVX doubled the vector width in 2011, and AVX512 will double it yet again in 2018/2019 with the Ice Lake architecture.

To be fair to the article's author, he does have a point about the little-known __vectorcall ABI. It can have very measurable performance benefits, even for code that isn't heavy in vector math. It certainly did help for a processor emulator I worked on, where one of the core design choices was to use SIMD registers prolifically so we could do things like single-cycle 64-bit arithmetic on 32-bit hosts. The real problem here is that the canonical 32-bit Windows calling convention is awful, and it's the default. It's used by almost everything, and until 32-bit Windows dies off, we sort of just have to deal with it if we intend to link to code we don't own. You can't just turn on the '/Gv' flag in real projects, because some libraries fail to specify their calling convention in the function declarations and you end up getting linker errors.
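As a hypothetical sketch (the function here is illustrative, not from the emulator), this is how you can opt individual hot functions into the ABI without flipping /Gv project-wide:

#include <immintrin.h>

/* Hypothetical example: annotating one hot function with __vectorcall so
 * its __m128 arguments arrive in XMM registers rather than through memory,
 * without changing the project-wide default. (MSVC 2013 and later.) */
__m128 __vectorcall madd4(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* a*b + c, element-wise */
}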


That said, the SIMD intrinsics the article's author is using really need to go away. To the point my coworker made in his initial email about the article: it's unusual to see readable code come out of a math library like this. Intrinsics are not portable, hard to write, hard to read, and hard for compilers to optimize. Even when they're hidden behind a clean, well-thought-out interface like the math library in the article, I'd argue they're still a bad implementation decision.

Most often, intrinsics map directly to a particular machine instruction (not too different from inline assembly), which gives the compiler far less flexibility with instruction selection and register allocation. A few compilers are smart enough to recognize that they can substitute better instructions and wider vectors from intrinsics when targeting newer hardware, but that rarely works as well as plain old auto-vectorization.

In my experience, it's better to just write reference C code, let the compiler do its job, and then dig into the assembly to determine if the compiler did something stupid. If it did, it's not uncommon to find that rearranging the deck chairs with seemingly insignificant changes can yield big performance benefits. And if you see some optimization that your compiler does not, report it to your compiler vendor! To me, intrinsics should basically only be used when there's no other way to get the code you want.

For example, take the C statement '1.0f / sqrtf(x)' (reciprocal square root). On x86/x86_64, GCC notices (with -ffast-math enabled) that it can estimate the value with 'rsqrtss' and improve the estimate with a Newton-Raphson iteration. Because the compiler is smart here, the optimal implementation couldn't be simpler:

static inline float rsqrtf(float f)
{
    return 1.0f / sqrtf(f);
}

Unfortunately, GCC on ARM fails to recognize that the reciprocal square root can be estimated with the NEON instruction 'vrsqrte' and refined with 'vrsqrts'. So the optimal implementation for that target is much scarier, and if it weren't for the function name, I probably wouldn't know what it was doing either (a fine example of painful-to-read SIMD intrinsics; I've reconstructed the type definitions so the snippet stands alone):

#include <arm_neon.h>

/* Reconstructed definitions: a generic 4-float vector type, plus a union
 * to pun between it and the native NEON type. */
typedef float vf32x4_t __attribute__((vector_size(16)));
typedef union { vf32x4_t p; float32x4_t v; } v4;

static inline vf32x4_t
rcp_sqrt_nr_ps(const vf32x4_t _v) {
    v4 vec, result;
    vec.p = _v;
    result.v = vrsqrteq_f32(vec.v);  /* initial estimate of 1/sqrt(x) */
    /* one Newton-Raphson step, e' = e * (3 - x*e*e) / 2, via vrsqrts */
    result.v = vmulq_f32(vrsqrtsq_f32(vmulq_f32(result.v, result.v), vec.v), result.v);
    return result.p;
}


Most of the time, the key to getting auto-vectorization to work properly is to write clean, simple code and thoughtfully consider your data layout. When targeting a SIMD instruction set in a modern CPU, use a structure-of-arrays (SOA) layout instead of an array-of-structures (AOS) layout. When targeting SIMT (e.g. CUDA -- many threads running small, simple chunks of code), use AOS.
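To make the distinction concrete, here's a minimal sketch of the two layouts (the particle fields are illustrative):

#define N 4096

/* Array-of-structures (AOS): each particle's fields sit together.
 * Natural for SIMT, where each thread owns one particle. */
struct particle { float x, y, z, m; };
struct particle particles_aos[N];

/* Structure-of-arrays (SOA): each field is contiguous in memory, so one
 * vector load grabs 4/8/16 consecutive x values -- exactly what the
 * auto-vectorizer wants. */
struct particles_soa {
    float x[N], y[N], z[N], m[N];
};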

For examples of this, take a look at my n-body simulation. Only the "AVX intrin" implementation shown below uses intrinsics, and it comes in second place to carefully written reference C code that the compiler optimized better by auto-vectorizing it into the target's native AVX2 instructions (Haswell):

Running simulation with 262144 particles, crosscheck disabled, CPU enabled, 32 threads
      CPU_SOA:  3614.18 ms =   19.014x10^9 interactions/s (   380.28 GFLOPS)
CPU_SOA_tiled:  3086.19 ms =   22.267x10^9 interactions/s (   445.34 GFLOPS)
   AVX intrin:  3358.99 ms =   20.458x10^9 interactions/s (   409.17 GFLOPS)
      CPU_AOS:  9860.42 ms =    6.969x10^9 interactions/s (   139.38 GFLOPS)
CPU_AOS_tiled:  5843.29 ms =   11.760x10^9 interactions/s (   235.21 GFLOPS)
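For a sense of what the winning reference C looks like, here's a hypothetical sketch of an SOA inner loop in the same spirit (not the exact code from the project):

#include <math.h>

/* Plain scalar C over contiguous SOA arrays. Given this layout, the
 * compiler is free to vectorize the j loop at whatever width the target
 * supports -- SSE2, AVX2, or AVX512 -- without any source changes. */
static void compute_accel_soa(
    float *restrict ax, float *restrict ay, float *restrict az,
    const float *x, const float *y, const float *z, const float *m,
    int n, float softeningSq)
{
    for (int i = 0; i < n; i++) {
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float distSq = dx*dx + dy*dy + dz*dz + softeningSq;
            float inv = 1.0f / sqrtf(distSq);  /* hello again, rsqrt */
            float s = m[j] * inv * inv * inv;
            fx += dx * s;
            fy += dy * s;
            fz += dz * s;
        }
        ax[i] = fx; ay[i] = fy; az[i] = fz;
    }
}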

Aside from the variety of intrinsics implementations I included in the project, the code is entirely portable and retargetable. I could target an old 2010-era Westmere CPU, or even a non-x86 architecture like ARMv8. But much more importantly, I don't have to change anything in my code to take advantage of new instruction set features or compiler improvements when they are released. I could rebuild this code in 2018 or 2019 against the upcoming Ice Lake architecture, and it'd take advantage of AVX512 instructions.


Compilers and CPUs have changed a lot in 20 years:

Wherever you can, avoid SIMD intrinsics. Compilers are very good at auto-vectorizing code these days. Treat your compiler like a glorified assembler, and write code that's easy for it to optimize.

It's time to start questioning, and potentially abandoning, long-accepted 20-year-old optimization techniques like loop unrolling. On modern hardware, unrolling by hand just blows out your instruction and trace caches by creating unnecessarily long code sequences. Compilers are even introducing flags that would have sounded crazy not too long ago, like -freroll-loops, which deliberately tries to undo the hand-coded loop unrolling gymnastics and Duff's devices that people once had to use.
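A minimal illustration (hypothetical code) of the kind of hand-unrolling worth abandoning:

#include <stddef.h>

/* The old-school 4x manual unroll: longer code, a cleanup loop, and a
 * hard-coded width the compiler has to work around. */
float sum_unrolled(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4)
        sum += a[i] + a[i+1] + a[i+2] + a[i+3];
    for (; i < n; i++)  /* leftovers */
        sum += a[i];
    return sum;
}

/* The modern preference: write the obvious loop, and let the compiler
 * pick the unroll factor and vector width for the actual target. */
float sum_plain(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}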

In a similar vein to loop unrolling, forget bloated and overly complex memcpy() implementations that try to selectively use SSE instructions or large moves. Starting with Ivy Bridge, Intel changed 'rep movsb' (the "Enhanced REP MOVSB" feature) to do all the hard work for you, and it frequently beats the hand-optimized implementations that used to be king (see the Intel 64 and IA-32 Architectures Optimization Reference Manual, pages 3-91 through 3-94).
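As a hypothetical sketch of what that means in practice for GCC/Clang on x86_64:

#include <stddef.h>

/* On Ivy Bridge and later (Enhanced REP MOVSB), a bare 'rep movsb' lets
 * the microcode choose the copy strategy, and it often matches or beats
 * elaborate SSE-based memcpy loops. */
static inline void *memcpy_ermsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)  /* rdi, rsi, rcx */
                      :
                      : "memory");
    return ret;
}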