Yes it SIMDs. First, there’s an easy and mostly accurate way to test if an expression SIMDs or not without scrutinizing the code: run it on data with half the number of bits. Properly SIMDed code should then see a close to 2x speedup. Testing with the above code, first 64 bit floats:
julia> N = 100_000; ops = rand([MUL, DIV], N);
julia> a, b, v = 1:3 .|> _ -> rand(N);
julia> @btime $v .= evaluate_op.($ops, $a, $b);
393.465 μs (0 allocations: 0 bytes)
julia> @btime $v .= evaluate_op2.($ops, $a, $b);
57.408 μs (0 allocations: 0 bytes)
Now 32 bit floats:
julia> a, b, v = 1:3 .|> _ -> rand(Float32, N);
julia> @btime $v .= evaluate_op.($ops, $a, $b);
403.560 μs (0 allocations: 0 bytes)
julia> @btime $v .= evaluate_op2.($ops, $a, $b);
27.580 μs (0 allocations: 0 bytes)
So evaluate_op2 most likely SIMDs. To verify, and to see how:
julia> @code_native syntax=:intel broadcast!(evaluate_op2, v, ops, a, b)
...
L1536:
vmovups ymm1, ymmword ptr [rdx + 4*rbx]
vmovups ymm2, ymmword ptr [r11 + 4*rbx]
vpcmpeqd ymm3, ymm0, ymmword ptr [rcx + 4*rbx]
vdivps ymm4, ymm1, ymm2
vmulps ymm1, ymm1, ymm2
vblendvps ymm1, ymm4, ymm1, ymm3
vmovups ymmword ptr [rsi + 4*rbx], ymm1
add rbx, 8
cmp rdi, rbx
jne L1536
...
What this code does is quite clever: it loads 256 bits of values at a time from each of the a and b input vectors (on my AVX2 laptop; newer CPUs would presumably work with 512 bits). That is 4x 64 bit floats, or 8x 32 bit floats. It then both multiplies and divides the corresponding values, and finally selects (blends) either the product or the quotient for each element, depending on its operation.
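At the source level, the same compute-both-then-select shape can be written explicitly with ifelse. The enum and function names below are illustrative stand-ins, not the original definitions from above:

```julia
# Hypothetical stand-ins for the operation enum used in the benchmarks.
@enum Op MUL DIV

# Branchless select: both the product and the quotient are computed,
# then one is chosen per element -- mirroring the
# vmulps / vdivps / vblendvps sequence in the assembly.
branchless_eval(op::Op, a, b) = ifelse(op == MUL, a * b, a / b)
```

Because ifelse evaluates both of its value arguments unconditionally, the loop body contains no branches, which is what leaves the compiler free to vectorize the broadcast.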
Given that only four 64 bit floats are processed at a time, and that the code does extra work (it always both multiplies and divides), one might wonder how it can be a whopping 7 times faster than the non-SIMD version. The answer is that the code above is completely branchless (apart from the loop condition), whereas the non-SIMD version will incur constant, costly branch misses. To see that this is the case, we can use a smaller vector, so that the CPU will learn the branching behavior during the benchmark:
julia> N = 1000; ops = rand([MUL, DIV], N);
julia> a, b, v = 1:3 .|>_-> rand(N);
julia> @btime $v .= evaluate_op.($ops, $a, $b);
1.211 μs (0 allocations: 0 bytes)
julia> @btime $v .= evaluate_op2.($ops, $a, $b);
572.929 ns (0 allocations: 0 bytes)
Now, without branch mispredictions, the SIMD version is only ~2x faster than the non-SIMD version, which intuitively makes sense (it processes 4 values per iteration, but does 2x the work).
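Another way to make the branch-miss cost visible, without shrinking the vector, is to sort the ops so the branchy version sees a perfectly predictable pattern. This is a suggested experiment, not one from the benchmarks above:

```julia
# Hypothetical re-declaration so this snippet is self-contained.
@enum Op MUL DIV

N = 1000
ops = rand([MUL, DIV], N)
sort!(ops)  # all MULs first, then all DIVs: every branch becomes predictable
```

Re-running the branchy @btime on these sorted ops should likewise shrink the gap to the SIMD version, since the predictor essentially stops missing.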
And no, this particular pattern won’t scale very well if more operations are added.
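To see why: with the blend pattern, every supported operation's result must be computed for every element, so the per-element work grows linearly with the size of the op set. A hypothetical four-op extension (names illustrative) makes this concrete:

```julia
# Hypothetical four-op enum; every element pays for all four ops.
@enum Op4 ADD SUB MUL DIV

function eval4(op::Op4, a, b)
    # All four results are computed unconditionally, then blended down.
    r = ifelse(op == ADD, a + b, a - b)
    r = ifelse(op == MUL, a * b, r)
    ifelse(op == DIV, a / b, r)
end
```

Per element this always performs one add, one subtract, one multiply, and one divide, so the branchless win keeps shrinking as operations are added.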
Sorry, long post, probably irrelevant to the OP, I got a bit carried away there.