Why are vectorized operations faster in julia?

simd
#1

In case of numpy, vectorization offloads heavy computation to the code in compiled libraries, improving speed. Julia by default compiles down to machine code, why then is this implementation of reduce operation faster than the naive implementation:

function mapreduce_impl(f, op::Union{typeof(max), typeof(min)},
                        A::AbstractArray, first::Int, last::Int)
    a1 = @inbounds A[first]
    v1 = mapreduce_first(f, op, a1)
    v2 = v3 = v4 = v1
    chunk_len = 256
    start = first + 1
    simdstop  = start + chunk_len - 4
    while simdstop <= last - 3
        # short circuit in case of NaN
        v1 == v1 || return v1
        v2 == v2 || return v2
        v3 == v3 || return v3
        v4 == v4 || return v4
        @inbounds for i in start:4:simdstop
            v1 = _fast(op, v1, f(A[i+0]))
            v2 = _fast(op, v2, f(A[i+1]))
            v3 = _fast(op, v3, f(A[i+2]))
            v4 = _fast(op, v4, f(A[i+3]))
        end
        checkbounds(A, simdstop+3)
        start += chunk_len
        simdstop += chunk_len
    end
    v = op(op(v1,v2),op(v3,v4))
    ...
    return v

The variable names suggest that it has something to do with simd. How does using four variables instead of one allow simd optimizations to take place?

1 Like
#2

The simple answer (even before simd) boils down to dependency chains: Your CPU can do a lot of work in every single cycle, but most operations take multiple cycles until their result is ready. Hence, a*(b*(c*d)) is often slower than (a*b)*(c*d) (for generalized reducers like *, +, min, max, hash, etc). “sequential execution” is a lie: Your CPU is actually a very complex, badly documented and highly parallel beast that interprets the high-level language known as “assembly”.

Then, simd. For many operations, the cpu can operate on large vectors (128-512 bit) at the same time, and this is especially useful if these vectors are contiguous in memory. For some operations, there are even built-in reductions running on a (128-512 bit) vector.

In between, there is the compiler (llvm), the floating point semantics offered by the cpu, and the floating point semantics guaranteed by julia. They all subtly mismatch!

For example, julia’s min/max (as well as naive mathematical min/max) are associative and commutative, which is the best possible world for clever reorderings. Unfortunately, llvm is notoriously bad at figuring this out. It can optimize integer arithmetic and @fastmath float products and sums, but fails at float min/max reassociation (regardless of @fastmath). Secondly, the x86 min/max instructions on floating points have subtly different behavior with respect to NaN (and maybe signed zeros?) than julia. And third, due to the way @fastmath is implemented, there is currently no sensible way of opting out of julia-correct handling of corner cases, in order to get the blazingly fast pure simd minimum/maximum.

The above code is a compromise that is still correct, and reliably uses 128 bit wide simd and maybe sometimes gets 256 bit simd right, but is unfortunately quite ugly and probably suboptimal on weak arm. Source: Long discussions on github when the above code got merged. The above is much faster and uglier than what we had before.

I think there is some PR in llvm underway that teaches min/max commutativity and associativity to llvm. With that, everything will become nicer.

13 Likes
#3

Further reading: https://github.com/JuliaLang/julia/pull/30320
…and the upstream LLVM fix: https://reviews.llvm.org/rL345220

1 Like
#4

There are two meanings of the word “vectorization” in common usage, and they refer to different things. When we talk about “vectorized” code in Python/Numpy/Matlab/etc., we are usually referring to the fact that code like:

x = [1, 2 3]
y = x + 1

is faster than:

x = [1, 2, 3]
for i in 1:3  
  y[i] = x[i] + 1
end

This kind of vectorization is helpful in languages like Python and Matlab because every operation in Python is slow. Every loop iteration, every call to +, every array lookup, etc. has an inherent overhead from the way the language works. So in Python and Matlab, it’s faster to “vectorize” your code by, for example, only paying the cost of looking up the + operation once for the entire vector x rather than once for each element x[i].

Julia does not have this problem. In Julia, y .= x .+ 1 and for i in 1:3; y[i] = x[i] + 1; end compile down to almost exactly the same code and perform comparably. So the kind of vectorization you need in Python and Matlab is not necessary in Julia.

The other kind of vectorization refers to SIMD operations (discussed above), and refers to your CPUs ability to perform some operations on multiple values in a single clock cycle. Julia benefits from this just as much as any other fast language.

10 Likes