[ANN] LoopVectorization

Probably not. I tried, but I think at least the JIT did not work.

EDIT:
@tkf I don’t want to hijack your ThreadsX.jl thread to talk about LoopVectorization.jl, so I figured it’d be better to move that part of the discussion here (unless you think somewhere else would be more appropriate).

I am assuming that the workload per element is more-or-less constant, but perhaps I shouldn’t assume the outermost axis is much longer than Threads.nthreads(). At least one of the loops must be much longer if threading’s overhead is going to be amortized. It’d be unfortunate if, e.g., there were two loops of 3 and 3,000,000 iterations respectively, and LoopVectorization picked the 3-iteration loop to partition and thread on a many-threaded CPU.
Maybe the loop lengths could be passed as compile-time constants, in which case I could have it check that loops are at least a certain length relative to Sys.CPU_THREADS or VectorizationBase.NUM_CORES before threading them. FWIW, it currently treats dynamically sized loops as having length 1024 when making optimization decisions.
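
Roughly, that check could look something like this (a hypothetical sketch, not an actual LoopVectorization API; `choose_threaded_loop` and `min_per_thread` are made-up names):

```julia
# Hypothetical sketch: given loop lengths known at compile time, only thread
# if the longest loop is long enough relative to the core count, and pick
# that loop to partition.
function choose_threaded_loop(loop_lengths::NamedTuple; min_per_thread = 64)
    len, i = findmax(collect(values(loop_lengths)))
    len >= min_per_thread * Sys.CPU_THREADS || return nothing  # too short to thread
    keys(loop_lengths)[i]  # e.g. returns :n for (m = 3, n = 3_000_000)
end
```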

The reason I’d generally prefer to pick a relatively outer loop is to reduce overhead.
As one example, when I get around to parallelizing matrix multiplication, it will not happen within LoopVectorization (or at least, not until it models cache/memory bandwidth). Instead, PaddedMatrices.jmul! currently wraps 3 additional loops around the @avx loop to add extra tiling and pack blocks from A and B.
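
Schematically, the structure is something like the following (a sketch of the general packing/tiling pattern, not the actual PaddedMatrices.jmul! implementation; the block sizes are placeholders):

```julia
using LoopVectorization

# Three blocking loops wrap an @avx microkernel; blocks of A and B are
# packed (copied) into contiguous arrays before the inner kernel runs.
function blocked_mul!(C, A, B; Mc = 96, Kc = 256, Nc = 512)
    M, K = size(A); N = size(B, 2)
    fill!(C, zero(eltype(C)))
    for nc in 1:Nc:N, kc in 1:Kc:K
        Bblock = B[kc:min(kc + Kc - 1, K), nc:min(nc + Nc - 1, N)]  # pack a block of B
        for mc in 1:Mc:M
            Ablock = A[mc:min(mc + Mc - 1, M), kc:min(kc + Kc - 1, K)]  # pack a block of A
            Cblock = view(C, mc:min(mc + Mc - 1, M), nc:min(nc + Nc - 1, N))
            @avx for m in axes(Ablock, 1), n in axes(Bblock, 2)
                Cmn = zero(eltype(C))
                for k in axes(Ablock, 2)
                    Cmn += Ablock[m, k] * Bblock[k, n]
                end
                Cblock[m, n] += Cmn  # accumulate across kc blocks
            end
        end
    end
    C
end
```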

Maybe I should read the literature to find out how others add threading, but as a guess, perhaps @spawn on both the outermost loop and the second-from-outermost loop (the M and N loops)?
I’m inclined to try that, and look at the literature if it’s still slower by the time I run out of ideas.
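
That guess, as a sketch (`spawn_mn_blocks` and `f!` are hypothetical names; `f!` stands in for the packed microkernel call):

```julia
using Base.Threads: @spawn

# Spawn one task per (M-block, N-block) pair, i.e. parallelize the two
# outermost loops, then wait for all of them.
function spawn_mn_blocks(f!, M, N; Mc = 512, Nc = 512)
    tasks = Task[]
    for mc in 1:Mc:M, nc in 1:Nc:N
        push!(tasks, @spawn f!(mc:min(mc + Mc - 1, M), nc:min(nc + Nc - 1, N)))
    end
    foreach(wait, tasks)
end
```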

Is the idea of halve to use a recursive “cache-oblivious” approach, until the length of the iterable is below some threshold?
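
That is, if I understand it correctly, something like this (a sketch assuming SplittablesBase.halve; `recsum` and `basesize` are made-up names):

```julia
using SplittablesBase: halve
using Base.Threads: @spawn

# Recursively split the iterable with `halve` until a piece is below
# `basesize`, spawning one half at each level of the recursion.
function recsum(xs; basesize = 4096)
    length(xs) <= basesize && return sum(xs)
    left, right = halve(xs)
    task = @spawn recsum(right; basesize = basesize)
    recsum(left; basesize = basesize) + fetch(task)
end
```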

Probably worth considering different ways of iterating. Currently, LoopVectorization assumes you’re iterating over an AbstractUnitRange or CartesianIndices{N}, and it generates corresponding nested while loops (N of them, for the CartesianIndices).
The natural thing, if continuing that approach, would be to split the threaded loop into two nested loops (outer over nchunk, inner over chunksize), and @spawn the inner loop.
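
A minimal sketch of that splitting (`spawn_chunks` and `body!` are stand-ins; `body!` plays the role of the generated inner loop):

```julia
using Base.Threads: @spawn

# The original range becomes an outer loop over `nchunks` chunks; each
# chunk's inner loop runs in its own spawned task.
function spawn_chunks(body!, r::AbstractUnitRange, nchunks = Threads.nthreads())
    chunksize = cld(length(r), nchunks)
    tasks = map(1:nchunks) do c
        lo = first(r) + (c - 1) * chunksize
        hi = min(lo + chunksize - 1, last(r))
        @spawn body!(lo:hi)
    end
    foreach(wait, tasks)
end
```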

LoopVectorization currently doesn’t support non-rectangular iteration, and it also doesn’t support loop-carried dependencies; it currently assumes that it can arbitrarily rearrange the order of evaluations.
However, I would like it to eventually be able to generate kernels for things like solving triangular systems of equations, which would require lifting both of those restrictions.
Really figuring out how to do this will take more time than I’ll be able to dedicate for a while.
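
For a concrete picture of both restrictions, forward substitution for L * x = b has a triangular (non-rectangular) iteration space and a loop-carried dependency:

```julia
# The inner loop bound depends on the outer index (non-rectangular), and
# each x[i] depends on all previously computed x[j], so the iterations
# cannot be freely reordered.
function forward_substitution!(x, L, b)
    n = length(b)
    for i in 1:n
        s = b[i]
        for j in 1:i-1           # triangular: bound depends on i
            s -= L[i, j] * x[j]  # loop-carried: uses earlier results
        end
        x[i] = s / L[i, i]
    end
    x
end
```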

But that’s something to at least start keeping in mind now.
Handling loop-carried dependencies would mean that I can’t just naively carve up the iteration space. For cases like A / L (where L is a lower or upper triangular matrix, and A is an arbitrary strided matrix), there is an axis I can freely partition (the rows of A). But what about something like a Cholesky decomposition?
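
The A / L case partitions naturally, e.g. (a sketch; `threaded_rdiv!` is a made-up name, using LinearAlgebra’s rdiv! as the per-block solve):

```julia
using LinearAlgebra: LowerTriangular, rdiv!
using Base.Threads: @spawn

# Each row of A is an independent triangular solve in A / L, so row blocks
# can be handed to separate tasks.
function threaded_rdiv!(A, L, nchunks = Threads.nthreads())
    chunksize = cld(size(A, 1), nchunks)
    tasks = map(1:nchunks) do c
        rows = (c - 1) * chunksize + 1 : min(c * chunksize, size(A, 1))
        @spawn rdiv!(view(A, rows, :), LowerTriangular(L))
    end
    foreach(wait, tasks)
end
```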

Hmm. @avx actually converts the expression into a call to _avx_!, a generated function.
So perhaps the approach to take would be to carve up the iteration space, and then repeatedly call @spawn _avx_!(args...). This will handle the unrolling and vectorization, as well as the loops around the unrolled code.
The _avx_! function also returns any reductions (in vector form), so I could then fetch and combine them. This handles the scoping problem I was worried about. Hopefully this is better than writing to a sum buffer and then loading from it.
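
In sketch form (with `kernel` standing in for the `_avx_!` call, whose real signature is internal to LoopVectorization):

```julia
using Base.Threads: @spawn

# Each task computes a partial reduction over its chunk; the partials are
# fetched and combined, rather than written to a shared buffer.
function threaded_reduce(kernel, x, nchunks = Threads.nthreads())
    chunksize = cld(length(x), nchunks)
    tasks = map(1:nchunks) do c
        rng = (c - 1) * chunksize + 1 : min(c * chunksize, length(x))
        @spawn kernel(view(x, rng))
    end
    sum(fetch, tasks)
end
```

E.g., threaded_reduce(sum, randn(10^7)) would split the sum across tasks and combine the fetched partials.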

Also, FWIW, LoopVectorization currently unrolls up to 3 loops (e.g., it will unroll all three loops for A' * B), and it will unroll both loops in the 3-arg dot product. You want to unroll the m loop to vectorize it, but unrolling the n loop by U also lets you reuse each x[m] you’ve loaded U times before replacing it.
This is critical to performance, because loading memory is the bottleneck, not arithmetic, even when the memory is aligned and fits in cache.
vfmadd is as fast as an aligned load, and twice as fast as an unaligned load. You already have 1 load from A per vfmadd, hence the importance of minimizing loads from x (loads from y get hoisted out of the loop, so they’re fine).
You also need the multiplications by y. The Julia 1.4+ implementation lifts the multiplication by y[n] out of the loop, but I didn’t see a real benefit when trying that with LoopVectorization, which is much faster over a range of sizes. The arithmetic isn’t the bottleneck.
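
For reference, the 3-arg dot product I’m describing looks like this (a sketch; LoopVectorization picks the unrolling, so each loaded x[m] gets amortized over several columns of A):

```julia
using LoopVectorization

# s = x' * A * y: the m loop is vectorized, and unrolling the n loop
# reuses each x[m] load across several columns of A.
function dot3(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for n in axes(A, 2), m in axes(A, 1)
        s += x[m] * A[m, n] * y[n]
    end
    s
end
```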
