I’ve written two reductions which both do the same thing: compute the sum and the squared norm of a vector. One does this in a single pass through the data, accumulating two values as it goes. The second does two passes through the data, accumulating a single scalar value each time. The second approach, as implemented below, is currently more than 3x faster on my machine (and only for vector sizes that fit in cache). This feels like it shouldn’t be the case: the single-pass version should be faster.
using BenchmarkTools, StaticArrays
x_xx(x) = SVector(x, x * x)
xx(x) = x * x
sum1(vec::AbstractVector{T}) where T = sum(x_xx, vec; init=SVector(T(0), T(0)))
sum2(vec) = SVector(sum(vec), sum(xx, vec))
function time()
    A = randn(2^13)
    @btime sum1($A) # A single, combined vector reduction
    @btime sum2($A) # Two vector reductions
    return
end
Is there any way to improve my implementation of the single reduction to make it faster? It’s frustrating at present, because once the vector no longer fits in the cache, the single-pass approach becomes faster, so there is some point at which it is worth switching algorithms.
It’s usually that each of the N passes can use SIMD, and the SIMD speedup factor is greater than N.
A few things are unfortunately like this in Julia:
julia> using Chairmarks

julia> const a = rand(10^6);
julia> @be minimum(a)
Benchmark: 342 samples with 1 evaluation
min 264.215 μs
median 268.488 μs
mean 277.420 μs
max 698.978 μs
julia> @be extrema(a)
Benchmark: 39 samples with 1 evaluation
min 2.560 ms
median 2.569 ms
mean 2.571 ms
max 2.597 ms
Here extrema is slower than taking minimum & maximum separately.
Another example is that findmin can be slower than minimum followed by findfirst(==(min_ele), a).
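For example, on the vector a above, the two-call pattern would look like this (just showing the shape of it; I haven’t benchmarked it here):

julia> min_ele = minimum(a);          # one SIMD-friendly pass over the data

julia> findfirst(==(min_ele), a)      # a second pass just to recover the index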
From what you’re saying, I understand that there is currently no better way of doing a parallel reduction in Julia. However, I’m sure there could be: each SIMD chunk could be loaded, then each reduction applied serially, in the same way the scalar reduction is done. This way you save the extra loads, and the potential cache misses, and the rest of the code stays the same. I can see that this might require a different interface to the reduction wrapper, though.
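Something like this sketch, say with SIMD.jl (the chunk width, the handling of the tail, and the name fused_sum are all my own choices, not an existing interface):

using SIMD

# Sketch only: load each SIMD chunk once and feed it to both accumulators.
function fused_sum(A::Vector{Float64})
    N = 8
    s  = Vec{N,Float64}(0.0)            # accumulator for sum(x)
    s2 = Vec{N,Float64}(0.0)            # accumulator for sum(x^2)
    lane = VecRange{N}(0)
    i = 1
    @inbounds while i + N - 1 <= length(A)
        v = A[lane + i]                  # one vector load per chunk
        s  += v                          # first reduction uses the loaded chunk...
        s2 += v * v                      # ...and the second one reuses it
        i += N
    end
    r, r2 = sum(s), sum(s2)              # horizontal reductions
    @inbounds for j in i:length(A)       # scalar tail for leftover elements
        r  += A[j]
        r2 += A[j]^2
    end
    return (r, r2)
end

The point is just that each chunk is loaded once and then handed to every accumulator in turn.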
I’m not very surprised by this. If memory bandwidth is the limiting factor, it will typically pay to do more operations for each memory access. When everything is in cache, it’s not that big a deal.
sum2 does two passes, each with a single value being accumulated. That means it’s easy to vectorize the operations (a.k.a. SIMD), so 4 or 8 or more of the additions can be done in parallel. Also, it might be that the sum function has some specializations for such simple vectors (or mapreduce does, which sum typically calls for such vectors).
The reduction machinery that underpins sum is complicated and has lots of possible paths depending on all sorts of things, some optimizing performance, some optimizing numerical accuracy, some doing both with varying levels of success. It’s not hard to end up in a less-than-optimal path, depending on your criteria of optimal. It can definitely be improved, especially if you have strong opinions on what “optimal” means to you.
In this case, this isn’t a fundamental limitation for my processor — but it definitely might be if the “fused” operations are complicated enough.
julia> @btime sum1($A);
7.646 μs (0 allocations: 0 bytes)
julia> @btime sum2($A);
3.385 μs (0 allocations: 0 bytes)
julia> function sum3(A)
           r = zero(eltype(A))
           r2 = r*r
           @simd for elt in A
               r += elt
               r2 += elt*elt
           end
           return (r, r2)
       end
sum3 (generic function with 1 method)
julia> @btime sum3($A);
2.472 μs (0 allocations: 0 bytes)
Just a note on terminology here — parallel reduction and pairwise reduction are both fairly well established terms of art that are about the ordering of elements in the reduction and not about the operation itself. I’d call this a “fused” operation, but I’m not sure if there’s more standard terminology.
Stripping away sum also makes the problem much more obvious:
julia> function sum4(A)
           r = SVector(zero(eltype(A)), zero(eltype(A)))
           @simd for elt in A
               r += SVector(elt, elt*elt)
           end
           return r
       end
sum4 (generic function with 1 method)
julia> @btime sum4($A);
7.646 μs (0 allocations: 0 bytes)
This ends up looking very similar to sum3 in the IR, but the compiler needs to be able to identify reduction variables and handle them specially. I don’t know whether, or how, we could support such composite reduction variables right now.
I was thinking about having a separate scalar function for each reduction, such as:
f1(x) = x
f2(x) = x * x
sum(f1, f2, vec)
But if your functions need to recompute shared intermediate values, it’s not optimal (e.g. f1(x) = sin(x) and f2(x) = sin(x)^2 would both compute sin(x)), unless they’re inlined and the compiler can eliminate the redundancy. Anyway, it maybe overcomplicates things, so it’s likely not worth it.
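As a rough sketch of what that interface could do internally (msum is a made-up name, and I’m assuming for simplicity that both functions return values of the element type):

function msum(f1, f2, vec::AbstractVector{T}) where T
    r1 = zero(T)   # accumulator for f1; assumes f1 returns the element type
    r2 = zero(T)   # accumulator for f2
    @simd for x in vec
        r1 += f1(x)
        r2 += f2(x)
    end
    return (r1, r2)
end
# usage would mirror the proposed call: msum(f1, f2, vec)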
Aside from the issues discussed in this thread, I don’t see any reason to use SVector rather than Tuple. You’re not using any linear algebra functionality, so SVector is just a redundant wrapper around Tuple.
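For what it’s worth, a Tuple-based version of sum4 could use broadcasting for the accumulation (sum5 is just my name for it, and I’d expect it to hit the same vectorization issue as the SVector version):

function sum5(A)
    r = (zero(eltype(A)), zero(eltype(A)))
    @simd for elt in A
        r = r .+ (elt, elt * elt)   # tuples support broadcasted arithmetic
    end
    return r
end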
When doing a reduction, the operation is assumed to be associative. This allows the known parallel implementations. But here the crucial bit is using SIMD better, and to do that commutativity is important (it would allow reordering the summations across the SIMD lanes).
Without checking the code, I would assume the specialized sum for this case does use commutative SIMD, but generic reduce or @simd for may fail to know about and use the commutativity (and shouldn’t assume it).
Perhaps this is an example where traits can come in handy.
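For instance, a minimal sketch of the usual Holy-traits pattern applied here (all names invented for illustration; nothing like this exists in Base):

abstract type Commutativity end
struct Commutative    <: Commutativity end
struct NotCommutative <: Commutativity end

commutativity(op) = NotCommutative()          # conservative default
commutativity(::typeof(+)) = Commutative()    # opt in for known operations
commutativity(::typeof(*)) = Commutative()

# A reduction wrapper could then dispatch on the trait:
fused_reduce(op, A) = fused_reduce(commutativity(op), op, A)
fused_reduce(::Commutative, op, A)    = nothing   # free to reorder across SIMD lanes
fused_reduce(::NotCommutative, op, A) = nothing   # must preserve the given order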