The Nxd (I think you meant 2×N) array case should be 100x faster because all memory accesses happen in the optimal order: the element a[1,i] is exactly contiguous to a[2,i], whereas a[i,1] and a[i,2] are at two distant locations.
Have an example you can point to here?
The issue is probably that d is dynamic instead of known at compile time for the d×N matrix, while for the AoS it is at least known at compile time (the struct has a fixed number of fields).
Being known at compile time allows for some fairly efficient shuffling to load/store the data.
If you want to vectorize a loop, the easiest way is to perform multiple loop iterations in parallel. That means you want a[i,1] and a[i+1,1] to be close in memory, and likewise a[i,2] and a[i+1,2].
That is what you get from the N×d layout and from SoA.
With AoS, however, a[1,i] and a[1,i+1] aren’t close. But if d is known at compile time and the d “loop” or operations across d are unrolled, the compiler can use a sequence of shuffles/permutes/etc to transition from AoS to SoA (on load) and then back to AoS on store.
This of course adds a little overhead, but generally not too much when it works.
If the stride of the d×N array is not known at compile time, this makes the transformation much less efficient or may prevent the optimization entirely.
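To make the layout difference concrete before the benchmark below, here is a small sketch (A and B are just illustrative arrays, not from the original post) of which accesses land next to each other in column-major memory:

A = rand(2, 5)   # d×N layout
B = rand(5, 2)   # N×d layout
LinearIndices(A)[2, 3] - LinearIndices(A)[1, 3]  # 1: the two components of element 3 are contiguous
LinearIndices(B)[3, 2] - LinearIndices(B)[3, 1]  # 5: the two components of element 3 are a whole column apart
LinearIndices(B)[4, 1] - LinearIndices(B)[3, 1]  # 1: the same component of consecutive elements is contiguous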
using BenchmarkTools, StructArrays

# Array of structs: each element stores both components together
struct ComplexAoS
    real::Float64
    imag::Float64
end

# Struct of arrays: each component gets its own contiguous vector
struct ComplexSoA
    real::Vector{Float64}
    imag::Vector{Float64}
end

function normSoA(x)
    out = Array{Float64,1}(undef, length(x.real))
    @inbounds for i in 1:length(x.real)
        out[i] = sqrt(x.real[i]^2 + x.imag[i]^2)
    end
    out
end

function normAoS(x)
    out = Array{Float64,1}(undef, length(x))
    @inbounds for i in 1:length(x)
        out[i] = sqrt(x[i].real^2 + x[i].imag^2)
    end
    out
end

# d×N matrix: the two components of element i are x[1,i] and x[2,i]
function norm_dN(x)
    n = size(x,2)
    out = Array{Float64,1}(undef, n)
    @inbounds for i in 1:n
        out[i] = sqrt(x[1,i]^2 + x[2,i]^2)
    end
    out
end

# N×d matrix: the two components of element i are x[i,1] and x[i,2]
function norm_Nd(x)
    n = size(x,1)
    out = Array{Float64,1}(undef, n)
    @inbounds for i in 1:n
        out[i] = sqrt(x[i,1]^2 + x[i,2]^2)
    end
    out
end

N = 10_000_000;
arr_Nd = rand(N,2);
arr_dN = permutedims(arr_Nd);
arrAoS = [ComplexAoS(arr_Nd[i,1],arr_Nd[i,2]) for i in 1:N];
arrSoA = ComplexSoA(arr_Nd[:,1],arr_Nd[:,2]);

@btime normAoS($arrAoS);
@btime normSoA($arrSoA);
@btime norm_Nd($arr_Nd);
@btime norm_dN($arr_dN);

# sanity check that all four variants agree:
# [normAoS(arrAoS) normSoA(arrSoA) norm_Nd(arr_Nd) norm_dN(arr_dN)]
What is the proper Julia way to declare a fixed value for d in this example?
I want to write the code generically for all d, but state that “for this run, please specialize the code for d = 2 only”.
It is subtle to do something like this right and to grasp the implications correctly, but I would go with:
# Generic function that does not specialize on 'd'
f(d::Int, other_parameters) = ...

# Function that compiles for a specific 'd'
f(::Val{d}, other_parameters) where {d} = f(d, other_parameters)

function main()
    ...
    # now I want to call f specialized for a specific d
    f(Val(2), other_parameters)
    ...
end
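As a hedged sketch (not from the original posts), applying this pattern to the norm_dN example above could look like the following; the ntuple over Val(D) unrolls the small inner loop over d at compile time:

# Illustrative: a norm_dN method specialized on the compile-time dimension D,
# assuming the arr_dN matrix defined in the benchmark above.
function norm_dN(x, ::Val{D}) where {D}
    n = size(x, 2)
    out = Vector{Float64}(undef, n)
    @inbounds for i in 1:n
        out[i] = sqrt(sum(ntuple(d -> x[d, i]^2, Val(D))))
    end
    out
end

norm_dN(arr_dN, Val(2))  # "for this run, specialize for d = 2 only"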
Note that if you have a dynamically sized array, you should probably also add an assert on the size, or something to tell LLVM that the stride == d.
Result from adding a D == stride(Y,2) || throw(...):
julia> @btime interleave0!($Y, Val(3)) # without the check
261.521 ns (0 allocations: 0 bytes)
341.0932625746682
julia> @btime interleave1!($Y, Val(3)) # with the check
146.241 ns (0 allocations: 0 bytes)
341.0932625746683
Nearly a 2x improvement from adding a check! =)
Full code:
julia> using BenchmarkTools
julia> function interleave0!(Y, ::Val{D}) where {D}
           s = 0.0
           @fastmath for i in axes(Y,2)
               s += sqrt(mapreduce(abs2, +, ntuple(d -> @inbounds(Y[d,i]), Val(D))))
           end
           return s
       end
interleave0! (generic function with 1 method)
julia> function interleave1!(Y, ::Val{D}) where {D}
           # same as interleave0!, except we add a check
           D == stride(Y,2) || throw(ArgumentError("Unexpected stride value for Y!"))
           s = 0.0
           @fastmath for i in axes(Y,2)
               s += sqrt(mapreduce(abs2, +, ntuple(d -> @inbounds(Y[d,i]), Val(D))))
           end
           return s
       end
interleave1! (generic function with 1 method)
julia> Y = randn(3,200);
julia> @btime interleave0!($Y, Val(3)) # without an assert
261.521 ns (0 allocations: 0 bytes)
341.0932625746682
julia> @btime interleave1!($Y, Val(3)) # with the assert
146.241 ns (0 allocations: 0 bytes)
341.0932625746683
EDIT:
I initially used an @assert, but replaced it with the check & throw because at some point in the future, @asserts may gain the option to be disabled for performance.
If this replaces @asserts with @llvm.assumes, that’d be fine. But if it just removes them, we’d suddenly get worse performance instead!
Hence, I figured I’d go with throw instead.
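For concreteness, the two variants being weighed are just (a sketch; the message text is illustrative):

@assert D == stride(Y, 2)                                # might be disabled for performance in the future
D == stride(Y, 2) || throw(ArgumentError("bad stride"))  # always enforced, so the compiler can rely on it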
Yes.
LLVM actually generates nice-looking vector code for interleave0!, but behind a check that stride(Y,2) == 1. Oops.
Of course, stride(Y,2) != 1, so this branch and the nice-looking vector code will never be used.
The manual check/throw tells it that stride(Y,2) == D, so from there it generates much less nice-looking code that assumes/requires stride(Y,2) == D, but that code is still much faster than the scalar path the first version has to take (even though the vector sqrt’s throughput doesn’t actually scale with vector width on my computer…).
Knowing stride(Y,2) == D lets it load contiguous vectors and then permute them to go AoS to SoA efficiently.
If stride(Y,2) == 1, that of course isn’t necessary, which is why the code from interleave0! looks better at first glance, despite being unusable in practice and thus only adding a branch and increasing code size.
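One way to see why the explicit check carries information the compiler cannot assume on its own (a sketch; the view is just an example of an array whose column stride differs from D):

Y  = randn(3, 200)
stride(Y, 2)   # 3 == D: dense column-major matrix, columns are packed back to back
Yv = view(randn(6, 200), 1:3, :)
stride(Yv, 2)  # 6 != D: for such an array the check in interleave1! would throw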