Assume I have some vectors a,b,c… and I want to compute \sum_{ijk..} a_i b_j c_k ...
For 3 lists, I can do it using nested for loops :
function sumprod1(a, b, c)
s = 0
for i in 1:length(a)
for j in 1:length(b)
for k in 1:length(c)
s += a[i]*b[j]*c[k]
end
end
end
return s
end
Or wite a general function that make use of Iterators.product :
function sumprod2(lists::Vararg{Vector{Float64}})
s = 0
p = [length(list) for list in lists]
ranges = [1:i for i in p]
for inds in Iterators.product(ranges...)
c = [lists[i][inds[i]] for i in 1:length(inds)]
s += reduce(*, c)
end
return s
end
The problem is that the general solution is much slower than the nested for loops:
N = 2^8
a = rand(N)
b = rand(N)
c = rand(N)
@time println(sumprod1(a,b,c))
@time println(sumprod2(a,b,c))
returns
0.040710 seconds (9.22 k allocations: 719.312 KiB, 62.68% compilation time)
8.615370 seconds (151.25 M allocations: 7.267 GiB, 2.29% gc time, 1.29% compilation time)
How can I understand this ? Is there a way to speedup the general solution ?
You have no hope of coming close to the for loops here, because length(ranges) in your code is not known until runtime, so the Iterators.product is type-unstable.
If you want to have decent performance with Iterators.product, you need to use containers whose lengths (i.e. the “dimensionality” of your loop) are known at compile time, like tuples.
I see that most of the time is spent in lists[i][inds[i]] . Could it be that the problem comes from not knowing the order in which the vectors are iterated over ? Could there be a way to specify they will be iterated over in sequential order ?
Also the second option allocates way more memory than the first one, maybe suggesting that its not using the cache ? Can I force cache usage ?
The length of lists is known from type-level information, but this expression “loses” that quality, the length of p perhaps isn’t known to the compiler. Also it allocates unnecessarily. Same with ranges and c.
Try something like this instead:
function sumprod2(lists::Vararg{Vector{Float64}})
p = map(length, lists)
to_range = n -> Base.OneTo(n)
ranges = map(to_range, p)
s = 0
for inds in Iterators.product(ranges...)
c = (lists[i][inds[i]] for i in 1:length(inds))
s += reduce(*, c)
end
return s
end
Everything required for a fast version is known at compile time, but with varargs (the list... argument) you sometimes need to force specialization using the type annotation Vararg{T,N} where {N}. Also, your implementation had issues with excessive allocation of intermediate arrays, as well as initializing s to a zero of the wrong type (0 is an integer zero). Here’s a fast version:
function sumprod3(lists::Vararg{Any,N}) where {N}
s = prod(zero ∘ eltype, lists)
for inds in Iterators.product(eachindex.(lists)...)
s += prod(lists[i][inds[i]] for i in eachindex(lists, inds))
end
return s
end
Benchmarked against a version of sumprod1 with the initialization issue fixed:
julia> using BenchmarkTools
julia> @btime sumprod1fixed($a, $b, $c)
35.544 ms (0 allocations: 0 bytes)
2.067942582493859e6
julia> @btime sumprod3($a, $b, $c)
36.265 ms (0 allocations: 0 bytes)
2.0679425824940344e6
Definition of sumprod1fixed
function sumprod1fixed(a, b, c)
s = zero(eltype(a)) * zero(eltype(b)) * zero(eltype(c))
for i in eachindex(a), j in eachindex(b), k in eachindex(c)
s += a[i] * b[j] * c[k]
end
return s
end
Avoiding the use of an iterator here could give possible further perf improvement. Something like this:
function sumprod4(lists::Vararg{Any,N}) where {N}
s = prod(zero ∘ eltype, lists)
for inds in Iterators.product(eachindex.(lists)...)
s += mapreduce(getindex, *, lists, inds)
end
s
end
That version, verbatim, was my first attempt, but it’s a hair slower, which is why I went with the generator. Julia’s mapreduce appears not to be fully optimized in general.
FWIW, on the laptop in my lap, with nightly Julia, both versions seem equally fast (13.4 ms):
julia> using BenchmarkTools
julia> (@benchmark sumprod3(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));
julia> median(ans).time
1.33713225e7
julia> (@benchmark sumprod4(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));
julia> median(ans).time
1.336118e7
Interesting. The difference (~36 ms. vs ~40 ms) is reproducible on my laptop, also when sampling different a, b, c per sample like you did (and using battery saver mode to avoid thermal throttling).
FTR providing an init keyword argument to the mapreduce in the loop body in sumprod4 achieves a further, small, speedup. Or maybe it was just noise, lol.
Maybe an improvement between 1.10.5 and 1.12 then, I’m not quite as bleeding edge as you (also I have an Intel i7, but I don’t suppose Intel vs. AMD would matter for the ability to optimize mapreduce relative to generators)
To clarify a little: The product-of-sums vs sum-of-product formulation is not so much about coding style or type inference, it’s about complexity class (O(n*k) instead of O(n^k), for k vectors of n elements each).
It’s not clear from your question whether that is relevant to your real setting – might as well be that you have a different problem, and you just chose an ill-considered stand-in / model problem for discourse (ill-considered because it admits structure and optimizations that are not present in your real problem). Choosing the right simplified toy model is hard!