Nested for loops vs Iterators.product performance

Assume I have some vectors a,b,c… and I want to compute
\sum_{ijk\dots} a_i b_j c_k \cdots

For 3 lists, I can do it using nested for loops:

function sumprod1(a, b, c)
    s = 0
    for i in 1:length(a)
        for j in 1:length(b)
            for k in 1:length(c)
                s += a[i]*b[j]*c[k]
            end
        end
    end
    return s
end

Or write a general function that makes use of Iterators.product:

function sumprod2(lists::Vararg{Vector{Float64}})
    s = 0
    p = [length(list) for list in lists]
    ranges = [1:i for i in p]
    for inds in Iterators.product(ranges...)
        c = [lists[i][inds[i]] for i in 1:length(inds)]
        s += reduce(*, c)
    end
    return s
end

The problem is that the general solution is much slower than the nested for loops:

N = 2^8
a = rand(N)
b = rand(N)
c = rand(N)
@time println(sumprod1(a,b,c))
@time println(sumprod2(a,b,c))

returns

  0.040710 seconds (9.22 k allocations: 719.312 KiB, 62.68% compilation time)
  8.615370 seconds (151.25 M allocations: 7.267 GiB, 2.29% gc time, 1.29% compilation time)

How can I understand this? Is there a way to speed up the general solution?

You have no hope of coming close to the for loops here, because length(ranges) in your code is not known until runtime, so the Iterators.product is type-unstable.

If you want to have decent performance with Iterators.product, you need to use containers whose lengths (i.e. the “dimensionality” of your loop) are known at compile time, like tuples.
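A minimal illustration of this point (my example, not part of the original answer): the element type of an Iterators.product is inferable when the ranges come as a tuple, because the number of ranges is encoded in the tuple's type, but not when they are splatted from a Vector whose length only exists at runtime.

```julia
# The number of ranges is part of the tuple's type, so the compiler knows
# every element of the product is a concrete Tuple{Int64, Int64}:
ranges_tuple = (1:2, 1:3)
eltype(Iterators.product(ranges_tuple...))  # Tuple{Int64, Int64}

# A Vector of ranges carries its length only at runtime, so a function that
# splats it (product(ranges...)) cannot be inferred: the compiler does not
# know how many indices each yielded tuple will hold, and every iteration
# allocates a boxed tuple of unknown length.
ranges_vector = [1:2, 1:3]
```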


I see that most of the time is spent in lists[i][inds[i]]. Could it be that the problem comes from not knowing the order in which the vectors are iterated over? Could there be a way to specify that they will be iterated over in sequential order?
Also, the second option allocates way more memory than the first one, maybe suggesting that it's not using the cache? Can I force cache usage?

The length of lists is known from type-level information, but the expression p = [length(list) for list in lists] “loses” that quality: the length of p isn’t known to the compiler. It also allocates unnecessarily. Same with ranges and c.
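To make the difference concrete (my example): map over a tuple returns a tuple, preserving the number of elements in the type, while an array comprehension returns a Vector and discards it.

```julia
lists = (rand(3), rand(4), rand(5))

# Tuple result: the length 3 is part of the type NTuple{3, Int64},
# so downstream code can specialize on it.
p_tuple = map(length, lists)                 # (3, 4, 5)

# Vector result: the length is a runtime property only, and the
# comprehension allocates on every call.
p_vector = [length(list) for list in lists]  # [3, 4, 5]

typeof(p_tuple)   # Tuple{Int64, Int64, Int64}
typeof(p_vector)  # Vector{Int64}
```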

Try something like this instead:

function sumprod2(lists::Vararg{Vector{Float64}})
    p = map(length, lists)
    to_range = n -> Base.OneTo(n)
    ranges = map(to_range, p)
    s = 0
    for inds in Iterators.product(ranges...)
        c = (lists[i][inds[i]] for i in 1:length(inds))
        s += reduce(*, c)
    end
    return s
end

Everything required for a fast version is known at compile time, but with varargs (the lists... argument) you sometimes need to force specialization using the type annotation Vararg{T,N} where {N}. Also, your implementation had issues with excessive allocation of intermediate arrays, as well as initializing s to a zero of the wrong type (0 is an integer zero). Here’s a fast version:

function sumprod3(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += prod(lists[i][inds[i]] for i in eachindex(lists, inds))
    end
    return s
end

Benchmarked against a version of sumprod1 with the initialization issue fixed:

julia> using BenchmarkTools

julia> @btime sumprod1fixed($a, $b, $c)
  35.544 ms (0 allocations: 0 bytes)
2.067942582493859e6

julia> @btime sumprod3($a, $b, $c)
  36.265 ms (0 allocations: 0 bytes)
2.0679425824940344e6
Definition of sumprod1fixed
function sumprod1fixed(a, b, c)
    s = zero(eltype(a)) * zero(eltype(b)) * zero(eltype(c))
    for i in eachindex(a), j in eachindex(b), k in eachindex(c)
        s += a[i] * b[j] * c[k]
    end
    return s
end
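A minimal sketch (my addition) of why the s = 0 initialization matters: s starts as an Int and becomes a Float64 on the first +=, so the compiler has to allow for both types throughout the loop.

```julia
function sum_bad(a)
    s = 0                # Int: s changes type to Float64 on the first iteration,
    for x in a           # so the loop works with s::Union{Int64, Float64}
        s += x
    end
    return s
end

function sum_good(a)
    s = zero(eltype(a))  # a Float64 zero for a Vector{Float64}: s never changes type
    for x in a
        s += x
    end
    return s
end
```

Both return the same value for a nonempty Vector{Float64}, but only sum_good is type-stable (sum_bad returns the Int 0 for an empty input, which is exactly the union the compiler must account for).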

Avoiding the use of an iterator here could give possible further perf improvement. Something like this:

function sumprod4(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += mapreduce(getindex, *, lists, inds)
    end
    s
end

That version, verbatim, was my first attempt, but it’s a hair slower, which is why I went with the generator. Julia’s mapreduce appears not to be fully optimized in general.

julia> @btime sumprod4($a, $b, $c)
  40.655 ms (0 allocations: 0 bytes)
2.0679425824940344e6

Awesome, thank you

FWIW, on the laptop in my lap, with nightly Julia, both versions seem equally fast (13.4 ms):

julia> using BenchmarkTools

julia> (@benchmark sumprod3(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.33713225e7

julia> (@benchmark sumprod4(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.336118e7

Interesting. The difference (~36 ms vs ~40 ms) is reproducible on my laptop, also when sampling different a, b, c per sample like you did (and using battery saver mode to avoid thermal throttling).

FTR providing an init keyword argument to the mapreduce in the loop body in sumprod4 achieves a further, small, speedup. Or maybe it was just noise, lol.

FWIW:

julia> versioninfo()
Julia Version 1.12.0-DEV.1205
Commit 346f38bceab (2024-09-14 19:40 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Maybe an improvement between 1.10.5 and 1.12, then; I’m not quite as bleeding edge as you. (Also, I have an Intel i7, but I don’t suppose Intel vs. AMD would matter for the ability to optimize mapreduce relative to generators.)


Have you considered re-distributing into sumprodX(vecs...) = prod((sum(vec) for vec in vecs))?

julia> @btime sumprod1($a,$b,$c)
  22.297 ms (0 allocations: 0 bytes)
2.2209763892047782e6

julia> @btime sumprodX($a,$b,$c)
  104.608 ns (0 allocations: 0 bytes)
2.220976389204965e6

Even simpler: prod(sum, vecs)


To clarify a little: The product-of-sums vs sum-of-product formulation is not so much about coding style or type inference, it’s about complexity class (O(n*k) instead of O(n^k), for k vectors of n elements each).
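A quick numerical check of that identity (my sketch): the sum over every index combination factorizes into a product of per-vector sums, replacing O(n^k) work with O(n*k).

```julia
a, b, c = rand(4), rand(5), rand(6)

# O(n^k): one term per index combination (here 4*5*6 = 120 terms)
lhs = sum(a[i] * b[j] * c[k]
          for i in eachindex(a), j in eachindex(b), k in eachindex(c))

# O(n*k): one sum per vector, then a single product (here 4+5+6 = 15 additions)
rhs = prod(sum, (a, b, c))

lhs ≈ rhs  # true, up to floating-point rounding
```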

It’s not clear from your question whether that is relevant to your real setting – might as well be that you have a different problem, and you just chose an ill-considered stand-in / model problem for discourse (ill-considered because it admits structure and optimizations that are not present in your real problem). Choosing the right simplified toy model is hard!


Yes, exactly, it is not relevant to my problem. I have a different problem, and this sum-of-products was a simplified toy model.