Nested for loops vs Iterators.product performance

Nichola · September 14, 2024, 11:03pm

Assume I have some vectors a,b,c… and I want to compute
\sum_{ijk..} a_i b_j c_k ...

For 3 lists, I can do it using nested for loops :

function sumprod1(a, b, c)
    s = 0
    for i in 1:length(a)
        for j in 1:length(b)
            for k in 1:length(c)
                s += a[i]*b[j]*c[k]
            end
        end
    end
    return s
end

Or wite a general function that make use of Iterators.product :

function sumprod2(lists::Vararg{Vector{Float64}})
    s = 0
    p = [length(list) for list in lists]
    ranges = [1:i for i in p]
    for inds in Iterators.product(ranges...)
        c = [lists[i][inds[i]] for i in 1:length(inds)]
        s += reduce(*, c)
    end
    return s
end

The problem is that the general solution is much slower than the nested for loops:

N = 2^8
a = rand(N)
b = rand(N)
c = rand(N)
@time println(sumprod1(a,b,c))
@time println(sumprod2(a,b,c))

returns

  0.040710 seconds (9.22 k allocations: 719.312 KiB, 62.68% compilation time)
  8.615370 seconds (151.25 M allocations: 7.267 GiB, 2.29% gc time, 1.29% compilation time)

How can I understand this ? Is there a way to speedup the general solution ?

stevengj · September 14, 2024, 11:17pm

You have no hope of coming close to the for loops here, because length(ranges) in your code is not known until runtime, so the Iterators.product is type-unstable.

If you want to have decent performance with Iterators.product, you need to use containers whose lengths (i.e. the “dimensionality” of your loop) are known at compile time, like tuples.

Nichola · September 14, 2024, 11:31pm

I see that most of the time is spent in lists[i][inds[i]] . Could it be that the problem comes from not knowing the order in which the vectors are iterated over ? Could there be a way to specify they will be iterated over in sequential order ?
Also the second option allocates way more memory than the first one, maybe suggesting that its not using the cache ? Can I force cache usage ?

nsajko · September 14, 2024, 11:49pm

The length of lists is known from type-level information, but this expression “loses” that quality, the length of p perhaps isn’t known to the compiler. Also it allocates unnecessarily. Same with ranges and c.

Try something like this instead:

function sumprod2(lists::Vararg{Vector{Float64}})
    p = map(length, lists)
    to_range = n -> Base.OneTo(n)
    ranges = map(to_range, p)
    s = 0
    for inds in Iterators.product(ranges...)
        c = (lists[i][inds[i]] for i in 1:length(inds))
        s += reduce(*, c)
    end
    return s
end

danielwe · September 14, 2024, 11:51pm

Everything required for a fast version is known at compile time, but with varargs (the list... argument) you sometimes need to force specialization using the type annotation Vararg{T,N} where {N}. Also, your implementation had issues with excessive allocation of intermediate arrays, as well as initializing s to a zero of the wrong type (0 is an integer zero). Here’s a fast version:

function sumprod3(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += prod(lists[i][inds[i]] for i in eachindex(lists, inds))
    end
    return s
end

Benchmarked against a version of sumprod1 with the initialization issue fixed:

julia> using BenchmarkTools

julia> @btime sumprod1fixed($a, $b, $c)
  35.544 ms (0 allocations: 0 bytes)
2.067942582493859e6

julia> @btime sumprod3($a, $b, $c)
  36.265 ms (0 allocations: 0 bytes)
2.0679425824940344e6

Definition of sumprod1fixed

function sumprod1fixed(a, b, c)
    s = zero(eltype(a)) * zero(eltype(b)) * zero(eltype(c))
    for i in eachindex(a), j in eachindex(b), k in eachindex(c)
        s += a[i] * b[j] * c[k]
    end
    return s
end

nsajko · September 15, 2024, 12:02am

Avoiding the use of an iterator here could give possible further perf improvement. Something like this:

function sumprod4(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += mapreduce(getindex, *, lists, inds)
    end
    s
end

danielwe · September 15, 2024, 12:05am

That version, verbatim, was my first attempt, but it’s a hair slower, which is why I went with the generator. Julia’s mapreduce appears not to be fully optimized in general.

julia> @btime sumprod4($a, $b, $c)
  40.655 ms (0 allocations: 0 bytes)
2.0679425824940344e6

Nichola · September 15, 2024, 12:06am

Awesome, thank you

nsajko · September 15, 2024, 12:14am

FWIW, on the laptop in my lap, with nightly Julia, both versions seem equally fast (13.4 ms):

julia> using BenchmarkTools

julia> (@benchmark sumprod3(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.33713225e7

julia> (@benchmark sumprod4(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.336118e7

danielwe · September 15, 2024, 12:20am

Interesting. The difference (~36 ms. vs ~40 ms) is reproducible on my laptop, also when sampling different a, b, c per sample like you did (and using battery saver mode to avoid thermal throttling).

nsajko · September 15, 2024, 12:21am

FTR providing an init keyword argument to the mapreduce in the loop body in sumprod4 achieves a further, small, speedup. Or maybe it was just noise, lol.

nsajko · September 15, 2024, 12:23am

FWIW:

julia> versioninfo()
Julia Version 1.12.0-DEV.1205
Commit 346f38bceab (2024-09-14 19:40 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

danielwe · September 15, 2024, 12:33am

Maybe an improvement between 1.10.5 and 1.12 then, I’m not quite as bleeding edge as you (also I have an Intel i7, but I don’t suppose Intel vs. AMD would matter for the ability to optimize mapreduce relative to generators)

foobar_lv2 · September 16, 2024, 8:41am

Have you considered re-distributing into sumprodX(vecs...) = prod((sum(vec) for vec in vecs))?

julia> @btime sumprod1($a,$b,$c)
  22.297 ms (0 allocations: 0 bytes)
2.2209763892047782e6

julia> @btime sumprodX($a,$b,$c)
  104.608 ns (0 allocations: 0 bytes)
2.220976389204965e6

stevengj · September 16, 2024, 12:44pm

Even simpler: prod(sum, vecs)

foobar_lv2 · September 16, 2024, 1:34pm

To clarify a little: The product-of-sums vs sum-of-product formulation is not so much about coding style or type inference, it’s about complexity class (O(n*k) instead of O(n^k), for k vectors of n elements each).

It’s not clear from your question whether that is relevant to your real setting – might as well be that you have a different problem, and you just chose an ill-considered stand-in / model problem for discourse (ill-considered because it admits structure and optimizations that are not present in your real problem). Choosing the right simplified toy model is hard!

Nichola · September 16, 2024, 1:40pm

Yes exaclty, it is not relevant to my problem. I have a different problem and this sum-of-product was a simplified toy model.

Topic		Replies	Views
Work around splatting to avoid unecessary allocations Performance	2	2106	January 16, 2019
Iterators.product is not type stable? General Usage type-stability , iterators , cartesianindices	20	3134	September 9, 2018
How to achieve the greatest speed with nested loops? Performance	10	1946	July 24, 2020
Threads over iterator products General Usage parallel	10	1481	June 24, 2021
Making iterators as fast as loops General Usage question	25	4087	June 14, 2020

Nested for loops vs Iterators.product performance

Related topics