# Reduce allocations in 'reduce + eachcol' call to multiply several columns elementwise

Hi, I find that using `reduce` + `eachcol` in the following allocates more than I would like. The point is to element-wise multiply some columns of X. I’m getting `n` allocations where `n` is the number of rows when using `reduce` + `eachcol`. Using `prod(..., dims = 2)` allocates even more. I know it is not much, but in my code a similar step is repeated very often, so if I can reduce allocations, that would be great!

```julia
using BenchmarkTools

X = rand(1000, 3)

function TestFun(X)
    @views X[:, 1] .= reduce(.*, eachcol(X))
    # @views X[:, 1] .= prod(X, dims = 2)
    return X
end

@benchmark TestFun(X)
```

I’m aware of the following by the way, which reduces the number of allocations to 1. However, this feels to me like there would be an even easier/notationally nicer way to do this…

```julia
function TestFun2(X)
    tmp = ones(size(X, 1))
    for i in eachcol(X)
        tmp .= tmp .* i
    end
    @views X[:, 1] .= tmp
    return X
end
```

Hi,

The problem with `TestFun` is that the `reduce` will first create a `Vector`, which then gets copied over into `X`. Ideally, there would be an in-place version `reduce!` but that does not seem to be the case. There is `broadcast!` though, so you could use for example

```julia
function TestFun3(X)
    broadcast!(prod, @view(X[:, 1]), eachrow(X))
    return X
end
```

(By the way, note that the usual naming convention would be `testfun3!` or `test_fun_3!`, or something similar. Further, I think it’s safer to have a preallocated `Y = Array{Float64}(undef, size(X, 1))` ready, and use a two-argument version of the function, where you write to `Y` instead of the view. But in any case, writing to the view seems to work fine here. You might also have good reasons to want to write back to `X[:, 1]` directly.)
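For concreteness, here is a minimal sketch of that two-argument pattern (the name `rowprod!` is illustrative, not from the thread):

```julia
# Two-argument variant: write the row-wise products into a
# preallocated output Y instead of a view into X.
function rowprod!(Y, X)
    broadcast!(prod, Y, eachrow(X))
    return Y
end

X = rand(1000, 3)
Y = Array{Float64}(undef, size(X, 1))
rowprod!(Y, X)  # Y now holds the products; X is left untouched
```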

Your `TestFun2` (i.e. explicit loop approach) can also be simplified and improved:

```julia
function TestFun4(X)
    for j in 2:size(X, 2), i in axes(X, 1)
        X[i, 1] *= X[i, j]
    end
    return X
end
```

As for the resulting performance, I get:

```julia
julia> @btime TestFun($X);
  12.211 μs (3 allocations: 15.92 KiB)

julia> @btime TestFun2($X);
  18.300 μs (1 allocation: 7.94 KiB)

julia> @btime TestFun3($X);
  22.000 μs (0 allocations: 0 bytes)

julia> @btime TestFun4($X);
  11.900 μs (0 allocations: 0 bytes)
```

The reason why `TestFun3` is slower is probably that we are not accessing `X` in memory order.* The nice thing about `TestFun4` is that it is very straightforward, and it explicitly lets us control the order of the nested loops, for better memory access patterns.

* EDIT: Declaring `X = rand(3, 1000)` and using `eachcol(X)` and `X[1, :]` in `TestFun3` is roughly as fast, so this is not the correct explanation for why `TestFun3` is slower. In any case, you might also want to think about how you order the data in `X` in your actual code.
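For reference, a sketch of that transposed layout (assuming you are free to store the data as 3×1000, and using a separate output vector, which is an illustrative addition rather than part of the thread):

```julia
# Transposed layout: each small vector is a column, matching Julia's
# column-major storage, so each prod reads contiguous memory.
Xt = rand(3, 1000)
Y = Vector{Float64}(undef, size(Xt, 2))
broadcast!(prod, Y, eachcol(Xt))  # Y[j] is the product of column j
```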


In this case, I see decent performance and only one allocation from `prod!` a la

```julia
julia> prod!(similar(@view(X[:, 1:1])), X)
```

If you are feeling extremely brave, you can drop the `similar` and this will run entirely in place. However, it may not produce the correct result without the `similar`. In one test of mine it appeared to work, but I was sloppy about reading the result; it could break in any version, for any input, because the output and input alias, which is not promised to work. The above-suggested `TestFun3` based on `broadcast!` into a view has the same issue. One should really not do this.

`TestFun4` is the best implementation. Don’t be afraid to write your own loops!
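If you do write your own loop and can guarantee the indices are valid, an `@inbounds` annotation may shave off a little more; a sketch with the same logic as `TestFun4` (the name is illustrative):

```julia
# Same loop as TestFun4, with bounds checks elided via @inbounds.
# Only safe here because the loop ranges come directly from size/axes of X.
function testfun4_inbounds!(X)
    @inbounds for j in 2:size(X, 2), i in axes(X, 1)
        X[i, 1] *= X[i, j]
    end
    return X
end
```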

I don’t know why `prod(...; dims)` is notably worse than `prod!` (even when we count allocating the result for `prod!`). Hopefully that is improved sometime. PRs welcome! Although I suspect this is a complicated one to improve in general.


On my machine there is no performance difference between the two:

```julia
julia> function TestFunProd(X)
           @views X[:, 1] .= prod(X, dims = 2)  # (by the way, like in TestFun itself, the @views is useless here)
           return X
       end;

julia> function TestFunProd!(X)  # (I'm aware this is a confusing name, see the note in my first post)
           X[:, 1] .= prod!(similar(@view(X[:, 1:1])), X)
           return X
       end;

julia> @btime TestFunProd($X);
  17.900 μs (5 allocations: 8.02 KiB)

julia> @btime TestFunProd!($X);
  17.800 μs (1 allocation: 7.94 KiB)
```

So if `TestFunProd` is indeed noticeably slower for you, then I certainly agree that it will be complicated to improve `prod(...; dims)` in general.

versioninfo
```julia
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
```

Without the `similar` I actually get different, i.e. incorrect, results here (whereas for `TestFun3`’s `broadcast!` it was fine for me).

```julia
julia> X = rand(1, 3)
1×3 Matrix{Float64}:
 0.909626  0.930354  0.564601

julia> function TestFunProd!Unsafe(X)
           prod!(@view(X[:, 1]), X)  # (or 1:1, gives the same results)
           return X
       end;

julia> prod(X)
0.47780729264507804

julia> TestFunProd!Unsafe(X)
1×3 Matrix{Float64}:
 0.525279  0.930354  0.564601
```

The `X[:, 1]` entries seem to be ignored for the product. I haven’t looked at the source code for `prod!`, but I can imagine that for `prod!(Y, X)` you initialise `Y` to ones, and then start accumulating factors from `X`, which would explain the incorrect results.
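To illustrate that hypothesis (this is a sketch, not the actual `prod!` source), suppose the implementation looked roughly like:

```julia
# Hypothetical accumulation scheme: initialise the output to one,
# then multiply in every column of X, including the first.
function naive_prod!(Y, X)
    fill!(Y, one(eltype(Y)))
    for j in axes(X, 2), i in axes(X, 1)
        Y[i] *= X[i, j]
    end
    return Y
end
```

Then passing `Y = @view(X[:, 1])` would overwrite `X[:, 1]` with ones before it is ever read, yielding exactly the product of the remaining columns, which matches `0.525279 ≈ 0.930354 * 0.564601` above.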

In any case, it emphasises your warning, and indeed the explicit one in the documentation of `prod!` about the mutated argument sharing memory with any other argument. I would still suggest in this case to use the two-argument mutating version

```julia
function TestFunProd!Inplace(Y, X)  # (also a reminder that I would advise against this naming scheme; here you would not even need a new function, just use prod! directly)
    prod!(Y, X)
end
```

where `Y = Array{Float64}(undef, size(X, 1))` or `Y = similar(@view(X[:, 1]))` (or `1:1`), and then just use `Y` later on instead of `X[:, 1]`. By moving the allocation outside of the method, you increase your options for memory reuse.
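A quick sketch of that reuse pattern (the outer loop is hypothetical, standing in for the frequently repeated step mentioned in the original post):

```julia
X = rand(1000, 3)
Y = similar(@view(X[:, 1]))  # allocated once, outside the hot loop
for _ in 1:100               # hypothetical outer loop of the real code
    prod!(Y, X)              # zero allocations per iteration
    # ... use Y here wherever X[:, 1] would have been used ...
end
```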


Thank you both! Using bare loops seems best in my context, thank you for your suggestions and explanations!


One last remark: I notice you have `3` as one of your dimensions. If this is a deliberate choice and not merely part of the example, then I’ll add another suggestion. When working with small (e.g., <100-element), fixed-size vectors, I often find StaticArrays.jl improves ergonomics and/or performance.

```julia
julia> using StaticArrays

julia> X = rand(SVector{3,Float64}, 1000); # a Vector of 3-vectors

julia> prod.(X) # broadcast the product over each 3-vector
1000-element Vector{Float64}:
...
```

The great thing about `SVector`s is that they don’t require any allocations for most operations (although storing a `Vector` of them will still require memory). They are simply `NTuple`s that are made to act like vectors. I find `Vector{SVector{N,T}}` easier to reason about and work with than a `Matrix` that encodes the small vectors as rows/columns.

But they are immutable so you won’t be able to write the result into the first column like you want to do in this example. Still, you might find they improve the rest of your code/performance enough to be worth it.
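For completeness, a sketch of how the results could still be stored without per-step allocations (the `out` buffer is an illustrative addition, not from the thread):

```julia
using StaticArrays

X = rand(SVector{3,Float64}, 1000)
out = Vector{Float64}(undef, length(X))  # preallocated once
out .= prod.(X)  # fused broadcast: no temporaries allocated
```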
