# Reduce allocations in 'reduce + eachcol' call to multiply several columns elementwise

Hi, I find that using `reduce` + `eachcol` in the following allocates more than I would like. The point is to element-wise multiply some columns of X. I’m getting `n` allocations where `n` is the number of rows when using `reduce` + `eachcol`. Using `prod(..., dims = 2)` allocates even more. I know it is not much, but in my code a similar step is repeated very often, so if I can reduce allocations, that would be great!

```julia
using BenchmarkTools

X = rand(1000, 3)

function TestFun(X)
    @views X[:, 1] .= reduce(.*, eachcol(X))
    # @views X[:, 1] .= prod(X, dims = 2)
    return X
end

@benchmark TestFun(X)
```

I’m aware of the following by the way, which reduces the number of allocations to 1. However, this feels to me like there would be an even easier/notationally nicer way to do this…

```julia
function TestFun2(X)
    tmp = ones(size(X, 1))
    for i in eachcol(X)
        tmp .= tmp .* i
    end
    @views X[:, 1] .= tmp
    return X
end
```

Hi,

The problem with `TestFun` is that the `reduce` will first create a `Vector`, which then gets copied over into `X`. Ideally, there would be an in-place version `reduce!` but that does not seem to be the case. There is `broadcast!` though, so you could use for example

```julia
function TestFun3(X)
    broadcast!(prod, @view(X[:, 1]), eachrow(X))
    return X
end
```

(By the way, note that the usual naming convention would be `testfun3!` or `test_fun_3!`, or something similar. Further, I think it’s safer to have a preallocated `Y = Array{Float64}(undef, size(X, 1))` ready, and use a two-argument version of the function, where you write to `Y` instead of the view. But in any case, writing to the view seems to work fine here. You might also have good reasons to want to write back to `X[:, 1]` directly.)
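For concreteness, here is a minimal sketch of that two-argument pattern (the name `rowprod!` is illustrative, not from the thread):

```julia
# Two-argument variant: write the row-wise products into a
# preallocated output Y instead of a view into X.
function rowprod!(Y, X)
    broadcast!(prod, Y, eachrow(X))
    return Y
end

X = rand(1000, 3)
Y = Array{Float64}(undef, size(X, 1))
rowprod!(Y, X)  # Y now holds the products; X is left untouched
```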

Your `TestFun2` (i.e. explicit loop approach) can also be simplified and improved:

```julia
function TestFun4(X)
    for j in 2:size(X, 2), i in axes(X, 1)
        X[i, 1] *= X[i, j]
    end
    return X
end
```

As for the resulting performance, I get:

```julia
julia> @btime TestFun($X);
  12.211 μs (3 allocations: 15.92 KiB)

julia> @btime TestFun2($X);
  18.300 μs (1 allocation: 7.94 KiB)

julia> @btime TestFun3($X);
  22.000 μs (0 allocations: 0 bytes)

julia> @btime TestFun4($X);
  11.900 μs (0 allocations: 0 bytes)
```

The reason why `TestFun3` is slower is probably that we are not accessing `X` in memory order.* The nice thing about `TestFun4` is that it is very straightforward, and it explicitly lets us control the order of the nested loops, for better memory access patterns.

* EDIT: Declaring `X = rand(3, 1000)` and using `eachcol(X)` and `X[1, :]` in `TestFun3` is roughly as fast, so this is not the correct explanation for why `TestFun3` is slower. In any case, you might also want to think about how you order the data in `X` in your actual code.
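For reference, a sketch of that transposed layout (assuming you are free to store the data as 3×1000, and using a separate output vector, which is an illustrative addition rather than part of the thread):

```julia
# Transposed layout: each small vector is a column, matching Julia's
# column-major storage, so each prod reads contiguous memory.
Xt = rand(3, 1000)
Y = Vector{Float64}(undef, size(Xt, 2))
broadcast!(prod, Y, eachcol(Xt))  # Y[j] is the product of column j
```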


In this case, I see decent performance and only one allocation from `prod!` a la

```julia
julia> prod!(similar(@view(X[:, 1:1])), X)
```

If you are feeling extremely brave, you can drop the `similar` and this will run entirely in place. However, it may not produce the correct result without the `similar`. In one test of mine it appeared to work, but I was sloppy about reading the result; it could break in any version, for any input, because the output and input alias, which is not promised to work. The above-suggested `TestFun3` based on `broadcast!` into a view has the same issue. One should really not do this.

`TestFun4` is the best implementation. Don’t be afraid to write your own loops!
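If you do write your own loop and can guarantee the indices are valid, an `@inbounds` annotation may shave off a little more; a sketch with the same logic as `TestFun4` (the name is illustrative):

```julia
# Same loop as TestFun4, with bounds checks elided via @inbounds.
# Only safe here because the loop ranges come directly from size/axes of X.
function testfun4_inbounds!(X)
    @inbounds for j in 2:size(X, 2), i in axes(X, 1)
        X[i, 1] *= X[i, j]
    end
    return X
end
```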

I don’t know why `prod(...; dims)` is notably worse than `prod!` (even when we count allocating the result for `prod!`). Hopefully that is improved sometime. PRs welcome! Although I suspect this is a complicated one to improve in general.


On my machine there is no performance difference between the two:

```julia
julia> function TestFunProd(X)
           @views X[:, 1] .= prod(X, dims = 2)  # (by the way, like in TestFun itself, the @views is useless here)
           return X
       end;

julia> function TestFunProd!(X)  # (I'm aware this is a confusing name, see the note in my first post)
           X[:, 1] .= prod!(similar(@view(X[:, 1:1])), X)
           return X
       end;

julia> @btime TestFunProd($X);
  17.900 μs (5 allocations: 8.02 KiB)

julia> @btime TestFunProd!($X);
  17.800 μs (1 allocation: 7.94 KiB)
```

So if `TestFunProd` is indeed noticeably slower for you, then I certainly agree that it will be complicated to improve `prod(...; dims)` in general.

versioninfo
```julia
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
```

Without the `similar` I actually get different, i.e. incorrect, results here (whereas for `TestFun3`’s `broadcast!` it was fine for me).

```julia
julia> X = rand(1, 3)
1×3 Matrix{Float64}:
 0.909626  0.930354  0.564601

julia> function TestFunProd!Unsafe(X)
           prod!(@view(X[:, 1]), X)  # (or 1:1, gives the same results)
           return X
       end;

julia> prod(X)
0.47780729264507804

julia> TestFunProd!Unsafe(X)
1×3 Matrix{Float64}:
 0.525279  0.930354  0.564601
```

The `X[:, 1]` entries seem to be ignored for the product. I haven’t looked at the source code for `prod!`, but I can imagine that for `prod!(Y, X)` you initialise `Y` to ones, and then start accumulating factors from `X`, which would explain the incorrect results.
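To illustrate that hypothesis (this is a sketch, not the actual `prod!` source), suppose the implementation looked roughly like:

```julia
# Hypothetical accumulation scheme: initialise the output to one,
# then multiply in every column of X, including the first.
function naive_prod!(Y, X)
    fill!(Y, one(eltype(Y)))
    for j in axes(X, 2), i in axes(X, 1)
        Y[i] *= X[i, j]
    end
    return Y
end
```

Then passing `Y = @view(X[:, 1])` would overwrite `X[:, 1]` with ones before it is ever read, yielding exactly the product of the remaining columns, which matches `0.525279 ≈ 0.930354 * 0.564601` above.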

In any case, it emphasises your warning, and indeed the explicit one in the documentation of `prod!` about the mutated argument sharing memory with any other argument. I would still suggest in this case to use the two-argument mutating version

```julia
function TestFunProd!Inplace(Y, X)  # (also a reminder that I would advise against this naming scheme; here you would not even need a new function, just use prod! directly)
    prod!(Y, X)
end
```

where `Y = Array{Float64}(undef, size(X, 1))` or `Y = similar(@view(X[:, 1]))` (or `1:1`), and then just use `Y` later on instead of `X[:, 1]`. By moving the allocation outside of the method, you increase your options for memory reuse.
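A quick sketch of that reuse pattern (the outer loop is hypothetical, standing in for the frequently repeated step mentioned in the original post):

```julia
X = rand(1000, 3)
Y = similar(@view(X[:, 1]))  # allocated once, outside the hot loop
for _ in 1:100               # hypothetical outer loop of the real code
    prod!(Y, X)              # zero allocations per iteration
    # ... use Y here wherever X[:, 1] would have been used ...
end
```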


Thank you both! Using bare loops seems best in my context, thank you for your suggestions and explanations!


One last remark: I notice you have `3` as one of your dimensions. If this is a deliberate choice and not merely part of the example, then I’ll add another suggestion. When working with small (e.g., <100-element), fixed-size vectors, I often find StaticArrays.jl improves ergonomics and/or performance.

```julia
julia> using StaticArrays

julia> X = rand(SVector{3,Float64}, 1000); # a Vector of 3-vectors

julia> prod.(X) # broadcast the product over each 3-vector
1000-element Vector{Float64}:
...
```

The great thing about `SVector`s is that they don’t require any allocations for most operations (although storing a `Vector` of them will still require memory). They are simply `NTuple`s that are made to act like vectors. I find `Vector{SVector{N,T}}` easier to reason about and work with than a `Matrix` that encodes the small vectors as rows/columns.

But they are immutable so you won’t be able to write the result into the first column like you want to do in this example. Still, you might find they improve the rest of your code/performance enough to be worth it.
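For completeness, a sketch of how the results could still be stored without per-step allocations (the `out` buffer is an illustrative addition, not from the thread):

```julia
using StaticArrays

X = rand(SVector{3,Float64}, 1000)
out = Vector{Float64}(undef, length(X))  # preallocated once
out .= prod.(X)  # fused broadcast: no temporaries allocated
```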
