Just to add to this: after staring at your function and @barucden’s explanation for a bit, it seems you are adding two matrices, each subsetted to only include certain rows, and then returning a vector holding one vector for each column of the resulting sum. Is that right?
If so, a simple implementation might be:
f4(y, ŷ, m) = eachcol(y[m, :] .+ ŷ[m, :]);
Here, m is also an input to the function, so we’re not relying on global variables. The code is also supremely readable and easily understood imho: add matrices y and ŷ, restricting them to only include the rows selected by m. How are we doing with this?
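Before benchmarking, a minimal sketch on a tiny input may make the behaviour concrete. The 3×2 matrices and the mask below are made up purely for illustration; only f4 itself comes from the post.

```julia
# Same definition as above, repeated so the snippet is self-contained.
f4(y, ŷ, m) = eachcol(y[m, :] .+ ŷ[m, :])

# Hypothetical toy inputs, chosen so the result is easy to check by hand.
y = [1 2; 3 4; 5 6]
ŷ = [10 20; 30 40; 50 60]
m = [true, false, true]        # keep rows 1 and 3

cols = f4(y, ŷ, m)             # iterator over the columns of the 2×2 sum

# Destructuring iterates over the columns.
first_col, second_col = cols   # [11, 55] and [22, 66]
```

Rows 1 and 3 are selected from both matrices, added elementwise, and the sum is then handed back column by column.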
julia> t1 = rand(100_000, 10); t2 = rand(100_000, 10); m = rand(Bool, 100_000);
julia> f3 = (y, ŷ) -> ((y, ŷ) -> (y[m] + ŷ[m]) ).((eachcol.((y, ŷ)))...);
julia> f4(y, ŷ, m) = eachcol(y[m, :] .+ ŷ[m, :]);
julia> using BenchmarkTools
julia> @btime f3($t1, $t2);
6.704 ms (190 allocations: 11.52 MiB)
julia> @btime f4($t1, $t2, $m);
7.678 ms (6 allocations: 11.51 MiB)
Roughly the same performance (timings are a bit variable on my machine), but we’ve cut the number of allocations by a factor of about 30 (190 to 6) by removing the access to global variables. We are cheating a little bit though, as the functions do not return the same thing:
julia> typeof(f3(t1, t2))
Vector{Vector{Float64}} (alias for Array{Array{Float64, 1}, 1})
julia> typeof(f4(t1, t2, m))
ColumnSlices{Matrix{Float64}, Tuple{OneTo{Int64}}, SubArray{Float64, 1, Matrix{Float64}, Tuple{Slice{OneTo{Int64}}, Int64}, true}}
Essentially, eachcol is just returning an iterator, which we need to collect to get a materialized vector of columns like we got from f3 (strictly speaking, a vector of column views rather than a Vector{Vector{Float64}}):
julia> f5(y, ŷ, m) = collect(eachcol(y[m, :] .+ ŷ[m, :]));
How does this fare?
julia> @btime f5($t1, $t2, $m);
7.744 ms (7 allocations: 11.51 MiB)
We’ve added an allocation, as we are now materializing the iterator through collect. You should think about whether this is necessary in your actual code, as for the most part Julia works just fine with iterators that are never materialized.
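As a rough illustration of working with the iterator directly, the sketch below consumes eachcol without ever calling collect. The matrix A and the helper sum_of_column_maxima are hypothetical names introduced only for this example.

```julia
# Hypothetical toy matrix for illustration.
A = [1.0 2.0; 3.0 4.0; 5.0 6.0]

# Higher-order functions accept the column iterator as-is.
col_sums = map(sum, eachcol(A))    # [9.0, 12.0]

# Plain loops work too; no vector of columns is ever built.
function sum_of_column_maxima(A)
    total = zero(eltype(A))
    for col in eachcol(A)
        total += maximum(col)
    end
    return total
end

maxima_total = sum_of_column_maxima(A)   # 5.0 + 6.0 = 11.0
```

If downstream code only ever loops over or maps across the columns like this, the extra collect step is probably unnecessary.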
Now if you look at the total memory footprint of the allocations, you see that while there are far fewer allocations, their total size is almost unchanged (11.52 MiB to 11.51 MiB), indicating that we are likely only saving very small allocations related to the global variable type checks. If you’ve also looked at the performance tips I linked above, you’ll have come across the section on using views for slices. In f4 and f5 we are doing y[m, :], which by default creates a copy of the selected rows of y and therefore allocates. The helpful @views macro can turn these copies into views, saving on allocations:
julia> f7(y, ŷ, m) = @views eachcol(y[m, :] .+ ŷ[m, :]);
Checking performance for this:
julia> @btime f7($t1, $t2, $m);
1.766 ms (6 allocations: 4.60 MiB)
Much better: compared to f3 we have cut the number of allocations by a factor of ~30 (190 to 6), the memory allocated from 11.52 MiB to 4.60 MiB, and the runtime by a factor of ~4, while also making the code (in my view) much more legible and easy to understand.
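One thing worth keeping in mind when switching to views is that a view shares memory with its parent array, whereas a slice copy does not. The sketch below, using a made-up 3×2 matrix A, shows the difference, and also that the columns returned by f7 come from the freshly allocated sum, so they should not alias the inputs.

```julia
# Hypothetical toy inputs for illustration.
A = [1 2; 3 4; 5 6]
m = [true, false, true]

c = A[m, :]            # copy of rows 1 and 3
v = @view A[m, :]      # view into A, no row data copied

c[1, 1] = 100          # changes only the copy
after_copy_write = A[1, 1]    # still 1

v[1, 1] = -1           # writes through to A
after_view_write = A[1, 1]    # now -1

# f7 as defined above: the views only avoid the temporary row copies;
# the broadcast still allocates a new matrix for the sum.
f7(y, ŷ, m) = @views eachcol(y[m, :] .+ ŷ[m, :])

cols = f7(A, A, m)     # columns of [-2 4; 10 12]
A[1, 1] = 0            # later changes to A do not affect the result
first_col = first(cols)       # still [-2, 10]
```

So the views inside f7 are an internal optimization: callers get independent data back, which is usually what you want here.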