Hi,
The problem with TestFun is that reduce will first create a new Vector, which then gets copied over into X. Ideally, there would be an in-place version reduce!, but there does not seem to be one. There is broadcast! though, so you could use, for example,
function TestFun3(X)
    broadcast!(prod, @view(X[:, 1]), eachrow(X))
    return X
end
(By the way, note that the usual naming convention would be testfun3! or test_fun_3!, or something similar. Further, I think it’s safer to have a preallocated Y = Array{Float64}(undef, size(X, 1)) ready and use a two-argument version of the function that writes to Y instead of the view; see the sketch below. But in any case, writing to the view seems to work fine here, and you might also have good reasons to want to write back to X[:, 1] directly.)
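A minimal sketch of that two-argument variant (the name testfun3! is just illustrative, and this assumes X has Float64 elements):

function testfun3!(Y, X)
    # write the product of each row of X into the preallocated Y
    broadcast!(prod, Y, eachrow(X))
    return Y
end

# preallocate once, then reuse Y across calls
Y = Array{Float64}(undef, size(X, 1))
testfun3!(Y, X)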
Your TestFun2 (i.e. the explicit-loop approach) can also be simplified and improved:
function TestFun4(X)
    for j in 2:size(X, 2), i in axes(X, 1)
        X[i, 1] *= X[i, j]
    end
    return X
end
As for the resulting performance, I get:
julia> @btime TestFun($X);
12.211 μs (3 allocations: 15.92 KiB)
julia> @btime TestFun2($X);
18.300 μs (1 allocation: 7.94 KiB)
julia> @btime TestFun3($X);
22.000 μs (0 allocations: 0 bytes)
julia> @btime TestFun4($X);
11.900 μs (0 allocations: 0 bytes)
The reason why TestFun3 is slower is probably that we are not accessing X in memory order.* The nice thing about TestFun4 is that it is very straightforward and explicitly lets us control the order of the nested loops, for better memory access patterns.
* EDIT: Declaring X = rand(3, 1000) and using eachcol(X) and X[1, :] in TestFun3 is roughly as fast, so this is probably not the correct explanation for why TestFun3 is slower. In any case, you might also want to think about how you want to order the data in X in your actual code.
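If you did reorder the data so that X is 3×1000 (i.e. taking the product over each column instead of each row), the explicit-loop version might look something like the following sketch; the name testfun5! and the layout are just illustrative assumptions:

function testfun5!(X)
    # outer loop over columns, inner loop over rows,
    # so X is walked in column-major memory order
    for j in axes(X, 2), i in 2:size(X, 1)
        X[1, j] *= X[i, j]
    end
    return X
end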