Is there any way to make the second code snippet above work? I want to avoid memory allocation as much as possible, since map! will be called inside a loop.
A simple function like this does not allocate anything:
julia> function maxcol!(matrix, dest)
           i = firstindex(dest)
           for col in eachcol(matrix)
               dest[i] = maximum(col)
               i += 1
           end
       end
maxcol! (generic function with 1 method)
julia> x = rand(10,10);
julia> y = zeros(10);
julia> @btime maxcol!($x,$y)
195.139 ns (0 allocations: 0 bytes)
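As an aside, when the reduction is a plain column-wise maximum, Base also provides the in-place maximum!, which writes into a preallocated destination and avoids the intermediate array as well. A minimal sketch (not from the original thread):

```julia
# Non-allocating column-wise maximum via Base.maximum!.
# maximum!(r, A) reduces A over the dimensions where r has size 1,
# so a 1×n destination collects per-column maxima.
x = rand(10, 10)
y = zeros(10)

maximum!(reshape(y, 1, :), x)   # fills y with the column maxima of x

@assert y == vec(maximum(x, dims = 1))
```

The reshape shares memory with y, so the result lands directly in the preallocated vector.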
Indeed, that allocates nothing. For small arrays (~10 columns) the map-based version seems to be somewhat slower than the manual loop, but not for larger ones:
julia> x = rand(1000,1000); y = zeros(1000);
julia> function f!(x, y)
           y .= map(maximum, eachcol(x))
           nothing
       end
f! (generic function with 2 methods)
julia> @btime f!($x,$y)
947.502 μs (1 allocation: 7.94 KiB)
julia> @btime maxcol!($x,$y)
946.911 μs (0 allocations: 0 bytes)
julia> x = rand(10,10); y = zeros(10);
julia> @btime f!($x,$y)
250.669 ns (1 allocation: 160 bytes)
julia> @btime maxcol!($x,$y)
210.236 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.7.0-DEV.536
But that is exactly the same in 1.5.3:
julia> x = rand(1000,1000); y = zeros(1000);
julia> function f!(x, y)
           y .= map(maximum, eachcol(x))
           nothing
       end
f! (generic function with 1 method)
julia> @btime f!($x,$y)
950.048 μs (1 allocation: 7.94 KiB)
julia> versioninfo()
Julia Version 1.5.3
Edit: I noticed now that you are both mentioning Iterators.map explicitly, which is not the same function as Base.map, but the result is the same with both on 1.7.
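For reference, the difference is that Iterators.map is lazy: it returns a generator that wraps the iterator, while Base.map materializes an array immediately. A quick illustration (my own example, not from the thread):

```julia
x = rand(10, 10)

eager = map(maximum, eachcol(x))             # allocates a Vector right away
lazy  = Iterators.map(maximum, eachcol(x))   # lazily wraps the iterator

@assert eager isa Vector
@assert lazy isa Base.Generator
@assert collect(lazy) == eager   # same values once collected
```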
Thanks, @lmiq. Yes, a custom function can certainly address the issue. My actual need is to use CUDA.map! (which has the same syntax as map!). The custom function has to be written as a kernel (with some additional lines) and called with @cuda, which is why I wanted to use map! directly.
It seems for now that the second code snippet map!(maximum, t, eachcol(x)) does not work. I will go with the custom function (i.e., a custom kernel in CUDA).
function maxcol!(matrix, dest)
    i = threadIdx().x
    if i > size(matrix, 2)
        return nothing
    end
    m = -Inf32
    @inbounds for j = 1:size(matrix, 1)
        m = max(matrix[j, i], m)
    end
    dest[i] = m
    nothing
end
matrix = CUDA.randn(Float32, 10, 1000)
dest = CUDA.zeros(Float32, 1000);
@cuda threads=1000 maxcol!(matrix, dest)
dest |> Array # to display
Note that:
- We cannot use maximum inside a kernel directly; calling maximum(@view matrix[:, i]) there tries to launch another kernel and results in an error. I wonder whether there are better solutions than writing my own for loop.
- The above code is just for demonstration and is by no means high-performance; e.g., it does not take memory coalescing into consideration.
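One possible alternative to a hand-written kernel (an assumption on my part, not verified on a GPU here): CUDA.jl implements the standard array-reduction API, so the in-place Base.maximum! should dispatch to a GPU reduction when given CuArrays, keeping the code identical on CPU and GPU. A sketch, with the CPU path checked:

```julia
# Column-wise maximum into a preallocated destination.
# maximum!(r, A) reduces over the dimensions where r has size 1,
# so reshaping dest to 1×n yields per-column maxima.
function maxcols!(dest, matrix)
    maximum!(reshape(dest, 1, :), matrix)
    return dest
end

# CPU check of the same code path:
m = rand(Float32, 10, 1000)
d = zeros(Float32, 1000)
maxcols!(d, m)
@assert d == vec(maximum(m, dims = 1))

# On the GPU the call would look the same (assumed, not tested here):
# using CUDA
# matrix = CUDA.randn(Float32, 10, 1000)
# dest = CUDA.zeros(Float32, 1000)
# maxcols!(dest, matrix)
```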