Hi everyone, I am looking for the most performant way to create a CuArray where coefficients are 0 everywhere but 1 at specified indices.
An easy way to do that with regular arrays would be
a = randn(1000,1000)
imin = argmin(a,dims=1) # coefficients where we want b[i] =1
b = zeros(size(a))
b[imin] .= 1
But it gets trickier with CuArrays
using CUDA
CUDA.allowscalar(false)
a = CUDA.randn(1000,1000)
imin = argmin(a,dims=1)
One cannot broadcast in a similar way as above, since we have not allowed scalar because of performance issues. One could do
imin = argmin(a,dims=1) |> Array
b = zeros(size(a))
b[imin] .= 1
b = b |> CuArray
but this involves some back and forth between the gpu and the cpu, that is not nice.
Any idea of a trick to get around this problem? Cheers!
You may want to consider constructing the array in a CSC format.
Note that the timings for f_csc are only relevant for the provided MWE, but you would have to rethink how you construct the sparsity pattern in practice.
using CUDA
using SparseArrays
CUDA.allowscalar(false)
function f_original(N)
a = randn(N, N)
imin = argmin(a,dims=1) |> Array
b = zeros(size(a))
b[imin] .= 1
b = b |> CuArray
return b
end
function f_csc(N)
dcolptr = cu(collect(1:N+1))
drowval = cu(rand(1:N, N))
dnzval = CUDA.ones(Int64, N)
dsa = CUSPARSE.CuSparseMatrixCSC(dcolptr, drowval, dnzval, (N, N))
return dsa
end
for N_pow in 0:12
N = 2^N_pow
println("\nN = $N")
@time CUDA.@sync f_original(N)
@time CUDA.@sync f_csc(N)
end