 # Sparse Matrix with CUDA.jl

Hi everyone, I am looking for the most performant way to create a `CuArray` where coefficients are `0` everywhere but `1` at specified indices.
An easy way to do that with regular arrays would be

``````julia
a = randn(1000, 1000)
imin = argmin(a, dims=1) # coefficients where we want b[i] = 1
b = zeros(size(a))
b[imin] .= 1
``````

But it gets trickier with `CuArray`s:

``````julia
using CUDA
CUDA.allowscalar(false)
a = CUDA.randn(1000, 1000)
imin = argmin(a, dims=1)
``````

One cannot broadcast in the same way as above, since scalar indexing is disabled for performance reasons. One could do

``````julia
imin = argmin(a, dims=1) |> Array
b = zeros(size(a))
b[imin] .= 1
b = b |> CuArray
``````

but this involves a round trip between the GPU and the CPU, which is not ideal.
Any idea of a trick to get around this problem? Cheers!

This may not be optimal, but it seems to be faster on average:

``````julia
b = CUDA.zeros(size(a))
b .= a .== minimum(a; dims=1)
``````
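A variant of the same idea (a sketch, not necessarily optimal) skips the separate `b` buffer by materializing the comparison directly. Note that, like the snippet above and unlike `argmin`, it flags every entry tied for the column minimum, not just the first:

``````julia
using CUDA

a = CUDA.randn(1000, 1000)
# The broadcast comparison already returns a CuArray{Bool}; converting
# its element type gives the 0/1 matrix without a separate zeros() buffer.
b = Float32.(a .== minimum(a; dims=1))
``````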

I’d also be interested to know what is the best way to do this.


You may want to consider constructing the array in a CSC format.
Note that the timings for `f_csc` are only relevant for the provided MWE; in practice you would have to rethink how you construct the sparsity pattern.

``````julia
using CUDA
using SparseArrays
CUDA.allowscalar(false)

function f_original(N)
    a = randn(N, N)
    imin = argmin(a, dims=1) |> Array
    b = zeros(size(a))
    b[imin] .= 1
    b = b |> CuArray
    return b
end

function f_csc(N)
    dcolptr = cu(collect(1:N+1))
    drowval = cu(rand(1:N, N))
    dnzval = CUDA.ones(Int64, N)
    dsa = CUSPARSE.CuSparseMatrixCSC(dcolptr, drowval, dnzval, (N, N))
    return dsa
end

for N_pow in 0:12
    N = 2^N_pow
    println("\nN = $N")
    @time CUDA.@sync f_original(N)
    @time CUDA.@sync f_csc(N)
end
``````
`@time` results:
``````
N = 1
0.005567 seconds (24 allocations: 848 bytes)
0.000562 seconds (94 allocations: 3.422 KiB)

N = 2
0.000127 seconds (27 allocations: 1.281 KiB)
0.000565 seconds (94 allocations: 3.453 KiB)

N = 4
0.000173 seconds (27 allocations: 1.547 KiB)
0.000665 seconds (94 allocations: 3.547 KiB)

N = 8
0.000155 seconds (27 allocations: 2.516 KiB)
0.000512 seconds (94 allocations: 3.734 KiB)

N = 16
0.000131 seconds (27 allocations: 5.859 KiB)
0.000475 seconds (94 allocations: 4.109 KiB)

N = 32
0.000124 seconds (24 allocations: 18.156 KiB)
0.000488 seconds (94 allocations: 4.891 KiB)

N = 64
0.000157 seconds (26 allocations: 67.406 KiB)
0.000507 seconds (94 allocations: 6.500 KiB)

N = 128
0.000238 seconds (35 allocations: 262.094 KiB)
0.000495 seconds (94 allocations: 9.688 KiB)

N = 256
0.000495 seconds (26 allocations: 1.011 MiB)
0.000508 seconds (94 allocations: 15.797 KiB)

N = 512
0.001965 seconds (26 allocations: 4.020 MiB)
0.000523 seconds (94 allocations: 27.828 KiB)

N = 1024
0.013047 seconds (203 allocations: 16.043 MiB, 41.24% gc time)
0.001215 seconds (108 allocations: 55.883 KiB)

N = 2048
0.031436 seconds (44 allocations: 64.079 MiB, 7.08% gc time)
0.000608 seconds (101 allocations: 100.219 KiB)

N = 4096
0.115517 seconds (48 allocations: 256.158 MiB, 2.73% gc time)
0.000693 seconds (108 allocations: 196.422 KiB)
``````
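For reference, here is a small CPU-side illustration (using `SparseArrays` only, with a made-up 3×3 example) of what the three buffers passed to `CuSparseMatrixCSC` in `f_csc` encode:

``````julia
using SparseArrays

# One nonzero per column: row 2 in column 1, row 1 in column 2, row 3 in column 3.
s = sparse([2, 1, 3], [1, 2, 3], ones(3), 3, 3)

# Column j's nonzeros live in nzval[colptr[j]:colptr[j+1]-1],
# at the rows listed in rowval over the same range.
s.colptr  # [1, 2, 3, 4] -- exactly one entry per column, as in f_csc
s.rowval  # [2, 1, 3]
s.nzval   # [1.0, 1.0, 1.0]
``````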

Thank you very much for your comprehensive answer. However, it does not seem trivial to convert the `CartesianIndex` result into CSC format. But maybe I am wrong?

Here is a proposal to do so that uses the `sparse` method from `SparseArrays`,
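One possible sketch of such a conversion is given below (`indicator_csc` is an illustrative name, and this version still performs one device-to-host copy of the index buffer):

``````julia
using CUDA
using SparseArrays

function indicator_csc(a::CuMatrix)
    # Copy just the N CartesianIndex values back to the host ...
    imin = Array(argmin(a; dims=1))
    rows = map(I -> I[1], vec(imin))
    cols = map(I -> I[2], vec(imin))
    # ... build the CSC structure on the CPU ...
    s = sparse(rows, cols, ones(Float32, length(rows)), size(a)...)
    # ... and upload the three CSC buffers to the GPU.
    return CUSPARSE.CuSparseMatrixCSC(s)
end
``````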

but as you mentioned in another post, the `sparse` method is missing for `CuSparseMatrixCSC`.