Dot-product of CuArray views is slow

I have a fairly simple function that iterates over the columns of a 2-dimensional CuArray, takes inner products of these columns, and writes them to another matrix. I noticed that running this code on the GPU gave no speed-up, even though simply taking the inner product of two CuArrays was much faster on my machine than with Arrays. I assume that the problem is the views.

See the following MWE:

using CUDA
using BenchmarkTools
using LinearAlgebra

function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = Array{eltype(ψ)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = ψ[:,i] ⋅ ψ[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end

Running this on a (2^18 x 2^4) matrix, I get on the CPU (the second run, after compilation)

ψ = rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.048713 seconds (1.41 k allocations: 35.547 KiB)

and on the GPU (also the second run, after compilation)

ψ = CUDA.rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.226695 seconds (3.94 k allocations: 544.086 MiB, 12.08% gc time)

which is much slower than I hoped for and also allocates much more memory. Is there a way to take inner products of views of CuArrays without needing to allocate memory for the views?


Why do you assume that?

You are launching many short operations. That’s an inefficient way of programming a GPU, because it will be hard to saturate the device. Furthermore, each of your operations reduces to a scalar, so it effectively synchronizes execution which further reduces the speed-up you can get from using a GPU.

Each of the innermost operations is an inner product of two length-2^18 vectors. Is that not big enough of an operation to make using the GPU worthwhile? And there are only 2^8 = 256 of these. Or should I rather try to parallelize those 256 operations?

It can, but you need to be much more careful when launching such operations because any overhead is going to be very noticeable. And that’s where my second point comes in: calling a function that returns a scalar requires synchronization, killing performance.
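To make that concrete, here is a minimal sketch of the difference, using a random matrix of the same shape as yours (the names are only for illustration):

using CUDA
using LinearAlgebra

ψ = CUDA.rand(ComplexF32, 2^18, 2^4)

# dot returns a plain ComplexF32 on the host, so the CPU has to wait for the
# GPU to finish the reduction and copy the value back before this line returns:
s = @views ψ[:, 1] ⋅ ψ[:, 2]

# An operation whose result stays in GPU memory is merely enqueued; the host
# only blocks once, when the result is actually fetched:
c = CUDA.zeros(ComplexF32, 1)              # 1-element device buffer for the result
mul!(c, @view(ψ[:, 1])', @view(ψ[:, 2]))   # enqueued, returns almost immediately
Array(c)                                   # synchronizes here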

All this is very noticeable when profiling this code; I recommend you have a look at Benchmarking & profiling · CUDA.jl and give it a try yourself. Profiling your original code, you see:


Essentially, every iteration spawns a very short operation, dwarfed by the (surprisingly) slow memory copy done by CUBLAS’ dot routine.

The obvious fix is to use an in-place method that takes an output argument. CUDA.jl’s dot function doesn’t offer such an API, because Base doesn’t, but you can use mul!:

vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
mul!(vec_out, @view(A[:,i])', @view(A[:,j]))

This avoids the memory copy at every step, and makes it possible to asynchronously enqueue the mul! and only have to wait for it once. This is clear in the profiler, where the (cyan) highlighted sections show respectively where the operation was enqueued, and when it was executed:

This improves performance 200-fold on my machine. Do note that you need to add CUDA.@sync around your benchmarking code to force synchronization; or copy back to a CPU buffer which does this implicitly.
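In a benchmark that just looks like this (BenchmarkTools, with the names from your MWE):

@btime CUDA.@sync view_multiplication($ψ)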


Hmm, my code is now

function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = similar(ψ, d2, d2)
    for i in 1:d2
        for j in 1:d2
            vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
            mul!(vec_out, @view(ψ[:,i])', @view(ψ[:,j]))
        end
    end
    return out
end

and I still get the same run times and allocations / memory copies. Or did I misunderstand how you meant to use mul!?

using CUDA
using LinearAlgebra

function old(A::AbstractMatrix)
    d1, d2 = size(A)
    out = Array{eltype(A)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = A[:,i] ⋅ A[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end

function new(A::AbstractMatrix)
    d1, d2 = size(A)
    out = similar(A, d2, d2)
    for i in 1:d2
        for j in 1:d2
            vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
            mul!(vec_out, @view(A[:,i])', @view(A[:,j]))
        end
    end
    return out
end

julia> A = CUDA.rand(ComplexF32, 2^18, 2^4)

julia> @benchmark CUDA.@sync old($A)
BenchmarkTools.Trial: 
  memory estimate:  48.97 KiB
  allocs estimate:  2319
  --------------
  minimum time:     280.837 ms (0.00% GC)
  median time:      310.700 ms (0.00% GC)
  mean time:        306.896 ms (0.00% GC)
  maximum time:     310.896 ms (0.00% GC)
  --------------
  samples:          17
  evals/sample:     1

julia> @benchmark CUDA.@sync new($A)
BenchmarkTools.Trial: 
  memory estimate:  367.16 KiB
  allocs estimate:  16580
  --------------
  minimum time:     5.712 ms (0.00% GC)
  median time:      7.185 ms (0.00% GC)
  mean time:        7.666 ms (1.52% GC)
  maximum time:     47.268 ms (24.64% GC)
  --------------
  samples:          651
  evals/sample:     1

Use the available tools, i.e. the profiler, to figure out what’s happening on your system.


Interestingly, a reboot resulted in much faster timings for the initial case, so there was something weird going on with my GPU. Either way, profiling the mul! case again revealed that the operation is just too short to saturate the GPU, so the launch overhead dominates:

You’ll have to look into fusing these individual iterations, either through batched APIs or with something like Tullio.jl.
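For this particular loop the fusion is actually simple, since it just builds the Gram matrix ψ'ψ; a rough sketch (the function name fused is mine, not an existing API):

function fused(A::AbstractMatrix)
    d2 = size(A, 2)
    out = similar(A, d2, d2)
    mul!(out, A', A)   # one large GEMM instead of d2^2 tiny dot products
    return out
end

A single matrix-matrix product keeps the GPU busy and avoids the per-iteration launch overhead and synchronization discussed above.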


FWIW, calling CUBLAS’ dot using a device buffer (as I hinted at in the first post) doesn’t help either, as it still synchronizes.


Somehow the same code still allocates for every view on my machine; I’ll try rebooting and the profiling tools.

Seems GPU programming is harder than I thought :laughing: But thanks for the help so far!

This doesn’t matter; the allocations reported by @time are CPU allocations, and they are not the problem here. Again, use the available tools as documented: the profiler gives you a nice timeline, while CUDA.@time can report on GPU allocations.
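For example (illustrative invocations; see the Benchmarking & profiling docs for details):

CUDA.@time new(ψ)                 # like Base's @time, but also reports GPU allocations
CUDA.@profile CUDA.@sync new(ψ)   # marks the region to trace when running under an external profiler such as NSight Systems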

julia> ψ = CUDA.rand(ComplexF32, 2^18, 2^4)
julia> @benchmark CUDA.@sync new($ψ)

BenchmarkTools.Trial: 
  memory estimate:  1.00 GiB
  allocs estimate:  16816
  --------------
  minimum time:     373.148 ms (8.70% GC)
  median time:      391.971 ms (8.40% GC)
  mean time:        392.999 ms (8.49% GC)
  maximum time:     416.186 ms (8.81% GC)
  --------------
  samples:          13
  evals/sample:     1

It definitely looks like there are unnecessary allocations going on, especially compared to the case on the CPU:

julia> ψ = rand(ComplexF32, 2^18, 2^4)
julia> @benchmark new($ψ)

BenchmarkTools.Trial: 
  memory estimate:  99.27 KiB
  allocs estimate:  2342
  --------------
  minimum time:     99.333 ms (0.00% GC)
  median time:      104.113 ms (0.00% GC)
  mean time:        105.316 ms (0.00% GC)
  maximum time:     115.054 ms (0.00% GC)
  --------------
  samples:          48
  evals/sample:     1

Sure, but CPU allocations of CuArray objects are very fast (sub-1 µs) and unlikely to be the cause of the performance difference here.
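As a rough check (this only times creating the view wrapper on the CPU; exact numbers will vary):

using BenchmarkTools
@btime view($ψ, :, 1);   # creating the wrapper is cheap and CPU-only; no device memory is touched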