Dot-product of CuArray views is slow

I have a fairly simple function that iterates over the columns of a 2-dimensional CuArray, takes inner products of these columns, and writes them to another matrix. I noticed that running this code on the GPU gave no speed-up, even though simply taking the inner product of two CuArrays was much faster on my machine than with Arrays. I assume that the problem is the views.

See the following MWE:

using CUDA
using BenchmarkTools
using LinearAlgebra

function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = Array{eltype(ψ)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = ψ[:,i] ⋅ ψ[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end

Running this on a (2^18 x 2^4) matrix, I get on the CPU (the second run, after compilation)

ψ = rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.048713 seconds (1.41 k allocations: 35.547 KiB)

and on the GPU (also the second run, after compilation)

ψ = CUDA.rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.226695 seconds (3.94 k allocations: 544.086 MiB, 12.08% gc time)

which is much slower than I hoped for and also allocates much more memory. Is there a way to take inner products of views of CuArrays without needing to allocate memory for the views?


Why do you assume that?

You are launching many short operations. That’s an inefficient way of programming a GPU, because it will be hard to saturate the device. Furthermore, each of your operations reduces to a scalar, so it effectively synchronizes execution which further reduces the speed-up you can get from using a GPU.

Each of the innermost operations is an inner product of two length-2^18 vectors. Is that not big enough of an operation to make using the GPU worthwhile? And there are only 2^8 = 256 of these. Or should I rather try to parallelize those 256 operations?

It can, but you need to be much more careful when launching such operations because any overhead is going to be very noticeable. And that’s where my second point comes in: calling a function that returns a scalar requires synchronization, killing performance.
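To make that concrete, here is a minimal sketch of the difference, using a random matrix of the same shape as yours (the names are only for illustration):

using CUDA
using LinearAlgebra

ψ = CUDA.rand(ComplexF32, 2^18, 2^4)

# dot returns a plain ComplexF32 on the host, so the CPU has to wait for the
# GPU to finish the reduction and copy the value back before this line returns:
s = @views ψ[:, 1] ⋅ ψ[:, 2]

# An operation whose result stays in GPU memory is merely enqueued; the host
# only blocks once, when the result is actually fetched:
c = CUDA.zeros(ComplexF32, 1)              # 1-element device buffer for the result
mul!(c, @view(ψ[:, 1])', @view(ψ[:, 2]))   # enqueued, returns almost immediately
Array(c)                                   # synchronizes here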

All this is very noticeable when profiling this code; I recommend you have a look at Benchmarking & profiling · CUDA.jl and give it a try yourself. Profiling your original code, you see:


Essentially, every iteration spawns a very short operation, dwarfed by the (surprisingly) slow memory copy done by CUBLAS’ dot routine.

The obvious fix is to use an in-place method that takes an output argument. CUDA.jl’s dot function doesn’t offer such an API, because Base doesn’t, but you can use mul!:

vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
mul!(vec_out, @view(A[:,i])', @view(A[:,j]))

This avoids the memory copy at every step, and makes it possible to asynchronously enqueue the mul! and only have to wait for it once. This is clear in the profiler, where the (cyan) highlighted sections show respectively where the operation was enqueued, and when it was executed:

This improves performance 200-fold on my machine. Do note that you need to add CUDA.@sync around your benchmarking code to force synchronization; or copy back to a CPU buffer which does this implicitly.
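In a benchmark that just looks like this (BenchmarkTools, with the names from your MWE):

@btime CUDA.@sync view_multiplication($ψ)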


Hmm, my code is now

function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = similar(ψ, d2, d2)
    for i in 1:d2
        for j in 1:d2
            vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
            mul!(vec_out, @view(ψ[:,i])', @view(ψ[:,j]))
        end
    end
    return out
end

and I still get the same run times and allocations / memory copies. Or did I misunderstand how you meant to use mul!?

using CUDA
using LinearAlgebra

function old(A::AbstractMatrix)
    d1, d2 = size(A)
    out = Array{eltype(A)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = A[:,i] ⋅ A[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end

function new(A::AbstractMatrix)
    d1, d2 = size(A)
    out = similar(A, d2, d2)
    for i in 1:d2
        for j in 1:d2
            vec_out = reshape(@view(out[i,j]), 1)   # CUDA.jl's mul! doesn't like 0d
            mul!(vec_out, @view(A[:,i])', @view(A[:,j]))
        end
    end
    return out
end

julia> A = CUDA.rand(ComplexF32, 2^18, 2^4)

julia> @benchmark CUDA.@sync old($A)
BenchmarkTools.Trial: 
  memory estimate:  48.97 KiB
  allocs estimate:  2319
  --------------
  minimum time:     280.837 ms (0.00% GC)
  median time:      310.700 ms (0.00% GC)
  mean time:        306.896 ms (0.00% GC)
  maximum time:     310.896 ms (0.00% GC)
  --------------
  samples:          17
  evals/sample:     1

julia> @benchmark CUDA.@sync new($A)
BenchmarkTools.Trial: 
  memory estimate:  367.16 KiB
  allocs estimate:  16580
  --------------
  minimum time:     5.712 ms (0.00% GC)
  median time:      7.185 ms (0.00% GC)
  mean time:        7.666 ms (1.52% GC)
  maximum time:     47.268 ms (24.64% GC)
  --------------
  samples:          651
  evals/sample:     1

Use the available tools, i.e. the profiler, to figure out what’s happening on your system.


Interestingly, a reboot resulted in much faster timings for the initial case, so there was something weird going on with my GPU. Either way, profiling the mul! case again revealed that the operation is just too short to saturate the GPU, so the launch overhead dominates:

You’ll have to look into fusing these individual iterations, either through batched APIs or with something like Tullio.jl.
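For this particular loop the fusion is actually simple, since it just builds the Gram matrix ψ'ψ; a rough sketch (the function name fused is mine, not an existing API):

function fused(A::AbstractMatrix)
    d2 = size(A, 2)
    out = similar(A, d2, d2)
    mul!(out, A', A)   # one large GEMM instead of d2^2 tiny dot products
    return out
end

A single matrix-matrix product keeps the GPU busy and avoids the per-iteration launch overhead and synchronization discussed above.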


FWIW, calling CUBLAS’ dot using a device buffer (as I hinted at in the first post) doesn’t help either, as it still synchronizes.


Somehow the same code still allocates for every view on my machine; I’ll try rebooting and the profiling tools.

Seems GPU programming is harder than I thought :laughing: But thanks for the help so far!

This doesn’t matter; the allocations reported by @time are CPU allocations, and they are not the problem here. Again, use the available tools as documented: the profiler gives you a nice timeline, while CUDA.@time can report on GPU allocations.
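For example (illustrative invocations; see the Benchmarking & profiling docs for details):

CUDA.@time new(ψ)                 # like Base's @time, but also reports GPU allocations
CUDA.@profile CUDA.@sync new(ψ)   # marks the region to trace when running under an external profiler such as NSight Systems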

julia> ψ = CUDA.rand(ComplexF32, 2^18, 2^4)
julia> @benchmark CUDA.@sync new($ψ)

BenchmarkTools.Trial: 
  memory estimate:  1.00 GiB
  allocs estimate:  16816
  --------------
  minimum time:     373.148 ms (8.70% GC)
  median time:      391.971 ms (8.40% GC)
  mean time:        392.999 ms (8.49% GC)
  maximum time:     416.186 ms (8.81% GC)
  --------------
  samples:          13
  evals/sample:     1

It definitely looks like there are unnecessary allocations going on, especially compared to the case on the CPU:

julia> ψ = rand(ComplexF32, 2^18, 2^4)
julia> @benchmark new($ψ)

BenchmarkTools.Trial: 
  memory estimate:  99.27 KiB
  allocs estimate:  2342
  --------------
  minimum time:     99.333 ms (0.00% GC)
  median time:      104.113 ms (0.00% GC)
  mean time:        105.316 ms (0.00% GC)
  maximum time:     115.054 ms (0.00% GC)
  --------------
  samples:          48
  evals/sample:     1

Sure, but CPU allocations of CuArray objects are very fast (sub-1 µs) and unlikely to be the cause of the performance difference here.
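As a rough check (this only times creating the view wrapper on the CPU; exact numbers will vary):

using BenchmarkTools
@btime view($ψ, :, 1);   # creating the wrapper is cheap and CPU-only; no device memory is touched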