GPU (CUDA) vs CPU QR decomposition performance for a dense, wide complex matrix

Hello all,

I have an algorithm that requires computing the QR decomposition (though I’m only interested in R) of a wide, complex, double-precision matrix with a typical size of 400 rows x 16000 columns.
I have squeezed out all I can by removing allocations, and the profiler shows that most of the time is spent in qr!.
I figured I could use a GPU for that (maybe a misconception on my part), but in the end I get mostly worse results once the transfer cost is taken into account.
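
To make the target concrete, here is a minimal CPU-only sketch of what the R factor of a wide matrix looks like. The 4 x 10 size is an illustrative stand-in for 400 x 16000, and it uses the non-destructive qr rather than qr!; the variable names are not part of my actual code.

```julia
using LinearAlgebra

# Small stand-in for the 400 x 16000 case (sizes are illustrative only).
m, n = 4, 10
A = rand(ComplexF64, m, n)

F = qr(A)          # non-destructive variant of qr!
R = F.R            # m x n upper-trapezoidal factor
Q = Matrix(F.Q)    # m x m unitary factor

println("size(R) = ", size(R))
println("R is upper trapezoidal: ", istriu(R))
println("relative residual: ", norm(Q * R - A) / norm(A))
```

For a wide matrix, R has the same shape as the input, so returning R to the host moves as much data as sending A to the device.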

Are there obvious reasons I’m missing that would explain such a small difference?

Here are the small tests I ran on my machine. I also tested with a single-precision matrix to see the effect.

using CUDA, LinearAlgebra, BenchmarkTools

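# Copy A to the GPU with an explicit transfer, factorize there, and copy R back.
function R_QRGPU(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)        # host-side copy; included in the timing
    Agpu = CuMatrix(A)     # host-to-device transfer
    _,Rgpu = qr!(Agpu)     # in-place QR on the device
    R = Array(Rgpu)        # device-to-host transfer of R
    return R
end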

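# Same as R_QRGPU, but with unified memory instead of explicit transfers.
function R_QRGPU2(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)
    # Caution: cu is documented as an adaptor that may change the element type
    # (Float64 to Float32); if that also applies here, the A64 run is not in double precision.
    Agpu = cu(A, unified=true)
    _,Rgpu = qr!(Agpu)
    R = unsafe_wrap(Array,Rgpu)   # host view of the unified buffer, no copy
    return R
end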

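# CPU reference: in-place QR on the host.
function R_QRCPU(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)   # keep the caller's matrix intact, since qr! overwrites its input
    _,R = qr!(A)
    return R
end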

A64 = rand(ComplexF64,400,16000)
A32 = ComplexF32.(A64)
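
For scale, the storage of these two test matrices can be worked out directly, along with a rough transfer-time estimate. This is a back-of-the-envelope sketch: the value of assumed_bandwidth is a placeholder assumption, not something measured on my machine, and the helper names mib and transfer_ms are made up for the illustration.

```julia
# Problem dimensions from the test matrices.
m, n = 400, 16000

# Bytes needed to store the dense matrix for each element type.
bytes64 = m * n * sizeof(ComplexF64)   # 16 bytes per element
bytes32 = m * n * sizeof(ComplexF32)   # 8 bytes per element

mib(b) = b / 2^20
println("ComplexF64: ", round(mib(bytes64), digits=2), " MiB")
println("ComplexF32: ", round(mib(bytes32), digits=2), " MiB")

# ASSUMPTION: effective host<->device bandwidth in bytes per second.
# Placeholder value only; replace with a figure measured on the actual hardware.
assumed_bandwidth = 10e9

# A goes to the device and an m x n R comes back, so count two transfers.
transfer_ms(b) = 2 * b / assumed_bandwidth * 1e3
println("ComplexF64 round trip: ", round(transfer_ms(bytes64), digits=2), " ms")
println("ComplexF32 round trip: ", round(transfer_ms(bytes32), digits=2), " ms")
```

Comparing this estimate with the measured gap between the CPU and GPU timings gives a first idea of whether the transfers alone could account for it.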

The results are as follows:

julia> @benchmark R_QRCPU($A64)
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  209.627 ms … 948.529 ms  ┊ GC (min … max):  0.00% … 76.82%
 Time  (median):     244.860 ms               ┊ GC (median):     4.98%
 Time  (mean ± σ):   398.477 ms ± 285.616 ms  ┊ GC (mean ± σ):  43.94% ± 30.96%

  ▃█▃                                                            
  ███▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇▁▁▁▁▇ ▁
  210 ms           Histogram: frequency by time          949 ms <

 Memory estimate: 204.32 MiB, allocs estimate: 15.

julia> @benchmark R_QRCPU($A32)
BenchmarkTools.Trial: 30 samples with 1 evaluation per sample.
 Range (min … max):  109.482 ms … 778.549 ms  ┊ GC (min … max):  0.00% … 83.67%
 Time  (median):     115.809 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   167.645 ms ± 165.512 ms  ┊ GC (mean ± σ):  31.24% ± 22.85%

  █                                                              
  █▆▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃ ▁
  109 ms           Histogram: frequency by time          779 ms <

 Memory estimate: 102.16 MiB, allocs estimate: 15.

julia> @benchmark R_QRGPU($A64)
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min … max):  519.520 ms … 561.771 ms  ┊ GC (min … max): 3.49% … 0.00%
 Time  (median):     521.680 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   526.390 ms ±  12.855 ms  ┊ GC (mean ± σ):  0.68% ± 1.44%

  █  ▃                                                           
  █▇▇█▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  520 ms           Histogram: frequency by time          562 ms <

 Memory estimate: 97.67 MiB, allocs estimate: 661.

julia> @benchmark R_QRGPU($A32)
BenchmarkTools.Trial: 40 samples with 1 evaluation per sample.
 Range (min … max):  122.330 ms … 127.065 ms  ┊ GC (min … max): 2.62% … 3.51%
 Time  (median):     124.959 ms               ┊ GC (median):    3.31%
 Time  (mean ± σ):   124.996 ms ±   1.152 ms  ┊ GC (mean ± σ):  3.12% ± 0.44%

       █                    ▃▃█ █ ▃ ▃ ▃   ▃     ▃            ▃   
  ▇▁▁▁▁█▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁███▇█▁█▇█▁█▇▇▇█▁▇▇▁▁█▁▇▁▇▁▇▁▁▁▁▁▇█▇ ▁
  122 ms           Histogram: frequency by time          127 ms <

 Memory estimate: 48.84 MiB, allocs estimate: 664.

julia> @benchmark R_QRGPU2($A64)
BenchmarkTools.Trial: 6 samples with 1 evaluation per sample.
 Range (min … max):  661.982 ms …    1.386 s  ┊ GC (min … max):  0.00% … 50.65%
 Time  (median):     722.471 ms               ┊ GC (median):     0.80%
 Time  (mean ± σ):   847.320 ms ± 283.602 ms  ┊ GC (mean ± σ):  14.36% ± 20.43%

  █                                                              
  █▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  662 ms           Histogram: frequency by time          1.39 s <

 Memory estimate: 146.50 MiB, allocs estimate: 675.

julia> @benchmark R_QRGPU2($A32)
BenchmarkTools.Trial: 8 samples with 1 evaluation per sample.
 Range (min … max):  632.919 ms … 764.252 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     691.256 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   685.548 ms ±  45.239 ms  ┊ GC (mean ± σ):  0.46% ± 0.85%

  █ █  █                    ██      █  █                      █  
  █▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  633 ms           Histogram: frequency by time          764 ms <

 Memory estimate: 48.84 MiB, allocs estimate: 672.
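
To summarize the comparison, the sketch below takes the median times from the output above and expresses each GPU variant as a multiple of the CPU median at the same precision. Nothing new is measured here; the container medians_ms and the helper slowdown are names made up for this summary, and the F64/F32 labels refer to the input matrix (A64 or A32), not necessarily to the precision used on the device.

```julia
# Median times in milliseconds, copied from the benchmark output above.
medians_ms = (
    R_QRCPU  = (F64 = 244.860, F32 = 115.809),
    R_QRGPU  = (F64 = 521.680, F32 = 124.959),
    R_QRGPU2 = (F64 = 722.471, F32 = 691.256),
)

# Ratio of a variant's median to the CPU median for the same input matrix.
slowdown(name, prec) = medians_ms[name][prec] / medians_ms.R_QRCPU[prec]

for name in (:R_QRGPU, :R_QRGPU2), prec in (:F64, :F32)
    println(name, " ", prec, ": ", round(slowdown(name, prec), digits=2), "x the CPU median")
end
```

With these numbers, the explicit-transfer GPU path comes out at about 2.1x the CPU median for A64 and about 1.08x for A32, while the unified-memory path is slower still in both cases.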