Hello all,
I have an algorithm that requires computing the QR decomposition (though I’m only interested in R) of a wide, complex, double-precision matrix with a typical size of 400 rows x 16000 columns.
I have squeezed out all I can by removing allocations, and the profiler shows that most of the time is now spent in qr!.
I figured I could make use of a GPU for that (maybe a misconception on my part), but in the end I get mostly worse results once the transfer cost is taken into account.
Are there obvious reasons I’m missing that would explain why the GPU helps so little?
Here are the small tests I ran on my machine. I also tested with a single-precision matrix to see the effect.
using CUDA, LinearAlgebra, BenchmarkTools

# GPU path with explicit transfers: upload, factorize, download R.
function R_QRGPU(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)
    Agpu = CuMatrix(A)
    _, Rgpu = qr!(Agpu)
    R = Array(Rgpu)
    return R
end

# GPU path with unified memory.
# Note: as far as I know, cu() may convert double-precision elements to single precision.
function R_QRGPU2(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)
    Agpu = cu(A, unified=true)
    _, Rgpu = qr!(Agpu)
    R = unsafe_wrap(Array, Rgpu)
    return R
end

# CPU reference.
function R_QRCPU(A::AbstractMatrix{T}) where {T<:Number}
    A = deepcopy(A)
    _, R = qr!(A)
    return R
end

A64 = rand(ComplexF64, 400, 16000)
A32 = ComplexF32.(A64)
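For scale, the amount of data that has to cross to the GPU can be worked out directly from the matrix dimensions. This is just arithmetic; `matrix_mib` is a hypothetical helper name, not something from CUDA.jl or LinearAlgebra.

```julia
# Hypothetical helper: storage of a dense m-by-n matrix with element type T, in MiB.
matrix_mib(::Type{T}, m, n) where {T} = sizeof(T) * m * n / 2^20

matrix_mib(ComplexF64, 400, 16000)  # 97.65625
matrix_mib(ComplexF32, 400, 16000)  # 48.828125
```

So each double-precision call moves roughly 98 MiB to the device, and half of that in single precision.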
The benchmark results are as follows:
julia> @benchmark R_QRCPU($A64)
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
Range (min … max): 209.627 ms … 948.529 ms ┊ GC (min … max): 0.00% … 76.82%
Time (median): 244.860 ms ┊ GC (median): 4.98%
Time (mean ± σ): 398.477 ms ± 285.616 ms ┊ GC (mean ± σ): 43.94% ± 30.96%
▃█▃
███▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇▁▁▁▁▇ ▁
210 ms Histogram: frequency by time 949 ms <
Memory estimate: 204.32 MiB, allocs estimate: 15.
julia> @benchmark R_QRCPU($A32)
BenchmarkTools.Trial: 30 samples with 1 evaluation per sample.
Range (min … max): 109.482 ms … 778.549 ms ┊ GC (min … max): 0.00% … 83.67%
Time (median): 115.809 ms ┊ GC (median): 0.00%
Time (mean ± σ): 167.645 ms ± 165.512 ms ┊ GC (mean ± σ): 31.24% ± 22.85%
█
█▆▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃ ▁
109 ms Histogram: frequency by time 779 ms <
Memory estimate: 102.16 MiB, allocs estimate: 15.
julia> @benchmark R_QRGPU($A64)
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
Range (min … max): 519.520 ms … 561.771 ms ┊ GC (min … max): 3.49% … 0.00%
Time (median): 521.680 ms ┊ GC (median): 0.00%
Time (mean ± σ): 526.390 ms ± 12.855 ms ┊ GC (mean ± σ): 0.68% ± 1.44%
█ ▃
█▇▇█▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
520 ms Histogram: frequency by time 562 ms <
Memory estimate: 97.67 MiB, allocs estimate: 661.
julia> @benchmark R_QRGPU($A32)
BenchmarkTools.Trial: 40 samples with 1 evaluation per sample.
Range (min … max): 122.330 ms … 127.065 ms ┊ GC (min … max): 2.62% … 3.51%
Time (median): 124.959 ms ┊ GC (median): 3.31%
Time (mean ± σ): 124.996 ms ± 1.152 ms ┊ GC (mean ± σ): 3.12% ± 0.44%
█ ▃▃█ █ ▃ ▃ ▃ ▃ ▃ ▃
▇▁▁▁▁█▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁███▇█▁█▇█▁█▇▇▇█▁▇▇▁▁█▁▇▁▇▁▇▁▁▁▁▁▇█▇ ▁
122 ms Histogram: frequency by time 127 ms <
Memory estimate: 48.84 MiB, allocs estimate: 664.
julia> @benchmark R_QRGPU2($A64)
BenchmarkTools.Trial: 6 samples with 1 evaluation per sample.
Range (min … max): 661.982 ms … 1.386 s ┊ GC (min … max): 0.00% … 50.65%
Time (median): 722.471 ms ┊ GC (median): 0.80%
Time (mean ± σ): 847.320 ms ± 283.602 ms ┊ GC (mean ± σ): 14.36% ± 20.43%
█
█▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
662 ms Histogram: frequency by time 1.39 s <
Memory estimate: 146.50 MiB, allocs estimate: 675.
julia> @benchmark R_QRGPU2($A32)
BenchmarkTools.Trial: 8 samples with 1 evaluation per sample.
Range (min … max): 632.919 ms … 764.252 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 691.256 ms ┊ GC (median): 0.00%
Time (mean ± σ): 685.548 ms ± 45.239 ms ┊ GC (mean ± σ): 0.46% ± 0.85%
█ █ █ ██ █ █ █
█▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
633 ms Histogram: frequency by time 764 ms <
Memory estimate: 48.84 MiB, allocs estimate: 672.
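To see how much of the GPU time is transfer and how much is the factorization itself, the three stages could be timed separately. The following is only a rough sketch, assuming a working CUDA device; `gpu_qr_stage_times` is a hypothetical helper, and it relies on `CUDA.synchronize()` to wait for the GPU before each timestamp is taken.

```julia
using CUDA, LinearAlgebra

# Hypothetical helper: wall-clock time (in seconds) of each stage of the GPU path.
function gpu_qr_stage_times(A::AbstractMatrix)
    t0 = time_ns()
    Agpu = CuMatrix(A)          # host -> device copy
    CUDA.synchronize()
    t1 = time_ns()
    _, Rgpu = qr!(Agpu)         # factorization on the device (overwrites Agpu)
    CUDA.synchronize()
    t2 = time_ns()
    R = Array(Rgpu)             # device -> host copy of R
    t3 = time_ns()
    return (upload = (t1 - t0) / 1e9,
            factorize = (t2 - t1) / 1e9,
            download = (t3 - t2) / 1e9)
end

# Call once first so compilation is not included in the measurement.
if CUDA.functional()
    gpu_qr_stage_times(rand(ComplexF64, 40, 160))
    @show gpu_qr_stage_times(rand(ComplexF64, 400, 16000))
end
```

If `factorize` dominates, the transfers are not the bottleneck and the double-precision throughput of the GPU is the more likely explanation.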