Thousands of matrix multiplications using CuArray

joseph_bouvard · July 10, 2019, 2:19pm

Using CuArrays.jl, I was able to speed up the following code by about 7x compared to the analogous computation executed on the host. However, it is still too slow for my purposes (53.122 s running on a Tesla K40c with Julia 1.0). I have also included the profiling result below. Any suggestions on how to improve performance would be welcome.

using CuArrays
using CuArrays.CUBLAS
using CUDAdrv
using BenchmarkTools

function times(d_A::CuArray{Float32,2},
               d_B::CuArray{Float32,1},
               d_C::CuArray{Float32,1})

    CuArrays.CUBLAS.gemv!('N', 1f0, d_A, d_B, 0f0, d_C)
end

n1 = 2001; n2 = 400; n3 = 700

d_A = CuArray(rand(Float32,n1*n2,n3))
d_B = cu(rand(Float32,n3))
d_C = cu(zeros(Float32,n1*n2))

@btime for i=1:n1
	times(d_A, d_B, d_C)
end

==140719== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  214.082s      8004  26.747ms  26.672ms  26.807ms  void gemv2N_kernel_val<float, float, float, int=128, int=4, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor<float const >, cublasGemvTensor<float>, float>>(float, float, float const )
      API calls:  100.00%  212.814s      8004  26.588ms  5.0930us  26.9696s  cudaLaunchKernel
                    0.00%  2.5847ms      8004     322ns     145ns  463.36us  cudaGetLastError
                    0.00%  2.2860us         1  2.2860us  2.2860us  2.2860us  cuDeviceGetCount

jling · July 10, 2019, 3:17pm

Sorry, why do you want to run the times function 2001 times?

I guess a natural question is: how fast do you expect it to be? To me it looks like you’ve already moved everything onto the GPU and the arrays are already Float32, so I don’t see a faster route. Do you know how long it takes in, for example, TensorFlow on a GPU?

joseph_bouvard · July 10, 2019, 4:18pm

If I run the exact same matrix size just once instead of 2001 times, I get 10 microseconds (using BenchmarkTools). Is it very wrong to assume that the time would increase linearly with the number of computations?

I actually need to run this matrix-vector multiplication 5001 times or more; each multiplication corresponds to one time index in a wavefield propagation.

Thank you for your input

jling · July 10, 2019, 4:37pm

@btime repeats whatever you’re calculating many times and reports the minimum time (compile time is excluded). As standard practice, you should wrap things in a function and call @btime on that function.

Unless there are overheads in IO (or allocation), the total time should scale linearly with the number of multiplications, since your matrix size and type are fixed.
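
As a quick sanity check of the linear-scaling expectation, the profiler numbers from the first post can be plugged in directly. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the 288 GB/s figure is the nominal peak memory bandwidth I am assuming for a K40-class board, not something measured in this thread.

```julia
# Figures copied from the nvprof output in the first post.
avg_kernel_s = 26.747e-3      # average time of one gemv kernel, in seconds
n_calls      = 2001           # iterations of the benchmark loop

# If the cost is linear, the loop should take about n_calls * avg_kernel_s.
total_s = avg_kernel_s * n_calls
println("predicted loop time: ", round(total_s, digits=2), " s")

# A gemv has to read every element of A once, so A's size in bytes
# gives a lower bound on the memory traffic per call.
n1, n2, n3 = 2001, 400, 700
bytes_A = n1 * n2 * n3 * sizeof(Float32)

# Effective bandwidth achieved by the kernel, in GB/s.
eff_bw_gbs = bytes_A / avg_kernel_s / 1e9
println("effective bandwidth: ", round(eff_bw_gbs, digits=1), " GB/s")

# ASSUMPTION: nominal peak bandwidth of the card, taken from a spec sheet.
assumed_peak_gbs = 288.0
println("fraction of assumed peak: ", round(eff_bw_gbs / assumed_peak_gbs, digits=2))
```

The predicted loop time lands close to the 53.122 s reported in the first post, which suggests the per-call cost really is the kernel time and that the loop itself scales linearly.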

kristoffer.carlsson · July 10, 2019, 4:49pm

CUDA calls are not blocking so you need to synchronise (CuArrays.@sync) if you want to get any sensible timings.
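
A minimal sketch of what a synchronised benchmark might look like, reusing the array sizes from the first post. The helper name run_steps! is hypothetical, and I am assuming a CuArrays version that provides the CuArrays.@sync macro mentioned above; the loop is also wrapped in a function and the globals are interpolated with $, as suggested earlier in the thread.

```julia
using CuArrays
using BenchmarkTools

n1 = 2001; n2 = 400; n3 = 700

d_A = cu(rand(Float32, n1*n2, n3))
d_B = cu(rand(Float32, n3))
d_C = cu(zeros(Float32, n1*n2))

# Hypothetical helper: run `nsteps` matrix-vector products, overwriting d_C each time.
function run_steps!(d_C, d_A, d_B, nsteps)
    for i in 1:nsteps
        CuArrays.CUBLAS.gemv!('N', 1f0, d_A, d_B, 0f0, d_C)
    end
    return d_C
end

# CuArrays.@sync waits for the GPU to finish, so the measurement covers
# the kernel execution and not only the asynchronous launch.
@btime CuArrays.@sync run_steps!($d_C, $d_A, $d_B, 1)
@btime CuArrays.@sync run_steps!($d_C, $d_A, $d_B, 10)
```

With the synchronisation in place, the single-call time should come out near the roughly 26.7 ms per kernel that nvprof reports, rather than the 10 microseconds seen without it, which most likely measures only the kernel launch.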

maleadt · July 11, 2019, 2:49pm

Seeing how all of the time here is actually spent in the CUBLAS gemv implementation, there’s not much to speed up from the Julia side. If possible, you could use the batched gemm interface. There are only low-level wrappers for that call, though: CuArrays.jl/blas.jl at ef72134d426f17c55ba29393dc02d818e6478599 · JuliaGPU/CuArrays.jl · GitHub
