Thousands of matrix multiplications using CuArray

Using CuArrays.jl, I was able to speed up the following code by about 7x compared with the equivalent computation on the host. However, the result is still disappointing for my purposes (53.122 s on a Tesla K40c with Julia 1.0). I have also included the profiling result below. Any suggestions on how to improve performance would be welcome.

using CuArrays
using CuArrays.CUBLAS
using CUDAdrv
using BenchmarkTools

function times(d_A::CuArray{Float32,2}, 
	       d_B::CuArray{Float32,1}, 
	       d_C::CuArray{Float32,1})

	CuArrays.CUBLAS.gemv!('N',1f0,d_A,d_B,0f0,d_C)
end

n1 = 2001; n2 = 400; n3 = 700

d_A = CuArray(rand(Float32,n1*n2,n3))
d_B = cu(rand(Float32,n3))
d_C = cu(zeros(Float32,n1*n2))

@btime for i=1:n1
	times(d_A, d_B, d_C)
end
==140719== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  214.082s      8004  26.747ms  26.672ms  26.807ms  void gemv2N_kernel_val<float, float, float, int=128, int=4, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor<float const >, cublasGemvTensor<float>, float>>(float, float, float const )
      API calls:  100.00%  212.814s      8004  26.588ms  5.0930us  26.9696s  cudaLaunchKernel
                    0.00%  2.5847ms      8004     322ns     145ns  463.36us  cudaGetLastError
                    0.00%  2.2860us         1  2.2860us  2.2860us  2.2860us  cuDeviceGetCount

Sorry, why do you want to call times 2001 times?

I guess a natural question is: how fast do you expect it to be? To me it looks like you’ve already moved everything onto the GPU and the data is already Float32, so I don’t see a faster route. Do you know how long it takes in, for example, TensorFlow on a GPU?

If I run a matrix of exactly the same size just once instead of 2001 times, I get 10 microseconds (using BenchmarkTools). Is it wrong to expect the time to increase linearly with the number of computations?

I actually need to run this matrix-vector multiplication 5001 times or more; the loop index is the time index in a wavefield propagation.

Thank you for your input

@btime repeats whatever you’re calculating many times and reports the minimum time (compile time is excluded). As standard practice, you should wrap things in a function and call @btime on that function.

Unless there is overhead from I/O (or allocation), the total time is linear in the number of multiplications, since your matrix size and type are fixed.
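
A quick arithmetic check with the figures already posted in the profile suggests the scaling is in fact linear. The sketch below only reuses numbers from the question; the variable names are made up for illustration, and the bandwidth figure assumes each gemv reads the whole matrix exactly once.

```julia
# Numbers copied from the nvprof output and the original post.
avg_kernel_ms = 26.747      # average gemv2N_kernel_val time per call
n_steps       = 2001        # loop iterations (n1)
total_calls   = 8004        # kernel launches recorded by the profiler

# If time is linear in the number of calls, one pass over the loop costs:
predicted_s = avg_kernel_ms * n_steps / 1000
println("predicted loop time: ", round(predicted_s, digits=2), " s")

# How many times the profiler saw the whole loop run:
println("loop executions in profile: ", total_calls / n_steps)

# Assumption: every gemv has to read the full Float32 matrix once.
matrix_bytes  = 2001 * 400 * 700 * sizeof(Float32)
bandwidth_GBs = matrix_bytes / (avg_kernel_ms / 1000) / 1e9
println("effective bandwidth: ", round(bandwidth_GBs, digits=1), " GB/s")
```

The predicted loop time comes out at roughly 53.5 s, close to the 53.122 s reported, so the per-call cost is about 27 ms rather than 10 microseconds. The matrix is about 2.24 GB, which is worth comparing against the card's memory bandwidth from its datasheet.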

CUDA calls are not blocking so you need to synchronise (CuArrays.@sync) if you want to get any sensible timings.
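
Something along these lines should give more meaningful numbers. It is a minimal sketch, not a tuned benchmark: run_steps! is a hypothetical helper name, the sizes are deliberately small placeholders rather than the ones from the question, and it assumes a working CUDA GPU with CuArrays and BenchmarkTools installed.

```julia
using CuArrays, BenchmarkTools

# Hypothetical helper: run nsteps matrix-vector products back to back.
function run_steps!(d_C, d_A, d_B, nsteps)
    for _ in 1:nsteps
        CuArrays.CUBLAS.gemv!('N', 1f0, d_A, d_B, 0f0, d_C)
    end
    return d_C
end

# Small placeholder sizes so the sketch runs quickly.
m, n, nsteps = 4096, 700, 100
d_A = cu(rand(Float32, m, n))
d_B = cu(rand(Float32, n))
d_C = cu(zeros(Float32, m))

# `$` interpolates the globals so @btime does not time global-variable access;
# CuArrays.@sync blocks until the queued GPU work has actually finished.
@btime CuArrays.@sync run_steps!($d_C, $d_A, $d_B, $nsteps)
```

Dropping the CuArrays.@sync from the last line would mostly time how long it takes to queue the kernels, which is probably where very small single-call timings come from.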

Seeing how all the time is actually spent in the CUBLAS gemv implementation, there’s not much to speed up from the Julia side. If possible, you could use the batched gemm interface. There are only low-level wrappers for that call though: CuArrays.jl/blas.jl at ef72134d426f17c55ba29393dc02d818e6478599 · JuliaGPU/CuArrays.jl · GitHub
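
To illustrate why batching can help, here is a small CPU-only sketch of a related and simpler idea: when several input vectors are known up front, storing them as the columns of one matrix turns many matrix-vector products into a single matrix-matrix product. This is not the batched gemm wrapper itself, the sizes and names are hypothetical, and it uses plain Arrays with LinearAlgebra so it runs without a GPU.

```julia
using LinearAlgebra

m, n, k = 6, 4, 3                 # tiny hypothetical sizes
A  = rand(Float32, m, n)
Bs = rand(Float32, n, k)          # k independent input vectors stored as columns

# One matrix-vector product per column (the pattern in the question).
C_loop = zeros(Float32, m, k)
for j in 1:k
    mul!(view(C_loop, :, j), A, view(Bs, :, j))
end

# The same result from a single matrix-matrix product.
C_gemm = zeros(Float32, m, k)
mul!(C_gemm, A, Bs)

println(C_loop ≈ C_gemm)
```

The single product reads A once instead of k times, which matters when A is large. The catch is that this only applies when the inputs are independent; if each time step needs the output of the previous one, the vectors cannot be stacked this way.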