Using CuArrays.jl I was able to speed up the following code by about 7x compared with the analogous computation executed on the host. However, it is still too slow for my purposes (53.122 s on a Tesla K40c with Julia 1.0). I have also inserted the profiling result below. Any suggestions on how to improve performance would be welcome.
using CuArrays
using CuArrays.CUBLAS
using CUDAdrv
using BenchmarkTools
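# NOTE: the signature is cut off in the post. The remaining arguments and the body are an
# assumption, inferred from the call times(d_A, d_B, d_C) and the gemv2N kernel in the profile.
function times(d_A::CuArray{Float32,2}, d_B::CuArray{Float32,1}, d_C::CuArray{Float32,1})
    CUBLAS.gemv!('N', 1f0, d_A, d_B, 0f0, d_C)   # d_C = d_A * d_B
end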
n1 = 2001; n2 = 400; n3 = 700
d_A = CuArray(rand(Float32,n1*n2,n3))
d_B = cu(rand(Float32,n3))
d_C = cu(zeros(Float32,n1*n2))
@btime for i=1:n1
    times(d_A, d_B, d_C)
end
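For reference, here is a minimal host-side sketch of what each loop iteration appears to compute. It assumes that `times` is a plain matrix-vector product `d_C = d_A * d_B` (the gemv kernel in the profile suggests this, but the function body is not shown in the post). The name `host_times!` and the reduced sizes are hypothetical, chosen only so the sketch runs quickly on a CPU.

```julia
using LinearAlgebra

# Reduced, hypothetical sizes; the post uses n1 = 2001, n2 = 400, n3 = 700.
n1, n2, n3 = 5, 4, 7

A = rand(Float32, n1 * n2, n3)   # same shape as d_A: (n1*n2) x n3
B = rand(Float32, n3)            # same shape as d_B
C = zeros(Float32, n1 * n2)      # same shape as d_C

# Assumed equivalent of `times`: in-place matrix-vector product C = A * B.
host_times!(C, A, B) = mul!(C, A, B)

# Same loop structure as the benchmark: n1 identical products.
for i in 1:n1
    host_times!(C, A, B)
end
```

Under that assumption every iteration reads the whole matrix once, so the cost per call is dominated by the size of `d_A` rather than by the vectors.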
==140719== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  214.082s      8004  26.747ms  26.672ms  26.807ms  void gemv2N_kernel_val<float, float, float, int=128, int=4, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensor<float const >, cublasGemvTensor<float>, float>>(float, float, float const )
      API calls:  100.00%  212.814s      8004  26.588ms  5.0930us  26.9696s  cudaLaunchKernel
                    0.00%  2.5847ms      8004     322ns     145ns  463.36us  cudaGetLastError
                    0.00%  2.2860us         1  2.2860us  2.2860us  2.2860us  cuDeviceGetCount
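A rough back-of-the-envelope reading of these numbers, as a sketch only: it takes the sizes and the average kernel time from the post and assumes that each gemv call reads the full matrix once, reads the input vector, and writes the output vector. The variable names are illustrative.

```julia
# Figures taken from the post.
n1, n2, n3 = 2001, 400, 700
avg_kernel_s = 26.747e-3    # "Avg" column of the gemv kernel, in seconds
kernel_calls = 8004         # "Calls" column of the gemv kernel

# Assumed memory traffic per gemv call.
bytes_A   = n1 * n2 * n3 * sizeof(Float32)      # matrix, read once
bytes_vec = (n3 + n1 * n2) * sizeof(Float32)    # input vector read + output vector written

effective_gbs = (bytes_A + bytes_vec) / avg_kernel_s / 1e9
loop_time_s   = n1 * avg_kernel_s               # kernel time for one pass of the loop
loop_passes   = kernel_calls ÷ n1               # how many times the loop was executed

println("matrix size:         ", round(bytes_A / 1e9, digits = 2), " GB")
println("effective bandwidth: ", round(effective_gbs, digits = 1), " GB/s")
println("kernel time per loop pass: ", round(loop_time_s, digits = 1), " s")
println("loop passes profiled: ", loop_passes)
```

Under these assumptions the matrix is about 2.24 GB, each call moves it at roughly 84 GB/s, and 2001 kernel executions alone account for about 53.5 s, which is close to the 53.122 s reported. That suggests the run time is dominated by the gemv kernel itself rather than by launch overhead, and that the kernel is well below the 288 GB/s nominal peak memory bandwidth quoted for the K40.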