CUDA v2 - performance regression on matrix multiplication

Following upgrade of Flux from CUDA 1 to CUDA 2, operations running on CUDA appear to allocate more as well as being slightly slower.

For example, on a LSTM, with CUDA 2.1 :

8.572 s (16390042 allocations: 3.92 GiB)

CUDA 1.3.3:

8.102 s (18654053 allocations: 580.29 MiB)

Could it be that the memory allocations from CUDNN calls were simply not captured previously?

With a more basic Dense model, the allocations with CUDA 2.1 are now slightly lower, but is clearly slower than with 1.3.3, which seems to hint to allocations not captured by CUDNN call, but still pointing to performance regression with newest CUDA 2.1.

CUDA 2.1:

245.949 ms (119956 allocations: 3.53 MiB)

CUDA 1.3.3:

144.685 ms (127057 allocations: 3.74 MiB)

To reproduce, I used Flux v0.11.2 (comes with CUDA 2.1). And for the CUDA 1.3.3 comparison, I used the same Flux v0.11.2, removed CUDA and add CUDA#v1.3.3.

Code for the Dense example:

using Revise
using Flux
using Statistics: mean
using Random: seed!

# illustrate diverging behavior of GPU execution
seed!(123)
feat = 64
hidden = 256
batch_size = 1024

m_cpu = Chain(Dense(feat, hidden, relu),
    Dense(hidden, hidden, relu),
    Dense(hidden, 1))

X = rand(Float32, feat, batch_size)
Y = rand(Float32, batch_size) ./ 10

m_gpu = m_cpu |> gpu
X_gpu = gpu(X)
Y_gpu = gpu(Y)

θ_gpu = Flux.params(m_gpu)

function loss_gpu(x, y)
    l = mean((m_gpu(x) .- y).^2)
    return l
end

opt_gpu = Descent(1e-3)

function speed_gpu(n=10)
    for i in 1:n
        Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)
    end
    return loss_gpu(X_gpu, Y_gpu)
end

using BenchmarkTools
@btime speed_gpu(100)
4 Likes

At a more granular level, the slowdown appears right at the matrix multiplication level:

using CUDA
using BenchmarkTools
x1, x2 = CuArray(rand(Float32, 128, 256)), CuArray(rand(Float32, 256, 1024))

function mul(x,y)
  x * y
end

@btime mul(x1, x2)

CUDA 2.1:
14.899 μs (19 allocations: 384 bytes)

CUDA 1.3.3:
10.999 μs (6 allocations: 320 bytes)

Have you tried adding @sync before the benchmarked expression?

FYI:

Indeed, @sync is the culprit here. On 1.3.3:

julia> @benchmark $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  320 bytes
  allocs estimate:  6
  --------------
  minimum time:     5.574 μs (0.00% GC)
  median time:      15.394 μs (0.00% GC)
  mean time:        15.240 μs (0.82% GC)
  maximum time:     3.256 ms (22.12% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  416 bytes
  allocs estimate:  11
  --------------
  minimum time:     41.696 μs (0.00% GC)
  median time:      48.476 μs (0.00% GC)
  mean time:        49.554 μs (0.00% GC)
  maximum time:     225.519 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

On master:

julia> @benchmark $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  384 bytes
  allocs estimate:  19
  --------------
  minimum time:     8.850 μs (0.00% GC)
  median time:      14.673 μs (0.00% GC)
  mean time:        14.828 μs (1.08% GC)
  maximum time:     7.586 ms (21.17% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  480 bytes
  allocs estimate:  24
  --------------
  minimum time:     35.176 μs (0.00% GC)
  median time:      53.785 μs (0.00% GC)
  mean time:        54.383 μs (0.00% GC)
  maximum time:     241.566 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

There’s some more complicated run-time dispatch happening now, which explains the higher execution time without @sync, but the actual performance is higher.

3 Likes

My bad, I effectively missed some benchmarking best practices.
However, after those fixes, I’m still observing a regression:

CUDA 1.3.3:

julia> @benchmark $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  320 bytes
  allocs estimate:  6
  --------------
  minimum time:     10.800 μs (0.00% GC)
  median time:      11.300 μs (0.00% GC)
  mean time:        24.375 μs (0.96% GC)
  maximum time:     12.055 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  416 bytes
  allocs estimate:  11
  --------------
  minimum time:     130.599 μs (0.00% GC)
  median time:      136.699 μs (0.00% GC)
  mean time:        141.856 μs (0.19% GC)
  maximum time:     11.526 ms (23.41% GC)
  --------------
  samples:          10000
  evals/sample:     1
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v1.3.3 `https://github.com/JuliaGPU/CUDA.jl.git#v1.3.3`
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.5
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.9
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test

CUDA 2.1.0:

julia> @benchmark $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  384 bytes
  allocs estimate:  19
  --------------
  minimum time:     14.800 μs (0.00% GC) 
  median time:      15.700 μs (0.00% GC) 
  mean time:        118.945 μs (0.23% GC)
  maximum time:     14.182 ms (0.00% GC) 
  --------------
  samples:          10000
  evals/sample:     1
julia> @benchmark CUDA.@sync $x1 * $x2
BenchmarkTools.Trial: 
  memory estimate:  480 bytes
  allocs estimate:  24
  --------------
  minimum time:     230.599 μs (0.00% GC)
  median time:      238.301 μs (0.00% GC)
  mean time:        243.557 μs (0.11% GC)
  maximum time:     11.614 ms (23.76% GC)
  --------------
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v2.1.0
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.5
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.9
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test

I’m running on a GTX1660.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.71       Driver Version: 456.71       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 166... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8    15W /  N/A |    153MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Could you enable logging to see which functions are called? JULIA_DEBUG=CUBLAS should do it on 2.0, but I don’t think 1.3 has that, so:

julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]

julia> CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)

julia> x1, x2 = CuArray(rand(Float32, 128, 256)), CuArray(rand(Float32, 256, 1024))
(Float32[0.02894938 0.659762 … 0.08572972 0.21148372; 0.9676044 0.77993274 … 0.8160199 0.9195348; … ; 0.71437204 0.6493443 … 0.19146216 0.315194; 0.25043893 0.5365975 … 0.78834224 0.004127383], Float32[0.4194752 0.14759946 … 0.545315 0.8390795; 0.120224 0.84428596 … 0.89420867 0.11571264; … ; 0.6441511 0.37211454 … 0.6095965 0.7840842; 0.6322019 0.35158968 … 0.61961055 0.7781992])

julia> x1 * x2
I! cuBLAS (v11.1) function cublasStatus_t cublasCreate_v2(cublasContext**) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x7ff8256ba800)
i! Time: 2020-11-09T18:07:59 elapsed from start 0.133333 minutes or 8.000000 seconds
i!Process=121909; Thread=140705011352128; GPU=0; Handle=POINTER (IN HEX:0x(nil))
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.1) function cublasStatus_t cublasSetStream_v2(cublasHandle_t, cudaStream_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xa936810)
i!  streamId: type=SOME TYPE; val=POINTER (IN HEX:0x0x2)
i! Time: 2020-11-09T18:08:00 elapsed from start 0.150000 minutes or 9.000000 seconds
i!Process=121909; Thread=140705011352128; GPU=0; Handle=POINTER (IN HEX:0x0xa936810); StreamId=POINTER (IN HEX:0x(nil)) (defaultStream); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.1) function cublasStatus_t cublasSetWorkspace_v2(cublasHandle_t, void*, size_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xa936810)
i!  workspace: type=void; val=POINTER (IN HEX:0x0x7ff707800000)
i!  workspaceSizeInBytes: type=SOME TYPE; val=4194304
i! Time: 2020-11-09T18:08:00 elapsed from start 0.150000 minutes or 9.000000 seconds
i!Process=121909; Thread=140705011352128; GPU=0; Handle=POINTER (IN HEX:0x0xa936810); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.1) function cublasStatus_t cublasSetMathMode(cublasHandle_t, cublasMath_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xa936810)
i!  mode: type=cublasMath_t; val=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION(17)
i! Time: 2020-11-09T18:08:00 elapsed from start 0.150000 minutes or 9.000000 seconds
i!Process=121909; Thread=140705011352128; GPU=0; Handle=POINTER (IN HEX:0x0xa936810); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.1) function cublasStatus_t cublasGemmEx(cublasHandle_t, cublasOperation_t, cublasOperation_t, int, int, int, const void*, const void*, cudaDataType_t, int, const void*, cudaDataType_t, int, const void*, void*, cudaDataType_t, int, cublasComputeType_t, cublasGemmAlgo_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xa936810)
i!  transa: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  transb: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  m: type=int; val=128
i!  n: type=int; val=1024
i!  k: type=int; val=256
i!  alpha: type=void; val=POINTER (IN HEX:0x0x7ff821de0490)
i!  A: type=void; val=POINTER (IN HEX:0x0x7ff712c00000)
i!  Atype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  lda: type=int; val=128
i!  B: type=void; val=POINTER (IN HEX:0x0x7ff712c20000)
i!  Btype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldb: type=int; val=256
i!  beta: type=void; val=POINTER (IN HEX:0x0x7ff821de04a0)
i!  C: type=void; val=POINTER (IN HEX:0x0x7ff712d20000)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldc: type=int; val=128
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F(68)
i!  algo: type=SOME TYPE; val=CUBLAS_GEMM_DEFAULT(-1)
i! Time: 2020-11-09T18:08:00 elapsed from start 0.150000 minutes or 9.000000 seconds
i!Process=121909; Thread=140705011352128; GPU=0; Handle=POINTER (IN HEX:0x0xa936810); StreamId=POINTER (IN HEX:0x0x2); MathMode=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)

2 key differences noticed are: CUDA.jl v1.3.3 uses CUDA 11.0 while CUDA 2.1.0 uses CUDA 11.1.
Plus, different functions seem to be called: cublasSgemm_v2 for v.1.3.3 and cublasGemmEx in v2.1.0.

From a quick read, SGEMM would be a specialized on single precision. And that specialization would not be picked up under v2.1.0? However, that wouldn’t explain why you’d get the same function called on your side on both v1.3.3 and v2.1.0. Long shot hypothesis: the compilation from MS Visual Studio?

CUDA 1.3.3:

julia> x1 * x2
I! cuBLAS (v11.0) function cublasStatus_t __stdcall cublasCreate_v2(cublasContext **) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x000000000BA2CA10)
i! Time: 2020-11-09T20:47:09 elapsed from start 0.100000 minutes or 6.000000 seconds
i!Process=16420; Thread=12072; GPU=0; Handle=POINTER (IN HEX:0x0000000000000000)
i! COMPILED WITH: Microsoft Visual Studio / 192027508.1
I! cuBLAS (v11.0) function cublasStatus_t __stdcall cublasSgemm_v2(cublasContext *, cublasOperation_t, cublasOperation_t, int, int, int, const float *, const float *, int, const float *, int, const float *, float *, int) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x000000007CF26010)
i!  transa: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  transb: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  m: type=int; val=128
i!  n: type=int; val=1024
i!  k: type=int; val=256
i!  alpha: type=float; val=POINTER (IN HEX:0x000000000CC8D490)
i!  A: type=float; val=POINTER (IN HEX:0x0000000703E00000)
i!  lda: type=int; val=128
i!  B: type=float; val=POINTER (IN HEX:0x0000000703E20000)
i!  ldb: type=int; val=256
i!  beta: type=float; val=POINTER (IN HEX:0x000000000CC8D6D0)
i!  C: type=float; val=POINTER (IN HEX:0x0000000703F20000)
i!  ldc: type=int; val=128
i! Time: 2020-11-09T20:47:10 elapsed from start 0.116667 minutes or 7.000000 seconds
i!Process=16420; Thread=12072; GPU=0; Handle=POINTER (IN HEX:0x000000007CF26010); StreamId=POINTER (IN HEX:0x0000000000000000) (defaultStream); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: Microsoft Visual Studio / 192027508.1
128×1024 CuArray{Float32,2}:

CUDA 2.1.0:

julia> x1 * x2
I! cuBLAS (v11.1) function cublasStatus_t __stdcall cublasGemmEx(cublasContext *, cublasOperation_t, cublasOperation_t, int, int, int, const void *, const void *, cudaDataType_t, int, const void *, cudaDataType_t, int, const void *, void *, cudaDataType_t, int, cublasComputeType_t, cublasGemmAlgo_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x000000008B914F10)
i!  transa: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  transb: type=cublasOperation_t; val=CUBLAS_OP_N(0)
i!  m: type=int; val=128
i!  n: type=int; val=1024
i!  k: type=int; val=256
i!  alpha: type=void; val=POINTER (IN HEX:0x000000003F55AF30)
i!  A: type=void; val=POINTER (IN HEX:0x0000000703F40000)
i!  Atype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  lda: type=int; val=128
i!  B: type=void; val=POINTER (IN HEX:0x0000000704100000)
i!  Btype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldb: type=int; val=256
i!  beta: type=void; val=POINTER (IN HEX:0x000000003F55AF40)
i!  C: type=void; val=POINTER (IN HEX:0x0000000703E00000)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  ldc: type=int; val=128
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F(68)
i!  algo: type=SOME TYPE; val=CUBLAS_GEMM_DEFAULT(-1)
i! Time: 2020-11-09T20:40:28 elapsed from start 535.250000 minutes or 32115.000000 seconds
i!Process=13808; Thread=17260; GPU=0; Handle=POINTER (IN HEX:0x000000008B914F10); StreamId=POINTER (IN HEX:0x0000000000000002); MathMode=CUBLAS_TENSOR_OP_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
i! COMPILED WITH: Microsoft Visual Studio / 192027508.1
128×1024 CuArray{Float32,2}:

I wanted to verify you were calling cublasGemmEx and not the GPUArrays fallback (which I guess would be even slower). This routine is specialized too, but just offers a generic entry-point where you pass the datatypes as arguments (instead of using different functions).

I guess we could still use the old methods for now, but I presume they’ll get deprecated at some point. Could you run under a profiler to confirm the loss of performance is due to the actual API call, and not some Julia host code?

1.3.3

julia> exit()
==13268== Profiling application: julia
==13268== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.03%  55.457us         1  55.457us  55.457us  55.457us  volta_sgemm_32x32_sliced1x4_nn
                    2.97%  1.6960us         1  1.6960us  1.6960us  1.6960us  [CUDA memcpy HtoD]
      API calls:   99.91%  1.58954s         3  529.85ms  1.1000us  984.31ms  cudaFree
                    0.06%  1.0030ms         4  250.75us  8.4000us  928.50us  cudaMalloc
                    0.01%  86.000us        88     977ns     300ns  10.600us  cudaFuncSetAttribute
                    0.00%  62.200us         1  62.200us  62.200us  62.200us  cudaLaunchKernel
                    0.00%  56.300us         2  28.150us  13.300us  43.000us  cuDeviceTotalMem
                    0.00%  34.500us        18  1.9160us     700ns  14.700us  cudaEventCreateWithFlags
                    0.00%  31.700us         1  31.700us  31.700us  31.700us  cudaMemcpy
                    0.00%  28.400us       196     144ns     100ns     800ns  cuDeviceGetAttribute
                    0.00%  18.200us         1  18.200us  18.200us  18.200us  cuMemAlloc
                    0.00%  14.600us         1  14.600us  14.600us  14.600us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.00%  6.9000us         3  2.3000us     300ns  6.2000us  cuDeviceGetCount
                    0.00%  4.7000us        12     391ns     200ns  1.3000us  cudaDeviceGetAttribute
                    0.00%  4.1000us         2  2.0500us     900ns  3.2000us  cuInit
                    0.00%  1.5000us         1  1.5000us  1.5000us  1.5000us  cudaGetDevice
                    0.00%  1.4000us         1  1.4000us  1.4000us  1.4000us  cuCtxGetCurrent
                    0.00%  1.3000us         2     650ns     600ns     700ns  cuDeviceGetName
                    0.00%  1.0000us         2     500ns     300ns     700ns  cuDriverGetVersion
                    0.00%     500ns         2     250ns     200ns     300ns  cuDeviceGet
                    0.00%     500ns         2     250ns     200ns     300ns  cuDeviceGetLuid
                    0.00%     400ns         2     200ns     200ns     200ns  cuDeviceGetUuid
                    0.00%     300ns         1     300ns     300ns     300ns  cudaGetLastError

2.1.0

julia> exit()
==1444== Profiling application: julia
==1444== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   93.56%  210.95us         1  210.95us  210.95us  210.95us  volta_s884gemm_128x128_ldg8_f2f_nn
                    5.80%  13.088us         1  13.088us  13.088us  13.088us  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *)
                    0.64%  1.4400us         1  1.4400us  1.4400us  1.4400us  [CUDA memcpy HtoD]
      API calls:   99.86%  1.05177s         3  350.59ms  1.3000us  688.71ms  cudaFree
                    0.08%  806.50us         4  201.63us  3.3000us  766.00us  cudaMalloc
                    0.04%  423.50us         2  211.75us  18.100us  405.40us  cuMemAlloc
                    0.01%  88.100us         2  44.050us  9.0000us  79.100us  cudaLaunchKernel
                    0.00%  48.300us         2  24.150us  15.700us  32.600us  cuDeviceTotalMem
                    0.00%  45.300us        88     514ns     300ns  3.4000us  cudaFuncSetAttribute
                    0.00%  32.300us       200     161ns     100ns  1.5000us  cuDeviceGetAttribute
                    0.00%  24.800us         1  24.800us  24.800us  24.800us  cudaMemcpy
                    0.00%  17.800us        18     988ns     400ns  6.5000us  cudaEventCreateWithFlags
                    0.00%  8.8000us         1  8.8000us  8.8000us  8.8000us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.00%  6.3000us         3  2.1000us     300ns  5.3000us  cuDeviceGetCount
                    0.00%  5.2000us         2  2.6000us     600ns  4.6000us  cuInit
                    0.00%  4.7000us        12     391ns     200ns  1.3000us  cudaDeviceGetAttribute
                    0.00%  1.7000us         2     850ns     300ns  1.4000us  cuCtxGetCurrent
                    0.00%  1.5000us         2     750ns     300ns  1.2000us  cuDriverGetVersion
                    0.00%  1.3000us         2     650ns     500ns     800ns  cuDeviceGetName
                    0.00%  1.3000us         1  1.3000us  1.3000us  1.3000us  cudaGetDevice
                    0.00%     600ns         2     300ns     300ns     300ns  cuDeviceGet
                    0.00%     500ns         2     250ns     200ns     300ns  cuDeviceGetLuid
                    0.00%     400ns         2     200ns     200ns     200ns  cuDeviceGetUuid
                    0.00%     300ns         2     150ns     100ns     200ns  cudaGetLastError

If I’m interpreting correctly, there’s an almost 4x time difference between the kernels doing the actual multiplication (55us vs 210us)? That looks like a suspiciously large gap, isn’t it?

Would the difference in methods used have to do with CUDA 11.0/11.1 itself, or more about how the dispatch is handled by CUDA.jl? And just to confirm, you don’t have this difference in performance or in the cublas function invoked between 1.3.3 and 2.1.0 on you side?

Yeah 2.x is consistently faster on my system, as reported above.

For the profiling results, better run the multiplication a couple of times to average out timings. This large difference could be a fluke.

Running 10 times, it still shows significant difference:

1.3.3:

==2856== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.67%  520.52us        10  52.052us  51.392us  55.585us  volta_sgemm_32x32_sliced1x4_nn
                    0.33%  1.7280us         1  1.7280us  1.7280us  1.7280us  [CUDA memcpy HtoD]

2.1.0:

==10276== Profiling application: julia
==10276== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.13%  2.1096ms        10  210.96us  210.60us  211.62us  volta_s884gemm_128x128_ldg8_f2f_nn
                    5.81%  130.21us        10  13.020us  12.736us  13.345us  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *

The 2.1.0 function s884gemm seems related to tensor cores operation, whereas being on a GTX 1660, I don’t think there’s any, maybe it could results in some overhead?

So is what I’m experiencing looking like a bug?

Possibly. Could you test with CUDA#master, which should use CUDA 11.1 Update 1 (verify with versioninfo())?

Unfortunately, no luck with latest drivers and CUDA#master / CUDA 11.1.1, same execution time as before.

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 457.30.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+457.30
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: GeForce GTX 1660 Ti with Max-Q Design (sm_75, 5.152 GiB / 6.000 GiB available)
==19444== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.01%  2.1188ms        10  211.88us  211.59us  212.42us  volta_s884gemm_128x128_ldg8_f2f_nn
                    5.99%  134.92us        10  13.491us  13.088us  13.920us  void splitKreduce_kernel<float, float, float, float>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *)

Another possibility is that we’re misconfiguring the math mode, or at least doing something less performant than cublasSgemm was doing before. Could you open an issue on CUDA.jl? Please include the CUBLAS debug output and benchmark/profiling results for both cases. I’ll try and reproduce on another system if I have some time.

Sure I’ll do, thanks a lot for you time, much appreciated!
I shall also be able to make a test on another machine on Ubuntu and RTX 2080.