This example from the CUDA.jl documentation does not work for me:

https://cuda.juliagpu.org/stable/api/kernel/#Element-access-and-broadcasting

using Test

using CUDA

a     = rand(Float16, (16, 16))
b     = rand(Float16, (16, 16))
c     = rand(Float32, (16, 16))

a_dev = CuArray(a)
b_dev = CuArray(b)
c_dev = CuArray(c)
d_dev = similar(c_dev)

function kernel(a_dev, b_dev, c_dev, d_dev)
    conf = WMMA.Config{16, 16, 16, Float32}

    a_frag = WMMA.load_a(pointer(a_dev), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b_dev), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c_dev), 16, WMMA.ColMajor, conf)

    c_frag = 0.5f0 .* c_frag

    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d_dev), d_frag, 16, WMMA.ColMajor, conf)

    return
end

@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
d = Array(d_dev)

@test all(isapprox.(a * b + 0.5 * c, d; rtol=0.01))

Produces the error:

ERROR: LLVM error: Cannot select: intrinsic %llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32

Does anyone know the solution?

Kind regards

Which Julia version are you using? My guess is that you likely need to upgrade to get a newer LLVM.
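
A quick way to report both versions from the REPL is sketched below. Note that `Base.libllvm_version` is an unexported internal binding, so treat its availability as an assumption; `CUDA.versioninfo()` prints the same information along with the toolkit and device details.

```julia
# Print the running Julia version and the LLVM version it was built against.
# Base.libllvm_version is internal to Base and may change between releases.
println("Julia: ", VERSION)
println("LLVM:  ", Base.libllvm_version)
```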

Works here on 1.8.

I gave it a shot on 1.8.0 yesterday. I will try again from a clean terminal, and if it still does not work I will try updating to the latest 1.8.x.

Kind regards

Also make sure you’re using Julia as downloaded from the homepage or installed via juliaup. Other sources (e.g., your Linux distro) are not supported.

I’m getting the same error with Julia 1.8.3.

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.161.3, for CUDA 11.4
CUDA driver 11.4

Libraries:

  • CUBLAS: 11.10.1
  • CURAND: 10.2.10
  • CUFFT: 10.7.2
  • CUSOLVER: 11.3.5
  • CUSPARSE: 11.7.3
  • CUPTI: 17.0.0
  • NVML: 11.0.0+470.161.3
  • CUDNN: 8.30.2 (for CUDA 11.5.0)
  • CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:

  • Julia: 1.8.3
  • LLVM: 13.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
  • Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
0: NVIDIA GeForce GTX 1070 (sm_61, 6.201 GiB / 7.921 GiB available)
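
One thing worth checking in this output: the device is reported as sm_61. WMMA targets tensor cores, which as far as I know are only available from compute capability 7.0 (Volta) onwards, so the "Cannot select" failure may simply mean the backend has no WMMA instructions for this GPU, regardless of the Julia or LLVM version. Below is a minimal sketch of a guard under that assumption. `supports_wmma` is a hypothetical helper name, not part of CUDA.jl, and the 7.0 threshold is my assumption rather than something the thread confirms.

```julia
using CUDA

# Hypothetical helper (not part of CUDA.jl): WMMA needs tensor cores,
# assumed here to require compute capability 7.0 or newer.
supports_wmma(cap::VersionNumber) = cap >= v"7.0"

# Only query the device when a working GPU setup is present.
if CUDA.functional()
    cap = CUDA.capability(CUDA.device())   # compute capability as a VersionNumber
    if supports_wmma(cap)
        println("sm_$(cap.major)$(cap.minor): WMMA should be available")
    else
        println("sm_$(cap.major)$(cap.minor): no tensor cores, WMMA kernels are not expected to compile")
    end
end
```

If the guard reports no tensor cores, the kernel above would need to run on a different GPU rather than on a different Julia version.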