Bug with Julia 1.7.1 and CUDA 3.3

Hi,
The following lines of code:

    b, r = CUDA.threadIdx().x, CUDA.blockIdx().x
    Ush = @cuStaticSharedMem(T, (D,2))

    for id1 in N:-1:1
        bu1, ru1 = up((b, r), id1, lp)
        Ush[b,1] = U[b,id1,r]

        for id2 = 1:id1-1
            bu2, ru2 = up((b, r), id2, lp)
            Ush[b,2] = U[b,id2,r]
            sync_threads()
            ipl = ipl + 1
            
            if ru2 == r
                gt2 = Ush[bu2,1]
            else
                gt2 = U[bu2,id1,ru2]
            end
           # Do some computation with gt2

should always store the quantity U[bu2,id1,ru2] in the variable gt2; the only difference is that the value is read from the shared memory Ush when it is available there (i.e. when ru2 == r).
Unfortunately this is not the case with the latest Julia version, 1.7.1. The type T is a bit complex, but the following print statement:

    CUDA.@cuprintln("[point: $b,$r; up: $bu2,$ru2, plane: $ipl]: A ",
        real(gt2.u11), " ", imag(gt2.u11), "   ",
        real(gt2.u12), " ", imag(gt2.u12), "   ",
        real(gt2.u13), " ", imag(gt2.u13), "   ",
        real(gt2.u21), " ", imag(gt2.u21), "   ",
        real(gt2.u22), " ", imag(gt2.u22), "   ",
        real(gt2.u23), " ", imag(gt2.u23), "  ||  ",
        real(U[bu2,id1,ru2].u11), " ", imag(U[bu2,id1,ru2].u11), "   ",
        real(U[bu2,id1,ru2].u12), " ", imag(U[bu2,id1,ru2].u12), "   ",
        real(U[bu2,id1,ru2].u13), " ", imag(U[bu2,id1,ru2].u13), "   ",
        real(U[bu2,id1,ru2].u21), " ", imag(U[bu2,id1,ru2].u21), "   ",
        real(U[bu2,id1,ru2].u22), " ", imag(U[bu2,id1,ru2].u22), "   ",
        real(U[bu2,id1,ru2].u23), " ", imag(U[bu2,id1,ru2].u23))

produces the following with Julia 1.7.1:

[point: 10,10; up: 14,10, plane: 5]: A 0.115366 0.076161   -0.439492 -0.397009   0.297175 -0.505585   0.547758 0.072546   0.579036 0.520827   0.034789 -0.350083   0.471146 0.218424  ||  -0.155018 0.208531   0.734900 0.210773   -0.422495 -0.411679   -0.299919 0.233662   -0.273552 -0.565263   0.118974 -0.668538

This only happens sometimes (i.e. for some values of b, r, ipl) without any illuminating pattern. Comparing with a C implementation, Julia 1.7.1 produces wrong results, while older versions (1.6.x, 1.5.x) were producing results correct up to machine precision.

I can provide a link to working code that reproduces the bug, but it will not be a simple piece of code…

Any advice?

Many thanks!

Try running under compute-sanitizer to see if it isn’t a bug with your implementation (a race, missing initialization, etc). You can use the one provided by CUDA.jl, see CUDA.compute_sanitizer(). See the documentation for more details, Compute Sanitizer User Manual :: Compute Sanitizer Documentation. It’s recommended to run with --launch-timeout=0 --target-processes=all --report-api-errors=no.
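
For example, something along these lines should work (a minimal sketch; it assumes CUDA.compute_sanitizer() returns the path to the bundled binary, and path/to/script.jl is a placeholder for your own test script):

    using CUDA

    # Path to the compute-sanitizer binary shipped with the CUDA artifact
    # (assumption: CUDA.compute_sanitizer() returns that path).
    sanitizer = CUDA.compute_sanitizer()

    # Re-run the failing script under the sanitizer with the recommended flags.
    run(`$sanitizer --launch-timeout=0 --target-processes=all --report-api-errors=no $(Base.julia_cmd()) --project=. path/to/script.jl`)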

Many thanks,

I have run both versions, and get 0 errors. With Julia 1.6.5:

[aramos@ciclope2 latticegpu.jl]$ /opt/nvidia/hpc_sdk/Linux_x86_64/21.7/cuda/11.4/compute-sanitizer/compute-sanitizer  --launch-timeout=0 --target-processes=all --report-api-errors=no ~/julia/julia-1.6.5/bin/julia --project=. tests/oqcd.jl -c data/test_8x8x8x8_pbcn1 -L 8 -T 8 
========= COMPUTE-SANITIZER

CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.4.0
NVIDIA driver 470.57.2

Libraries: 
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.5
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

2 devices:
  0: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  Activating environment at `~/code/latticegpu.jl/Project.toml`
 ## Analizing configuration: data/test_8x8x8x8_pbcn1
Lattice dimensions:       4
Lattice size:             8 x 8 x 8 x 8
Time boundary conditions: PERIODIC
Thread block size:        4 x 4 x 4 x 4     [256] (Number of blocks: [16])
Twist tensor: (0, 0, 0, 0, 0, 0)

# [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (plaq: [1.7645057538668225])
Group:  SU3{Float64}
 - beta:              6.7
 - c0:                1.0
 - cG:                (0.0, 0.0)

 ## 
 # Plaquette: 1.764505753866823
 ## 
========= ERROR SUMMARY: 0 errors

The key check is that the value of plaq read from the file matches the computed one (1.7645…).

If I repeat the same with Julia 1.7.1:

[aramos@ciclope2 latticegpu.jl]$ /opt/nvidia/hpc_sdk/Linux_x86_64/21.7/cuda/11.4/compute-sanitizer/compute-sanitizer  --launch-timeout=0 --target-processes=all --report-api-errors=no ~/julia/julia-1.7.1/bin/julia --project=. tests/oqcd.jl -c data/test_8x8x8x8_pbcn1 -L 8 -T 8 
========= COMPUTE-SANITIZER
CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.4.0
NVIDIA driver 470.57.2

Libraries: 
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

2 devices:
  0: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  Activating project at `~/code/latticegpu.jl`
 ## Analizing configuration: data/test_8x8x8x8_pbcn1
Lattice dimensions:       4
Lattice size:             8 x 8 x 8 x 8
Time boundary conditions: PERIODIC
Thread block size:        4 x 4 x 4 x 4     [256] (Number of blocks: [16])
Twist tensor: (0, 0, 0, 0, 0, 0)

# [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (plaq: [1.7645057538668225])
Group:  SU3{Float64}
 - beta:              6.7
 - c0:                1.0
 - cG:                (0.0, 0.0)

 ## 
 # Plaquette: 1.691461289655706
 ## 
========= ERROR SUMMARY: 0 errors

Everything runs fine, but the value of plaq stored in the file data/test_8x8x8x8_pbcn1 does not match the computed one. The origin of this mismatch is the lines of code reported in my first post.

Again, many thanks,

Are you using the same version of CUDA.jl across Julia versions?

I missed the v3.3 in the title, so I assume so. Try upgrading CUDA.jl; this may be a ptxas bug (we’ve seen a couple like it).

Ok, many thanks.

That will require a bit of work on my side. Some changes in CUDA.jl (we discussed this here: Problem with CUDAv3 - #10 by maleadt) broke my rather long code on recent versions. But I will work on that and let you know.

Again many thanks!

Just wrap threadIdx etc with an inline function that casts every tuple member to Int64.
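
For instance, a minimal sketch of such wrappers (the helper names here are made up):

    using CUDA

    # Hypothetical helpers: return thread/block indices as Int64 so that all
    # downstream index arithmetic is done with 64-bit integers.
    @inline thread_idx() = (Int64(threadIdx().x), Int64(threadIdx().y), Int64(threadIdx().z))
    @inline block_idx()  = (Int64(blockIdx().x), Int64(blockIdx().y), Int64(blockIdx().z))

Inside the kernel you would then call thread_idx() and block_idx() instead of indexing threadIdx()/blockIdx() directly.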

Hi,

OK, so I have substituted threadIdx().x with Int64(threadIdx().x) and so on. My code now works with the latest CUDA.jl version (3.6.2).

Unfortunately the problems are still present, just not in the same place. v1.6.5 continues to give correct results for me with the new code.

Something looks buggy, either in my code (but it only shows up on 1.7, despite an exhaustive battery of tests) or in Julia 1.7.1. Unfortunately I cannot produce a simple MWE that reproduces the issue.

I have observed that Julia 1.7.1 comes with LLVM 12.0.1, while Julia 1.6.5 ships LLVM 11.0.1. Is there a way to force a given Julia version to use a specific toolchain? The rest of the packages should all be the same, because I use a dedicated environment and instantiate the project before any tests.

You’ll have to build your own version of Julia to test that. Luckily Julia 1.7 does still support LLVM 11, even though it ships LLVM 12; master has already dropped support for that version ([LLVM] Drop support for LLVM 11 by vchuravy · Pull Request #43371 · JuliaLang/julia · GitHub). So try building Julia with LLVM_VER=11.0.0. If that build passes your tests, you’ll have to compare the @device_code_llvm output of individual kernels.
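
For reference, something like this dumps the device-side LLVM IR for a single launch (a sketch; the kernel name and arguments are placeholders for one of your own kernels), which you can then diff between the two builds:

    using CUDA

    # Prints the LLVM IR generated for the device code of this launch.
    # krnl_plaq!, U and lp stand in for one of your kernels and its arguments.
    @device_code_llvm @cuda threads=256 blocks=16 krnl_plaq!(U, lp)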

Would it be useful to compare the output of @device_code_llvm between v1.6 and v1.7?

Thanks!

If something jumps out, then sure, it can be useful. But actually building Julia 1.7 with LLVM 11 would confirm that it’s a code-generation problem, and not something else related to the Julia version difference.

Hi again,

I cloned the repository and checked out v1.7.1. Then I created a Make.user file with the content:

    LLVM_VER = 11.0.0

This is what the docs suggest (Working with LLVM · The Julia Language). Unfortunately, after starting Julia I get:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-unknown-linux-gnu)
  CPU: AMD Ryzen 7 PRO 4750U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

(i.e. still v12). What am I doing wrong?

I have not managed to get a working v1.7 build with LLVM 11.0.1. I also took a look at the @device_code_llvm output from v1.6 and compared it with v1.7, but this is clearly beyond my capabilities… Out of desperation, I have wrapped the kernel in an

    @inbounds begin
    ...
    end

and suddenly all results are correct. I still think that the code is correct (CuArray indices are computed only in @inline routines, and these have been checked multiple times), but obviously all mistakes look unlikely until you find them. So my questions are:

  1. With this extra information, is there any way we can find out what is going on? Would the output of @device_code_llvm with/without @inbounds help in any way? Can anyone point me to what to look for?
  2. Is there anything else that I can do to make sure the bug is not in my code?

Many thanks!

Then it looks like yet another instance of https://github.com/JuliaGPU/CUDAnative.jl/issues/4, where our use of exceptions (introducing divergent control flow) breaks ptxas. I’ve never seen it happen on an sm_80 GPU though, so this would be worth reducing and reporting to NVIDIA.

Before that, could you first post a versioninfo() with the latest version of CUDA.jl you’re using, and confirm that it selects toolkit v11.5? The last one you posted was still using v11.3.1.

Hi,

I will try to produce a few tests that show the error and then we can start reducing from there.

I have a related question. I need to run this code on several clusters, and I need to ensure reproducibility. Is there a way to force the artifact installation to pick specific versions? What is the standard way to ensure full reproducibility of the results?

Thanks!

You can set JULIA_CUDA_VERSION to e.g. 10.2. Currently the CUDA artifact version is managed by CUDA.jl, so it isn’t part of the Manifest, which is what you’d otherwise use for version reproducibility.
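
For example (a sketch; the version string is only an illustration), set the variable before CUDA.jl is first loaded:

    # In the shell, before starting Julia:
    #   export JULIA_CUDA_VERSION=11.4
    # or from within Julia, before `using CUDA`:
    ENV["JULIA_CUDA_VERSION"] = "11.4"
    using CUDA
    CUDA.versioninfo()   # should now report the pinned toolkit version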

Hi again,

I have done some extensive tests. Setting JULIA_CUDA_VERSION does not seem to make much difference. For me the last combination that really “works” is Julia 1.5.4 with CUDA v2.4.3.

  • Julia v1.7 does not allow installing CUDA v2.4.3; precompilation fails.
  • Julia v1.6 allows installing CUDA v2.4.3, but when running the code I get:
    Internal error: encountered unexpected error during compilation of #cached_compilation#107:
  • Any CUDA v3.3.3/v3.5.0 produces wrong results in one place or another, depending (apparently randomly) on the Julia version. Not even the routine that crashes is always the same… The only pattern is that it is somehow related to making repeated use of the same shared memory space with several sync_threads(). The use of @inbounds ameliorates the issue, but does not solve it.

I have a few questions:

  • Was the introduction of exceptions (divergent control flow) that you mentioned introduced in CUDA v3?
  • My code makes heavy use of shared memory with my own user-defined types. This seems to be what triggers the bugs. Do you have any recommendation on how to proceed? Do you make extensive use of these features?

Many thanks! I am sorry for all these questions…

Nothing of the sort has been introduced in CUDA v3, but other (unrelated) changes in code generation can trigger the bug in ptxas. For example, an innocuous change to where bounds checking happens with abstract arrays in Julia 1.7 had to be reverted in CUDA.jl: Revert an upstream change to work around bad codegen. · JuliaGPU/CUDA.jl@df08dd5 · GitHub

Reduce the bug and file an issue with NVIDIA; it’s really the only way to get this properly fixed. Can’t you isolate the kernel, provide artificial inputs, and check that the output is valid? Then we can continue and reduce the kernel – this can be done in an automated fashion, e.g. using GitHub - maleadt/creduce_julia, because we have a good and a bad environment (so the script can perform a random change, and then verify that the kernel executes on 1.6 but yields different results on 1.7).

One thing you could try is to make the workaround here unconditional (i.e. make sure it also applies to your device with compute capability 8.0): https://github.com/JuliaGPU/CUDA.jl/blob/605c8221e4d16fe5d219462326a455d00c54fc7d/src/device/quirks.jl#L42-L55

Did you get to try out that suggestion? It would be very valuable to know if it helps, and/or if you could create an MWE we can file with NVIDIA.