Bug with Julia 1.7.1 and CUDA 3.3

Sorry for not following up. I have been caught up in a project application with a short deadline and I am a bit overwhelmed…

If I understand your proposal correctly, it is enough to add the following at the beginning of my code, right after loading CUDA:

    @device_override Base.@propagate_inbounds function Base.getindex(iter::CartesianIndices{N,R},
                                                                     I::Vararg{Int, N}) where {N,R}
        CartesianIndex(getindex.(iter.indices, I))
    end

Is my understanding correct? If so, I will try this as soon as I can!

It’s easier to dev CUDA.jl and change the package.
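
Something along these lines, assuming the default depot location (the exact file to edit inside CUDA.jl depends on the method in question):

    using Pkg
    Pkg.develop("CUDA")   # checks out CUDA.jl into ~/.julia/dev/CUDA for local editing
    # ... edit the relevant method under ~/.julia/dev/CUDA/src/ ...
    using CUDA            # subsequent sessions pick up the modified package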

Hi,

I tried removing the branch so that the code path for compute_capability() < sv"7" is always executed, and it still shows the bug. It does not seem to affect the issue.

I will try to produce a cleaned-up version that reproduces the bug.

You can also try CUDA.jl#master, which now uses an updated ptxas from CUDA 11.6 (verify using CUDA.versioninfo()).
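
For example, using the standard PackageSpec form to request a branch:

    using Pkg
    Pkg.add(PackageSpec(name="CUDA", rev="master"))  # track the development branch of CUDA.jl
    using CUDA
    CUDA.versioninfo()  # check the toolchain section to verify which toolkit/ptxas is in use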

Hi again,

I have been producing results with the old CUDA version until now (I was in a rush, and nothing seemed to help). I have now decided to try again, and have found something interesting.

On our local cluster there are two partitions: one for production runs (V100, A100) and a small interactive partition for tests. I have found that with the latest Julia version (1.7.2 with CUDA v3.3.3) the code runs fine (no bug) on the interactive partition. On the production partition, however, the code runs but produces wrong results.

Here is an sdiff of both runs:

Julia Version 1.7.2						Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)			Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:							Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)				  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz		      |	  CPU: AMD EPYC 7532 32-Core Processor
  WORD_SIZE: 64							  WORD_SIZE: 64
  LIBM: libopenlibm						  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)		      |	  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
Environment:							Environment:
  JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495		  JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495
  JULIA_PATH = /usr/local/julia					  JULIA_PATH = /usr/local/julia
  JULIA_DEPOT_PATH = /lustre/alberto/julia_packages		  JULIA_DEPOT_PATH = /lustre/alberto/julia_packages
  JULIA_VERSION = 1.7.2						  JULIA_VERSION = 1.7.2
      Status `/lustre/alberto/code/test_latticegpu.jl/Project	      Status `/lustre/alberto/code/test_latticegpu.jl/Project
  [5e92007d] ADerrors v0.1.0 `https://gitlab.ift.uam-csic.es/	  [5e92007d] ADerrors v0.1.0 `https://gitlab.ift.uam-csic.es/
  [c7e460c6] ArgParse v1.1.4					  [c7e460c6] ArgParse v1.1.4
  [375f315e] BDIO v0.1.0 `https://gitlab.ift.uam-csic.es/albe	  [375f315e] BDIO v0.1.0 `https://gitlab.ift.uam-csic.es/albe
  [052768ef] CUDA v3.3.3					  [052768ef] CUDA v3.3.3
  [944b1d66] CodecZlib v0.7.0					  [944b1d66] CodecZlib v0.7.0
  [958c3683] LatticeGPU v0.1.0 `https://igit.ific.uv.es/alram	  [958c3683] LatticeGPU v0.1.0 `https://igit.ific.uv.es/alram
  [91a5bcdd] Plots v1.23.5					  [91a5bcdd] Plots v1.23.5
  [a759f4b9] TimerOutputs v0.5.13				  [a759f4b9] TimerOutputs v0.5.13
  [3bb67fe8] TranscodingStreams v0.9.6				  [3bb67fe8] TranscodingStreams v0.9.6
  [b77e0a4c] InteractiveUtils					  [b77e0a4c] InteractiveUtils
  [44cfe95a] Pkg						  [44cfe95a] Pkg
  [de0858da] Printf						  [de0858da] Printf
  [9a3f8284] Random						  [9a3f8284] Random
  [fa267f1f] TOML						  [fa267f1f] TOML
CUDA toolkit 11.3.1, artifact installation			CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.5.0						CUDA driver 11.5.0
NVIDIA driver 495.29.5						NVIDIA driver 495.29.5

Libraries: 							Libraries: 
- CUBLAS: 11.5.1						- CUBLAS: 11.5.1
- CURAND: 10.2.4						- CURAND: 10.2.4
- CUFFT: 10.4.2							- CUFFT: 10.4.2
- CUSOLVER: 11.1.2						- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0						- CUSPARSE: 11.6.0
- CUPTI: 14.0.0							- CUPTI: 14.0.0
- NVML: 11.0.0+495.29.5						- NVML: 11.0.0+495.29.5
- CUDNN: 8.20.0 (for CUDA 11.3.0)				- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)				- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:							Toolchain:
- Julia: 1.7.2							- Julia: 1.7.2
- LLVM: 12.0.1							- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.	- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_5	- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_5

1 device:							1 device:
  0: Tesla P100-PCIE-12GB (sm_60, 11.910 GiB / 11.912 GiB ava |	  0: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB av
 ## Analizing configuration: /lustre/alberto/code/test_lattic	 ## Analizing configuration: /lustre/alberto/code/test_lattic
Lattice dimensions:       4					Lattice dimensions:       4
Lattice size:             8 x 8 x 8 x 8				Lattice size:             8 x 8 x 8 x 8
Time boundary conditions: PERIODIC				Time boundary conditions: PERIODIC
Thread block size:        4 x 4 x 4 x 4     [256] (Number of 	Thread block size:        4 x 4 x 4 x 4     [256] (Number of 
Twist tensor: (0, 0, 0, 0, 0, 0)				Twist tensor: (0, 0, 0, 0, 0, 0)

# [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (pla	# [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (pla
Group:  SU3{Float64}						Group:  SU3{Float64}
 - beta:              6.7					 - beta:              6.7
 - c0:                1.0					 - c0:                1.0
 - cG:                (0.0, 0.0)				 - cG:                (0.0, 0.0)

 ## 								 ## 
 # Plaquette: 1.770069921558924				      |	 # Plaquette: 1.6244204140351874
 ## 								 ## 
WILSON flow integrator						WILSON flow integrator
 * Two stage scheme. Coefficients:				 * Two stage scheme. Coefficients:
    stg 1: -0.4722222222222222 0.8888888888888888		    stg 1: -0.4722222222222222 0.8888888888888888
    stg 2: -1.0 0.75						    stg 2: -1.0 0.75
 * Fixed step size parameters: eps = 0.01			 * Fixed step size parameters: eps = 0.01
 * Adaptive step size parameters: tol = 1.0e-7			 * Adaptive step size parameters: tol = 1.0e-7
    - max eps:      0.1						    - max eps:      0.1
    - initial eps:  0.005					    - initial eps:  0.005
    - safety scale: 0.9						    - safety scale: 0.9

     FLOW   t=  0.0000:    6.045352321554e+04    8.3610520410 |	     FLOW   t=  0.0000:    6.045352321554e+04    8.3807340866
     FLOW   t=  0.1000:    2.672729605197e+04    6.2484621765 |	     FLOW   t=  0.1000:    2.672729605197e+04    6.3127741601
     FLOW   t=  0.2000:    1.221783495047e+04    4.1479042540 |	     FLOW   t=  0.2000:    1.221783495047e+04    4.2806249807
     FLOW   t=  0.3000:    6.293795491842e+03    2.7906809002 |	     FLOW   t=  0.3000:    6.293795491842e+03    2.9780991633
     FLOW   t=  0.4000:    3.696042514745e+03    1.9883567766 |	     FLOW   t=  0.4000:    3.696042514745e+03    2.2132290274
     FLOW   t=  0.5000:    2.435031662979e+03    1.5047723103 |	     FLOW   t=  0.5000:    2.435031662979e+03    1.7546378274
     FLOW   t=  0.6000:    1.757791223073e+03    1.1983335066 |	     FLOW   t=  0.6000:    1.757791223073e+03    1.4651451702
     FLOW   t=  0.7000:    1.359990889456e+03    9.9351194056 |	     FLOW   t=  0.7000:    1.359990889456e+03    1.2721325098
     FLOW   t=  0.8000:    1.107906613390e+03    8.4986022984 |	     FLOW   t=  0.8000:    1.107906613390e+03    1.1369505321
     FLOW   t=  0.9000:    9.377549064903e+02    7.4486486487 |	     FLOW   t=  0.9000:    9.377549064903e+02    1.0382003125
     FLOW   t=  1.0000:    8.167956501505e+02    6.6541140425 |	     FLOW   t=  1.0000:    8.167956501505e+02    9.6347265001
ZEUTHEN flow integrator						ZEUTHEN flow integrator
 * Two stage scheme. Coefficients:				 * Two stage scheme. Coefficients:
    stg 1: -0.4722222222222222 0.8888888888888888		    stg 1: -0.4722222222222222 0.8888888888888888
    stg 2: -1.0 0.75						    stg 2: -1.0 0.75
 * Fixed step size parameters: eps = 0.01			 * Fixed step size parameters: eps = 0.01
 * Adaptive step size parameters: tol = 1.0e-7			 * Adaptive step size parameters: tol = 1.0e-7
    - max eps:      0.1						    - max eps:      0.1
    - initial eps:  0.005					    - initial eps:  0.005
    - safety scale: 0.9						    - safety scale: 0.9

     FLOW   t=  0.0000:    6.045352321554e+04    8.3610520410 |	     FLOW   t=  0.0000:    6.045352321554e+04    8.3807340866
     FLOW   t=  0.1000:    2.240951045804e+04    5.8074230709 |	     FLOW   t=  0.1000:    2.240951045804e+04    5.8800826658
     FLOW   t=  0.2000:    9.301649676976e+03    3.6198040811 |	     FLOW   t=  0.2000:    9.301649676976e+03    3.7691403564
     FLOW   t=  0.3000:    4.685618161693e+03    2.3886191584 |	     FLOW   t=  0.3000:    4.685618161693e+03    2.5921602145
     FLOW   t=  0.4000:    2.811535288892e+03    1.7118129634 |	     FLOW   t=  0.4000:    2.811535288892e+03    1.9496911521
     FLOW   t=  0.5000:    1.924613685513e+03    1.3151596276 |	     FLOW   t=  0.5000:    1.924613685513e+03    1.5749985392
     FLOW   t=  0.6000:    1.444978736189e+03    1.0649340525 |	     FLOW   t=  0.6000:    1.444978736189e+03    1.3393545892
     FLOW   t=  0.7000:    1.156552124174e+03    8.9662220256 |	     FLOW   t=  0.7000:    1.156552124174e+03    1.1811114858
     FLOW   t=  0.8000:    9.682350899019e+02    7.7729136056 |	     FLOW   t=  0.8000:    9.682350899019e+02    1.0689857916
     FLOW   t=  0.9000:    8.372170943376e+02    6.8900235097 |	     FLOW   t=  0.9000:    8.372170943376e+02    9.8602053967
     FLOW   t=  1.0000:    7.414584807706e+02    6.2138481493 |	     FLOW   t=  1.0000:    7.414584807706e+02    9.2245194260
 # Plaquette: 1.770069921558924				      |	 # Plaquette: 1.6244204140351874

As we can see, all packages have the same versions (even the CUDA artifacts). The only difference seems to be that the interactive partition has an Intel CPU (Skylake) whereas the production partition has an AMD EPYC, which translates into a different LLVM host target (skylake-avx512 vs. znver2).

Summarizing:

  • Julia v1.5.4 with CUDA v2: correct results
  • Julia v1.6/1.7 with CUDA v3: wrong results with libLLVM-12.0.1 (ORCJIT, znver2), correct results with libLLVM-12.0.1 (ORCJIT, skylake-avx512) (see the check sketched below)
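
The LLVM host target of a given Julia session can be checked with standard calls:

    using InteractiveUtils
    versioninfo()          # prints, e.g., "LLVM: libLLVM-12.0.1 (ORCJIT, znver2)"
    println(Sys.CPU_NAME)  # the CPU target Julia selected, e.g. "znver2" or "skylake-avx512"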

Just in case: correct/incorrect is judged by comparing the result of the computation (up to machine precision) against a reference C implementation. The differences are large and can be seen in the output, in the lines:

# Plaquette: 1.770069921558924				      |	 # Plaquette: 1.6244204140351874
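
For illustration, the comparison amounts to something like this (variable names are illustrative; the values are the ones from the output above):

    plaq_gpu = 1.6244204140351874   # plaquette from the production (EPYC/A100) run
    plaq_ref = 1.770069921558924    # plaquette from the reference C implementation
    isapprox(plaq_gpu, plaq_ref; rtol=1e-12)   # false: far beyond machine precision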

I am pretty convinced that this is a bug… Does this help in any way to narrow it down?

Not really. You’ll have to try and isolate the actual computations that go wrong.

Just to say that with the latest version (CUDA 3.10), this problem has disappeared.
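
(For anyone else hitting this, updating CUDA.jl to the current release should be enough, e.g.:)

    using Pkg
    Pkg.update("CUDA")   # brings CUDA.jl up to the latest release (3.10 at the time of writing)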

Many thanks!

A.