Hi again,
I had been producing results with the old CUDA version until now (I was in a rush, and nothing seemed to help). I decided to try again, and I have found something interesting.
In our local cluster there are two partitions: one for production runs (V100, A100) and one small interactive partition for tests. I have found that with the latest Julia version (1.7.2) and CUDA.jl v3.3.3, the code runs fine (no wrong results) on the interactive partition. On the production partition, however, the code runs but produces wrong results.
Here is an sdiff of both runs:
Julia Version 1.7.2 Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC) Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info: Platform Info:
OS: Linux (x86_64-pc-linux-gnu) OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz | CPU: AMD EPYC 7532 32-Core Processor
WORD_SIZE: 64 WORD_SIZE: 64
LIBM: libopenlibm LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512) | LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
Environment: Environment:
JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495 JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495
JULIA_PATH = /usr/local/julia JULIA_PATH = /usr/local/julia
JULIA_DEPOT_PATH = /lustre/alberto/julia_packages JULIA_DEPOT_PATH = /lustre/alberto/julia_packages
JULIA_VERSION = 1.7.2 JULIA_VERSION = 1.7.2
Status `/lustre/alberto/code/test_latticegpu.jl/Project Status `/lustre/alberto/code/test_latticegpu.jl/Project
[5e92007d] ADerrors v0.1.0 `https://gitlab.ift.uam-csic.es/ [5e92007d] ADerrors v0.1.0 `https://gitlab.ift.uam-csic.es/
[c7e460c6] ArgParse v1.1.4 [c7e460c6] ArgParse v1.1.4
[375f315e] BDIO v0.1.0 `https://gitlab.ift.uam-csic.es/albe [375f315e] BDIO v0.1.0 `https://gitlab.ift.uam-csic.es/albe
[052768ef] CUDA v3.3.3 [052768ef] CUDA v3.3.3
[944b1d66] CodecZlib v0.7.0 [944b1d66] CodecZlib v0.7.0
[958c3683] LatticeGPU v0.1.0 `https://igit.ific.uv.es/alram [958c3683] LatticeGPU v0.1.0 `https://igit.ific.uv.es/alram
[91a5bcdd] Plots v1.23.5 [91a5bcdd] Plots v1.23.5
[a759f4b9] TimerOutputs v0.5.13 [a759f4b9] TimerOutputs v0.5.13
[3bb67fe8] TranscodingStreams v0.9.6 [3bb67fe8] TranscodingStreams v0.9.6
[b77e0a4c] InteractiveUtils [b77e0a4c] InteractiveUtils
[44cfe95a] Pkg [44cfe95a] Pkg
[de0858da] Printf [de0858da] Printf
[9a3f8284] Random [9a3f8284] Random
[fa267f1f] TOML [fa267f1f] TOML
CUDA toolkit 11.3.1, artifact installation CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.5.0 CUDA driver 11.5.0
NVIDIA driver 495.29.5 NVIDIA driver 495.29.5
Libraries: Libraries:
- CUBLAS: 11.5.1 - CUBLAS: 11.5.1
- CURAND: 10.2.4 - CURAND: 10.2.4
- CUFFT: 10.4.2 - CUFFT: 10.4.2
- CUSOLVER: 11.1.2 - CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0 - CUSPARSE: 11.6.0
- CUPTI: 14.0.0 - CUPTI: 14.0.0
- NVML: 11.0.0+495.29.5 - NVML: 11.0.0+495.29.5
- CUDNN: 8.20.0 (for CUDA 11.3.0) - CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0) - CUTENSOR: 1.3.0 (for CUDA 11.2.0)
Toolchain: Toolchain:
- Julia: 1.7.2 - Julia: 1.7.2
- LLVM: 12.0.1 - LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6. - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_5 - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_5
1 device: 1 device:
0: Tesla P100-PCIE-12GB (sm_60, 11.910 GiB / 11.912 GiB ava | 0: NVIDIA A100-PCIE-40GB (sm_80, 39.583 GiB / 39.586 GiB av
## Analizing configuration: /lustre/alberto/code/test_lattic ## Analizing configuration: /lustre/alberto/code/test_lattic
Lattice dimensions: 4 Lattice dimensions: 4
Lattice size: 8 x 8 x 8 x 8 Lattice size: 8 x 8 x 8 x 8
Time boundary conditions: PERIODIC Time boundary conditions: PERIODIC
Thread block size: 4 x 4 x 4 x 4 [256] (Number of Thread block size: 4 x 4 x 4 x 4 [256] (Number of
Twist tensor: (0, 0, 0, 0, 0, 0) Twist tensor: (0, 0, 0, 0, 0, 0)
# [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (pla # [import_cern64] Read from conf file: Int32[8, 8, 8, 8] (pla
Group: SU3{Float64} Group: SU3{Float64}
- beta: 6.7 - beta: 6.7
- c0: 1.0 - c0: 1.0
- cG: (0.0, 0.0) - cG: (0.0, 0.0)
## ##
# Plaquette: 1.770069921558924 | # Plaquette: 1.6244204140351874
## ##
WILSON flow integrator WILSON flow integrator
* Two stage scheme. Coefficients: * Two stage scheme. Coefficients:
stg 1: -0.4722222222222222 0.8888888888888888 stg 1: -0.4722222222222222 0.8888888888888888
stg 2: -1.0 0.75 stg 2: -1.0 0.75
* Fixed step size parameters: eps = 0.01 * Fixed step size parameters: eps = 0.01
* Adaptive step size parameters: tol = 1.0e-7 * Adaptive step size parameters: tol = 1.0e-7
- max eps: 0.1 - max eps: 0.1
- initial eps: 0.005 - initial eps: 0.005
- safety scale: 0.9 - safety scale: 0.9
FLOW t= 0.0000: 6.045352321554e+04 8.3610520410 | FLOW t= 0.0000: 6.045352321554e+04 8.3807340866
FLOW t= 0.1000: 2.672729605197e+04 6.2484621765 | FLOW t= 0.1000: 2.672729605197e+04 6.3127741601
FLOW t= 0.2000: 1.221783495047e+04 4.1479042540 | FLOW t= 0.2000: 1.221783495047e+04 4.2806249807
FLOW t= 0.3000: 6.293795491842e+03 2.7906809002 | FLOW t= 0.3000: 6.293795491842e+03 2.9780991633
FLOW t= 0.4000: 3.696042514745e+03 1.9883567766 | FLOW t= 0.4000: 3.696042514745e+03 2.2132290274
FLOW t= 0.5000: 2.435031662979e+03 1.5047723103 | FLOW t= 0.5000: 2.435031662979e+03 1.7546378274
FLOW t= 0.6000: 1.757791223073e+03 1.1983335066 | FLOW t= 0.6000: 1.757791223073e+03 1.4651451702
FLOW t= 0.7000: 1.359990889456e+03 9.9351194056 | FLOW t= 0.7000: 1.359990889456e+03 1.2721325098
FLOW t= 0.8000: 1.107906613390e+03 8.4986022984 | FLOW t= 0.8000: 1.107906613390e+03 1.1369505321
FLOW t= 0.9000: 9.377549064903e+02 7.4486486487 | FLOW t= 0.9000: 9.377549064903e+02 1.0382003125
FLOW t= 1.0000: 8.167956501505e+02 6.6541140425 | FLOW t= 1.0000: 8.167956501505e+02 9.6347265001
ZEUTHEN flow integrator ZEUTHEN flow integrator
* Two stage scheme. Coefficients: * Two stage scheme. Coefficients:
stg 1: -0.4722222222222222 0.8888888888888888 stg 1: -0.4722222222222222 0.8888888888888888
stg 2: -1.0 0.75 stg 2: -1.0 0.75
* Fixed step size parameters: eps = 0.01 * Fixed step size parameters: eps = 0.01
* Adaptive step size parameters: tol = 1.0e-7 * Adaptive step size parameters: tol = 1.0e-7
- max eps: 0.1 - max eps: 0.1
- initial eps: 0.005 - initial eps: 0.005
- safety scale: 0.9 - safety scale: 0.9
FLOW t= 0.0000: 6.045352321554e+04 8.3610520410 | FLOW t= 0.0000: 6.045352321554e+04 8.3807340866
FLOW t= 0.1000: 2.240951045804e+04 5.8074230709 | FLOW t= 0.1000: 2.240951045804e+04 5.8800826658
FLOW t= 0.2000: 9.301649676976e+03 3.6198040811 | FLOW t= 0.2000: 9.301649676976e+03 3.7691403564
FLOW t= 0.3000: 4.685618161693e+03 2.3886191584 | FLOW t= 0.3000: 4.685618161693e+03 2.5921602145
FLOW t= 0.4000: 2.811535288892e+03 1.7118129634 | FLOW t= 0.4000: 2.811535288892e+03 1.9496911521
FLOW t= 0.5000: 1.924613685513e+03 1.3151596276 | FLOW t= 0.5000: 1.924613685513e+03 1.5749985392
FLOW t= 0.6000: 1.444978736189e+03 1.0649340525 | FLOW t= 0.6000: 1.444978736189e+03 1.3393545892
FLOW t= 0.7000: 1.156552124174e+03 8.9662220256 | FLOW t= 0.7000: 1.156552124174e+03 1.1811114858
FLOW t= 0.8000: 9.682350899019e+02 7.7729136056 | FLOW t= 0.8000: 9.682350899019e+02 1.0689857916
FLOW t= 0.9000: 8.372170943376e+02 6.8900235097 | FLOW t= 0.9000: 8.372170943376e+02 9.8602053967
FLOW t= 1.0000: 7.414584807706e+02 6.2138481493 | FLOW t= 1.0000: 7.414584807706e+02 9.2245194260
# Plaquette: 1.770069921558924 | # Plaquette: 1.6244204140351874
As we can see, all package versions are identical (even the CUDA artifacts). The only difference seems to be the CPU: the interactive partition has Intel (Skylake) CPUs, while the production partition has AMD EPYC. This translates into a different LLVM target (skylake-avx512 vs. znver2), even though the LLVM version itself is the same.
Summarizing:
- Julia v1.5.4 with CUDA.jl v2: correct results
- Julia v1.6/1.7 with CUDA.jl v3: wrong results with LLVM: libLLVM-12.0.1 (ORCJIT, znver2); correct results with libLLVM-12.0.1 (ORCJIT, skylake-avx512)
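If the host-side native code generation is the culprit, one experiment that might narrow it down (a sketch, not something I have run yet; `main.jl` is a placeholder for the actual entry script) is to re-run on the EPYC nodes while forcing a generic CPU target, and to confirm what Julia detects on each partition:

```julia
# Inspect the CPU microarchitecture Julia detected on this node.
# The two partitions should differ here (skylake-avx512 vs. znver2).
@show Sys.CPU_NAME

# To force a generic target instead of the detected one, relaunch with:
#   julia --cpu-target=generic main.jl
# If the results become correct on the EPYC nodes with a generic target,
# that points at znver2-specific code generation.
```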
Just in case: correct/incorrect is judged by comparing the results of the computation (which should agree up to machine precision) with a reference C implementation. The differences are large and can be seen directly in the output, e.g. in the lines:
# Plaquette: 1.770069921558924 | # Plaquette: 1.6244204140351874
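To make the size of the discrepancy explicit, here is a quick check on the two plaquette values from the diff above. Agreement "up to machine precision" for Float64 would mean a relative deviation of order eps(Float64) ≈ 2.2e-16; the actual deviation is about 8%, far beyond any rounding noise:

```julia
ref = 1.770069921558924     # Intel node, matches the reference C code
bad = 1.6244204140351874    # EPYC node, wrong result

# Relative deviation between the two runs.
rel = abs(bad - ref) / abs(ref)
println(rel)                              # ≈ 0.082, i.e. an 8% discrepancy

println(isapprox(bad, ref; rtol=1e-12))   # false: nowhere near machine precision
```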
I am pretty convinced that this is a bug… Does this help in any way to narrow it down?