CUDA on Julia 1.7 giving wrong results

Hi all,

I have a fairly complex code that has worked well with julia 1.5 - 1.6. I have tried to run this with the latest julia 1.7, and the code runs, but produces wrong results. This is the list of packages used, in case anyone sees something wrong:

     Project LatticeGPU v0.1.0
      Status `~/code/latticegpu.jl/Manifest.toml`
  [a4c015fc] ANSIColoredPrinters v0.0.1
  [621f4979] AbstractFFTs v1.0.1
  [79e6a3ab] Adapt v3.3.1
  [375f315e] BDIO v0.1.0
  [ab4f0b2a] BFloat16s v0.1.0
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v3.3.3
  [d360d2e6] ChainRulesCore v0.10.11
  [34da2185] Compat v3.31.0
  [864edb3b] DataStructures v0.18.9
  [ffbed154] DocStringExtensions v0.8.5
  [e30172f5] Documenter v0.27.10
  [e2ba6199] ExprTools v0.1.6
  [0c68f7d7] GPUArrays v7.0.1
  [61eb1bfa] GPUCompiler v0.12.5
  [b5f81e59] IOCapture v0.2.2
  [692b3bcd] JLLWrappers v1.3.0
  [682c06a0] JSON v0.21.2
  [929cbde3] LLVM v4.1.0
  [2ab3a3ac] LogExpFunctions v0.2.5
  [49dea1ee] Nettle v0.5.1
  [bac558e1] OrderedCollections v1.4.1
  [69de0a69] Parsers v2.1.3
  [21216c6a] Preferences v1.2.2
  [74087812] Random123 v1.4.2
  [e6cf234a] RandomNumbers v1.4.0
  [189a3867] Reexport v1.1.0
  [ae029012] Requires v1.1.3
  [276daf66] SpecialFunctions v1.5.1
  [a759f4b9] TimerOutputs v0.5.13
  [dad2f222] LLVMExtra_jll v0.0.6+0
  [4c82536e] Nettle_jll v3.7.2+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML
  [a4e569a6] Tar
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll
  [781609d7] GMP_jll
  [deac9b47] LibCURL_jll
  [29816b5a] LibSSH2_jll
  [c8ffd9c3] MbedTLS_jll
  [14a3606d] MozillaCACerts_jll
  [83775a58] Zlib_jll
  [8e850ede] nghttp2_jll
  [3f19e933] p7zip_jll

I would appreciate any help on how to debug such a problem, or if someone has observed anything similar.


Per Please read: make it easier to help you, there’s basically nothing we can do without a (relatively small) MWE.

I would love to have an MWE. Unfortunately I only have thousands of lines of code that were producing correct results (checked to machine precision against a C code) with julia 1.5-1.6. When using Julia 1.7 the same code/environment produces completely wrong results without any warning/error.

I understand that with this information it is impossible to solve the issue. The hope was that julia 1.7 is actually incompatible with some CUDA versions, or that someone else has observed something similar. If this is not the case, I would really need to bisect the code to pin down the line that produces the wrong results in julia 1.7.
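One way to bisect without an MWE is to dump intermediate arrays at checkpoints under both Julia versions and diff them; the first checkpoint that diverges localizes the faulty kernel. A minimal sketch (the `checkpoint`/`compare` helpers and the file naming are hypothetical, not part of the original code):

```julia
using CUDA

# Dump a GPU array at a named checkpoint, so runs under Julia 1.6 and 1.7
# can be compared offline.
function checkpoint(name::String, x::CuArray)
    write("checkpoint_$(name).bin", Array(x))  # copy device data to host, then to disk
    return nothing
end

# Compare the dumps from the two runs; the first checkpoint with a large
# difference points at the code between it and the previous checkpoint.
function compare(name::String, dir16::String, dir17::String, ::Type{T}, dims) where {T}
    a = reshape(reinterpret(T, read(joinpath(dir16, "checkpoint_$(name).bin"))), dims)
    b = reshape(reinterpret(T, read(joinpath(dir17, "checkpoint_$(name).bin"))), dims)
    return maximum(abs.(a .- b))
end
```

Sprinkling `checkpoint` calls at coarse granularity first, then refining around the first mismatch, keeps the number of runs logarithmic in the code size.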

Many thanks

I haven’t seen anything here or on the CUDA.jl issues, but perhaps with time someone will. One thing you could check is whether any artifact versions have changed. If an error is in a particular (non-Julia) CUDA library, rolling those back to the versions you had with 1.6 might help isolate things. Alternatively, you could try creating another env with the latest version of all the CUDA deps (v3.3 is probably out of date now) and see if that works.
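The second suggestion can be tried without touching the known-good environment; a minimal sketch using a scratch environment (the pinned version below is illustrative, taken from the status listing above, not a known-good set):

```julia
using Pkg

# Throwaway environment so the working Manifest stays untouched.
Pkg.activate(mktempdir())

# Either pull in the latest CUDA.jl release...
Pkg.add("CUDA")

# ...or pin it back to the version from the working setup:
# Pkg.add(name="CUDA", version="3.3.3")

Pkg.status()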

Many thanks,

I use exactly the same versions of all the packages with both julia versions (via the package manager and instantiate). The machine is the same. The only difference is whether I use julia 1.7.1 or julia 1.6.5 (both downloaded from the julialang webpage). This should also guarantee that the same artifact versions are used, right?

I am stuck on CUDA v3.3 because of a breaking change in later versions (see Problem with CUDAv3 - #9 by shiroghost). In any case, it is worrying that exactly the same packages give different results under different julia versions (without random numbers involved).

I would probably have to chase this down the rabbit hole…

Maybe, but since CUDA.jl will lazily download some artifacts based on your locally detected CUDA toolchain, it wouldn’t hurt to check. If the versions listed in CUDA.versioninfo() are the same across 1.6 and 1.7, I’d assume nothing has changed.
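Concretely, that check could be run from the shell and diffed; the `julia-1.6` / `julia-1.7` binary names below are placeholders for wherever the two installs live:

```shell
# Capture the detected CUDA toolkit/driver/library versions under each release,
# using the same project environment for both.
julia-1.6 --project -e 'using CUDA; CUDA.versioninfo()' > cuda-1.6.txt
julia-1.7 --project -e 'using CUDA; CUDA.versioninfo()' > cuda-1.7.txt

# Any difference here points at a changed artifact rather than the Julia code.
diff cuda-1.6.txt cuda-1.7.txt
```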

I don’t understand the issue well enough to help, but given that new Julia versions often bring new LLVM versions as well, it could well be a suspect. Have you tried following Tim’s recommendations in the latest comment there?