CUDA.jl tests on Windows

Trying to migrate to CUDA.jl. However, I can’t get the tests to pass (neither with the artifact download nor with a local installation). I'm not 100% sure if this is a bug in CUDA.jl or if I missed something. Any idea if the tests are meant to pass on Windows? (I understand the main dev environment for this library is Linux.)

I created an empty environment with only the CUDA and Distributed packages.
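
Roughly like this (a sketch from memory, not the exact commands I ran):

julia> using Pkg

julia> Pkg.activate(mktempdir())         # fresh, empty environment

julia> Pkg.add(["CUDA", "Distributed"])  # the only two packages

julia> Pkg.test("CUDA")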

Test with artifacts:

julia> Pkg.test("CUDA")
    Testing CUDA
Status `C:\Users\andre\AppData\Local\Temp\jl_OtYeSM\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.18
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.16
  [476501e8] SLEEFPirates v0.5.4
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.24
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets 
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Warning: NNPACK not available for your platform: Windows(x86_64-w64-mingw32-libgfortran5-cxx11)
│         You will be able to use only the default Julia NNlib backend
└ @ NNlib C:\Users\andre\.julia\packages\NNlib\sSn9M\src\NNlib.jl:14
┌ Debug: Initializing CUDA driver
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:91
┌ Debug: Trying to use artifacts...
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:121
┌ Debug: Selecting artifacts based on driver version 11.0.0
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:130
ERROR: LoadError: AssertionError: isfile(__libcupti[])
Stacktrace:
 [1] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:157
 [2] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:314
 [3] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:112
 [4] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:32
 [5] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
 [6] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:26
 [7] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:19
 [8] macro expansion at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:50 [inlined]
 [9] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:32
 [10] top-level scope at C:\Users\andre\.julia\packages\CUDA\h38pe\test\runtests.jl:112
 [11] include(::String) at .\client.jl:439
 [12] top-level scope at none:6
in expression starting at C:\Users\andre\.julia\packages\CUDA\h38pe\test\runtests.jl:112
ERROR: Package CUDA errored during testing
Stacktrace:
 [1] pkgerror(::String, ::Vararg{String,N} where N) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Types.jl:53   
 [2] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Operations.jl:1510
 [3] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:316
 [4] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:303
 [5] #test#68 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:297 [inlined]
 [6] test at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:297 [inlined]
 [7] #test#67 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:296 [inlined]
 [8] test at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:296 [inlined]
 [9] test(::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:295
 [10] test(::String) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:295
 [11] top-level scope at none:0


OK. Maybe there is something going on with the artifact download, so I tried using the libraries that come with the local CUDA installation from NVIDIA. Here is what I get:
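
As far as I understand, the switch to the local toolkit is the JULIA_CUDA_USE_BINARYBUILDER variable that also appears in the environment section of the log below, set before the test process loads CUDA.jl:

julia> ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"  # prefer the local NVIDIA installation over artifacts

julia> using Pkg; Pkg.test("CUDA")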

Status `C:\Users\andre\AppData\Local\Temp\jl_zbZb7L\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2     
  [b552c78f] DiffRules v1.0.1       
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.18
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.16
  [476501e8] SLEEFPirates v0.5.4
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.24
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown 
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, local installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: missing
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ - JULIA_CUDA_USE_BINARYBUILDER: false
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                    (2) |         failed at 2020-07-20T15:51:54.047
apiutils                          (3) |         failed at 2020-07-20T15:51:54.253
codegen                           (6) |         failed at 2020-07-20T15:52:10.367
curand                            (9) |         failed at 2020-07-20T15:52:14.689
iterator                         (15) |         failed at 2020-07-20T15:52:22.522
broadcast                         (5) |         failed at 2020-07-20T15:52:30.061

and with debug logging enabled I get a much larger printout than Discourse allows, so here: https://pastebin.com/embed_js/gdFEmLXZ

Should be fixed on CUDA master; it would be useful if you could verify that (it also requires GPUArrays#master).
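
Something like this should get you there (standard Pkg commands):

julia> using Pkg

julia> Pkg.add(PackageSpec(name="CUDA", rev="master"))

julia> Pkg.add(PackageSpec(name="GPUArrays", rev="master"))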

With a local installation I now get:

┌ Error: Exception while generating log record in module CUDA at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96
│   exception =
│    MethodError: no method matching iterate(::typeof(names))
│    Closest candidates are:
│      iterate(::Core.SimpleVector) at essentials.jl:603
│      iterate(::Core.SimpleVector, ::Any) at essentials.jl:603
│      iterate(::ExponentialBackOff) at error.jl:253
│      ...
│    Stacktrace:
│     [1] join(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Function, ::String) at .\strings\io.jl:294
│     [2] sprint(::Function, ::Function, ::Vararg{Any,N} where N; context::Nothing, sizehint::Int64) at .\strings\io.jl:105
│     [3] sprint(::Function, ::Function, ::Vararg{Any,N} where N) at .\strings\io.jl:101
│     [4] join(::Function, ::String) at .\strings\io.jl:301
│     [5] macro expansion at .\logging.jl:322 [inlined]
│     [6] find_binary(::String; locations::Array{String,1}) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96
│     [7] find_cuda_binary at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:224 [inlined]
│     [8] use_local_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:189
│     [9] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:313
│     [10] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [11] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [12] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [13] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [14] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [15] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [16] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [17] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [18] include(::String) at .\client.jl:439
│     [19] top-level scope at none:6
│     [20] eval(::Module, ::Any) at .\boot.jl:331
│     [21] exec_options(::Base.JLOptions) at .\client.jl:264
│     [22] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96

Double-checking that CUDA and GPUArrays are on #master:

  [621f4979] AbstractFFTs v0.5.0   
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 #master (https://github.com/JuliaGPU/CUDA.jl)
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25

With artifacts, I first get this:

┌ Debug: Selecting artifacts based on driver version 11.0.0
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:138
┌ Debug: CUDA toolkit identified as 11.0.194
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:322
┌ Debug: Using CUDA 11.0.194 from an artifact at C:\Users\andre\.julia\artifacts\bbabcb11fa2760ee54b821c074b8b77699c260b3
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:177
┌ Debug: Could not use CUDNN from artifacts
│   exception =
│    KeyError: key v"11.0.0" not found
│    Stacktrace:
│     [1] getindex at .\dict.jl:477 [inlined]
│     [2] use_artifact_cudnn(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:248
│     [3] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:179
│     [4] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [5] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [6] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [7] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [8] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [9] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [10] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [11] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [12] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [13] include(::String) at .\client.jl:439
│     [14] top-level scope at none:6
│     [15] eval(::Module, ::Any) at .\boot.jl:331
│     [16] exec_options(::Base.JLOptions) at .\client.jl:264
│     [17] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:250
┌ Debug: Could not use CUTENSOR from artifacts
│   exception =
│    Cannot locate artifact 'CUTENSOR_CUDA110' in 'C:\Users\andre\.julia\packages\CUDA\B3eVM\Artifacts.toml'
│    Stacktrace:
│     [1] error(::String) at .\error.jl:33
│     [2] do_artifact_str(::String, ::Dict{String,Any}, ::String, ::Module) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Artifacts.jl:1019
│     [3] #invokelatest#1 at .\essentials.jl:712 [inlined]
│     [4] invokelatest at .\essentials.jl:711 [inlined]
│     [5] macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Artifacts.jl:1058 [inlined]
│     [6] (::CUDA.var"#672#675")() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:272
│     [7] use_artifact_cutensor(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:279
│     [8] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:180
│     [9] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [10] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [11] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [12] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [13] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [14] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [15] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [16] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [17] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [18] include(::String) at .\client.jl:439
│     [19] top-level scope at none:6
│     [20] eval(::Module, ::Any) at .\boot.jl:331
│     [21] exec_options(::Base.JLOptions) at .\client.jl:264
│     [22] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:281
┌ Debug: Toolchain with LLVM 8.0.1, CUDA driver 11.0.0 and toolkit 11.0.194 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2 and 7.5; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1 and 6.3
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\compatibility.jl:239
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma

and then more:

┌ Debug: Could not use CUDNN from artifacts
│   exception =
│    KeyError: key v"11.0.0" not found
│    Stacktrace:
│     [1] getindex at .\dict.jl:477 [inlined]
│     [2] use_artifact_cudnn(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:248
│     [3] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:179
│     [4] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [5] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [6] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [7] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [8] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [9] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [10] libcuda at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [11] (::CUDA.var"#714#cache_fptr!#47")() at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:31
│     [12] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:39 [inlined]
│     [13] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\libcuda.jl:149 [inlined]
│     [14] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\error.jl:108 [inlined]
│     [15] cuCtxGetCurrent(::Base.RefValue{Ptr{Nothing}}) at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:93
│     [16] CuCurrentContext at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\context.jl:81 [inlined]
│     [17] initialize_thread(::Int64) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\state.jl:51
│     [18] prepare_cuda_call at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\state.jl:34 [inlined]
│     [19] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\pool.jl:432 [inlined]
│     [20] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\setup.jl:46
│     [21] eval at .\boot.jl:331 [inlined]
│     [22] runtests(::Function, ::String, ::Bool, ::Nothing) at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\setup.jl:56
│     [23] (::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}})() at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294
│     [24] run_work_thunk(::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}}, ::Bool) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:79
│     [25] macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294 [inlined]
│     [26] (::Distributed.var"#103#105"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at .\task.jl:358
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:250

Ah yes, a typo in the debug printing. Fixed!

This is expected, and only a debug message. There wasn’t a CUDNN for CUDA 11 last I checked (only a release candidate).

Now the local installation fails for no obvious (to me) reason:

downloaded artifacts:

There are no errors in those logs, and your toolchain is correctly discovered. Try calling CUDA.versioninfo(). At the end of the logs there are test failures, but without a stack trace, so it’s unclear what the cause is (did you press Ctrl+C?)
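
That is, just:

julia> using CUDA

julia> CUDA.versioninfo()  # prints the same "System information" block as in the test logs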

Sorry about that. Yes, I pressed Ctrl+C after the first test-failure messages. So now I ran it fully.

I get a lot of these:

Status `C:\Users\andre\AppData\Local\Temp\jl_7lNIMd\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 #master (https://github.com/JuliaGPU/CUDA.jl)
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test 
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |  
initialization                    (2) |         failed at 2020-07-24T13:27:34.893
apiutils                          (3) |         failed at 2020-07-24T13:27:35.226
codegen                           (6) |         failed at 2020-07-24T13:27:53.291
curand                            (9) |         failed at 2020-07-24T13:27:57.349
iterator                         (15) |         failed at 2020-07-24T13:28:04.485
cufft                             (8) |         failed at 2020-07-24T13:28:13.621
broadcast                         (5) |         failed at 2020-07-24T13:28:14.394
memory                           (16) |         failed at 2020-07-24T13:28:20.196
nvml                             (18) |         failed at 2020-07-24T13:28:30.279
cusparse                         (11) |         failed at 2020-07-24T13:28:31.791
nnlib                            (17) |         failed at 2020-07-24T13:28:37.649
nvtx                             (19) |         failed at 2020-07-24T13:28:39.256
pointer                          (20) |         failed at 2020-07-24T13:28:40.567
array                             (4) |         failed at 2020-07-24T13:28:47.409
cublas                            (7) |         failed at 2020-07-24T13:28:50.406
forwarddiff                      (14) |         failed at 2020-07-24T13:28:57.716
cusolver                         (10) |         failed at 2020-07-24T13:28:59.912
utils                            (24) |         failed at 2020-07-24T13:29:05.855
cudadrv\context                  (25) |         failed at 2020-07-24T13:29:06.031
cudadrv\devices                  (26) |         failed at 2020-07-24T13:29:06.953
threading                        (23) |         failed at 2020-07-24T13:29:14.078
cudadrv\errors                   (27) |         failed at 2020-07-24T13:29:14.339
statistics                       (21) |         failed at 2020-07-24T13:29:15.874
cudadrv\events                   (28) |         failed at 2020-07-24T13:29:16.084
cudadrv\execution                (29) |         failed at 2020-07-24T13:29:26.568
cudadrv\memory                   (30) |         failed at 2020-07-24T13:29:27.926
cudadrv\occupancy                (32) |         failed at 2020-07-24T13:29:32.272
cudadrv\module                   (31) |         failed at 2020-07-24T13:29:32.563
cudadrv\profile                  (33) |         failed at 2020-07-24T13:29:33.742
texture                          (22) |         failed at 2020-07-24T13:29:35.591
cudadrv\stream                   (34) |         failed at 2020-07-24T13:29:39.586
cudadrv\version                  (35) |         failed at 2020-07-24T13:29:40.345
device\array                     (37) |         failed at 2020-07-24T13:29:58.933
examples                         (12) |         failed at 2020-07-24T13:29:59.271
gpuarrays/input output           (42) |         failed at 2020-07-24T13:30:01.632
device\pointer                   (39) |         failed at 2020-07-24T13:30:11.444
cusolver\cusparse                (36) |         failed at 2020-07-24T13:30:13.575
gpuarrays/math                   (41) |         failed at 2020-07-24T13:30:14.664
execution                        (13) |         failed at 2020-07-24T13:30:17.007
gpuarrays/interface              (44) |         failed at 2020-07-24T13:30:19.197
gpuarrays/value constructors     (43) |         failed at 2020-07-24T13:30:23.082
gpuarrays/indexing               (40) |         failed at 2020-07-24T13:30:30.843
gpuarrays/conversions            (48) |         failed at 2020-07-24T13:30:33.976
gpuarrays/iterator constructors  (45) |         failed at 2020-07-24T13:30:38.571
gpuarrays/constructors           (50) |         failed at 2020-07-24T13:30:41.724
gpuarrays/uniformscaling         (46) |         failed at 2020-07-24T13:30:45.063
gpuarrays/fft                    (49) |         failed at 2020-07-24T13:30:45.665
gpuarrays/base                   (52) |         failed at 2020-07-24T13:31:12.59
gpuarrays/random                 (51) |         failed at 2020-07-24T13:31:14.822
device\intrinsics                (38) |         failed at 2020-07-24T13:31:16.072
gpuarrays/linear algebra         (47) |         failed at 2020-07-24T13:31:33.447
gpuarrays/broadcasting           (54) |         failed at 2020-07-24T13:31:57.442
gpuarrays/mapreduce essentials   (53) |         failed at 2020-07-24T13:32:22.903
gpuarrays/mapreduce (old tests)  (56) |         failed at 2020-07-24T13:32:40.311
gpuarrays/mapreduce derivatives  (55) |         failed at 2020-07-24T13:33:17.178
initialization: Error During Test at none:1
  Test threw exception
  Expression: initialization
  On worker 2:
  InexactError: check_top_bit(Int64, 18446744073709551615)
  throw_inexacterror at .\boot.jl:557
  check_top_bit at .\boot.jl:571 [inlined]
  toInt64 at .\boot.jl:632 [inlined]
  Int64 at .\boot.jl:707 [inlined]
  #448 at .\none:0 [inlined]
  iterate at .\generator.jl:47 [inlined]
  _all at .\reduce.jl:819 [inlined]
  all at .\reduce.jl:815 [inlined]
  Dict at .\dict.jl:130
  compute_processes at C:\Users\andre\.julia\packages\CUDA\Wfh4d\lib\nvml\device.jl:123 [inlined]
  runtests at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\setup.jl:64
  #104 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294
  run_work_thunk at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:79       
  macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294 [inlined]
  #103 at .\task.jl:358

I also get this type of failure:

Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\packages\\CUDA\\Wfh4d\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\packages\\CUDA\\Wfh4d\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 381362496 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1031050640 to %jl_value_t*) 
to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ 
div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:                                             ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; 
││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\runtests.jl:442
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

then a lot of these:

Worker 13 failed running test execution:
Some tests did not pass: 72 passed, 8 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\runtests.jl:442
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

The actual result:

Test Summary:                     | Pass  Fail  Error  Total
  Overall                         |   89     9     53    151
    initialization                |                 1      1
    apiutils                      |                 1      1
    codegen                       |   17     1            18
    curand                        |                 1      1
    iterator                      |                 1      1
    cufft                         |                 1      1
    broadcast                     |                 1      1
    memory                        |                 1      1
    nvml                          |                 1      1
    cusparse                      |                 1      1
    nnlib                         |                 1      1
    nvtx                          |                 1      1
    pointer                       |                 1      1
    array                         |                 1      1
    cublas                        |                 1      1
    forwarddiff                   |                 1      1
    cusolver                      |                 1      1
    utils                         |                 1      1
    cudadrv\context               |                 1      1
    cudadrv\devices               |                 1      1
    threading                     |                 1      1
    cudadrv\errors                |                 1      1
    statistics                    |                 1      1
    cudadrv\events                |                 1      1
    cudadrv\execution             |                 1      1
    cudadrv\memory                |                 1      1
    cudadrv\occupancy             |                 1      1
    cudadrv\module                |                 1      1
    cudadrv\profile               |                 1      1
    texture                       |                 1      1
    cudadrv\stream                |                 1      1
    cudadrv\version               |                 1      1
    device\array                  |                 1      1
    examples                      |                 1      1
    gpuarrays/input output        |                 1      1
    device\pointer                |                 1      1
    cusolver\cusparse             |                 1      1
    gpuarrays/math                |                 1      1
    execution                     |   72     8            80
    gpuarrays/interface           |                 1      1
    gpuarrays/value constructors  |                 1      1
    gpuarrays/indexing            |                 1      1
    gpuarrays/conversions         |                 1      1
    gpuarrays/iterator constructors |                 1      1
    gpuarrays/constructors        |                 1      1
    gpuarrays/uniformscaling      |                 1      1
    gpuarrays/fft                 |                 1      1
    gpuarrays/base                |                 1      1
    gpuarrays/random              |                 1      1
    device\intrinsics             |                 1      1
    gpuarrays/linear algebra      |                 1      1
    gpuarrays/broadcasting        |                 1      1
    gpuarrays/mapreduce essentials |                 1      1
    gpuarrays/mapreduce (old tests) |                 1      1
    gpuarrays/mapreduce derivatives |                 1      1

In case I missed something important, here is the whole thing:

Weirdly, the process IDs returned there aren’t valid. Could you try the following code:

julia> using CUDA

julia> CuArray([1]); # to initialize CUDA

julia> using .NVML

julia> count_ref = Ref{Cuint}(0);

julia> dev = first(NVML.devices())
CUDA.NVML.Device(Ptr{Nothing} @0x00007fde5d7aae38)

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, C_NULL)
ERROR: NVMLError: Insufficient Size (code 7)

julia> # NVML_ERROR_INSUFFICIENT_SIZE is expected here

julia> count_ref[]
0x00000003

julia> infos = Vector{NVML.nvmlProcessInfo_t}(undef, count_ref[]);

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, infos)

julia> infos
3-element Array{CUDA.NVML.nvmlProcessInfo_st,1}:
 CUDA.NVML.nvmlProcessInfo_st(0x0000cb97, 0x000000000cd00000)
 CUDA.NVML.nvmlProcessInfo_st(0x0000fe64, 0x0000000008f00000)
 CUDA.NVML.nvmlProcessInfo_st(0x00010c92, 0x0000000007500000)

Note that the failures are in the code that reports test output; the actual functionality is likely working perfectly.

julia> using CUDA

julia> CuArray([1]); # to initialize CUDA

julia> using .NVML

julia> count_ref = Ref{Cuint}(0);

julia> dev = first(NVML.devices())
CUDA.NVML.Device(Ptr{Nothing} @0x00007ffd42659c20)

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, C_NULL)

julia> count_ref[]
0x00000000

julia> infos = Vector{NVML.nvmlProcessInfo_t}(undef, count_ref[]);

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, infos)

julia> infos
0-element Array{CUDA.NVML.nvmlProcessInfo_st,1}

Looks like CUDA didn’t initialize. Can you add a @show infos in compute_processes before it errors on line 123 (and then run the tests)?
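
I.e., something like this in lib/nvml/device.jl (only the @show line is the addition; the surrounding comment is just a sketch of where it goes, not the actual source):

    # inside compute_processes, right before the Dict construction
    # that errors on line 123:
    @show infos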

All workers seem to report this:

From worker 2:    infos = CUDA.NVML.nvmlProcessInfo_st[CUDA.NVML.nvmlProcessInfo_st(0x00002fb0, 0xffffffffffffffff), 
CUDA.NVML.nvmlProcessInfo_st(0x00004e94, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x000084ec, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00002d28, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x0000808c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00009114, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x0000971c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00000fe8, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00003b30, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00004e98, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00007a1c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00002084, 0xffffffffffffffff)]

Aha, it’s not the PID but the usedGpuMemory that’s unavailable here. I’ll push a fix.

       unsigned long long usedGpuMemory
           Amount of used GPU memory in bytes. Under WDDM, NVML_VALUE_NOT_AVAILABLE is always
           reported because Windows KMD manages all the memory and not the NVIDIA driver.
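
The fix will be along these lines (a sketch of the idea; the constant name and the surrounding code are approximations, not the exact commit):

    # NVML_VALUE_NOT_AVAILABLE is 0xffffffffffffffff in the NVML headers
    const NVML_VALUE_NOT_AVAILABLE = typemax(Culonglong)

    # map unavailable memory readings to `missing` instead of converting
    # them to Int, which is what threw the InexactError under WDDM
    Dict(info.pid => (info.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE ?
                      missing : Int(info.usedGpuMemory))
         for info in infos)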

Yes, the above helps. Only some tests are failing now:

Status `C:\Users\andre\AppData\Local\Temp\jl_zJKTY5\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 [`C:\Src\cuda\cuda\dev\CUDA`]
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test 
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using 1 device(s): 2. GeForce RTX 2080 Ti (UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                    (2) |     3.65 |   0.00 |  0.0 |       0.00 |      N/A |   0.05 |  1.4 |     199.20 |   562.26 |
apiutils                          (3) |     0.81 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  3.5 |      85.70 |   552.87 |
iterator                          (3) |     2.71 |   0.27 | 10.0 |       1.25 |      N/A |   0.09 |  3.4 |     281.28 |   552.87 |
memory                            (3) |     1.53 |   0.00 |  0.0 |       0.00 |      N/A |   0.33 | 21.2 |     142.82 |   552.87 |
codegen                           (6) |         failed at 2020-07-24T20:12:39.053
nnlib                             (3) |    13.92 |   0.15 |  1.1 |       0.00 |      N/A |   0.59 |  4.2 |    1550.51 |   980.92 |
nvml                              (3) |     0.51 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.4 |      58.52 |   986.84 |
nvtx                              (3) |     0.49 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  6.8 |      63.11 |   990.04 |
pointer                           (3) |     0.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       7.06 |   990.66 |
curand                            (9) |    23.48 |   0.26 |  1.1 |       0.02 |      N/A |   1.18 |  5.0 |    2416.18 |   634.97 |
statistics                        (3) |    16.18 |   0.00 |  0.0 |       0.00 |      N/A |   0.81 |  5.0 |    1582.73 |  1046.86 |
utils                             (3) |     0.85 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.7 |     118.93 |  1046.86 |
cufft                             (8) |    38.56 |   0.27 |  0.7 |     133.23 |      N/A |   1.97 |  5.1 |    4257.42 |  1189.55 |
cudadrv\context                   (3) |     0.58 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.8 |      32.43 |  1099.50 |
cudadrv\devices                   (8) |     0.34 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      40.12 |  1189.55 |
broadcast                         (5) |    39.32 |   0.27 |  0.7 |       0.00 |      N/A |   1.56 |  4.0 |    3605.98 |   677.96 |
cudadrv\errors                    (3) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      21.96 |  1099.50 |
cudadrv\events                    (8) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      14.93 |  1189.55 |
cudadrv\module                    (8) |     0.37 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  3.2 |      46.07 |  1189.55 |
cudadrv\occupancy                 (8) |     0.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       8.55 |  1189.55 |
cudadrv\profile                   (8) |     0.30 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  4.8 |      49.13 |  1189.55 |
cudadrv\execution                 (5) |     1.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.05 |  5.4 |     108.96 |   677.96 |
cudadrv\version                   (5) |     0.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       0.08 |   677.96 |
cudadrv\stream                    (8) |     0.21 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  9.0 |      20.06 |  1189.55 |
cudadrv\memory                    (3) |     2.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  3.5 |     206.39 |  1099.50 |
device\array                      (8) |     3.58 |   0.00 |  0.0 |       0.00 |      N/A |   0.10 |  2.8 |     339.27 |  1194.57 |
texture                           (9) |    26.26 |   0.00 |  0.0 |       0.08 |      N/A |   0.92 |  3.5 |    2530.68 |   720.41 |
device\pointer                    (8) |     7.09 |   0.00 |  0.0 |       0.00 |      N/A |   0.20 |  2.9 |     568.32 |  1208.79 |
gpuarrays/math                    (8) |     2.93 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.2 |     306.95 |  1217.26 |
gpuarrays/input output            (8) |     1.19 |   0.00 |  0.0 |       0.00 |      N/A |   0.04 |  3.7 |     126.49 |  1217.26 |
cusparse                         (11) |    56.99 |   0.32 |  0.6 |       4.46 |      N/A |   2.41 |  4.2 |    6253.25 |   967.00 |
threading                        (14) |    16.95 |   4.44 | 26.2 |      56.25 |      N/A |   0.83 |  4.9 |    1801.75 |  1274.68 |
forwarddiff                       (2) |    58.55 |   0.30 |  0.5 |       0.00 |      N/A |   1.32 |  2.3 |    3334.10 |   737.97 |
gpuarrays/interface              (11) |     3.34 |   0.00 |  0.0 |       0.00 |      N/A |   0.11 |  3.4 |     253.39 |   975.28 |
cusolver\cusparse                 (5) |    19.97 |   0.00 |  0.0 |       0.19 |      N/A |   0.84 |  4.2 |    2063.14 |  1441.06 |
gpuarrays/value constructors      (8) |     7.89 |   0.00 |  0.0 |       0.00 |      N/A |   0.17 |  2.2 |     573.97 |  1238.47 |
gpuarrays/fft                     (8) |     1.24 |   0.00 |  0.1 |       6.01 |      N/A |   0.04 |  3.3 |     133.59 |  1247.52 |
gpuarrays/conversions             (5) |     4.46 |   0.00 |  0.0 |       0.01 |      N/A |   0.19 |  4.3 |     626.51 |  1448.38 |
gpuarrays/constructors            (8) |     1.42 |   0.00 |  0.2 |       0.04 |      N/A |   0.01 |  0.9 |      76.52 |  1249.82 |
gpuarrays/indexing                (9) |    17.71 |   0.00 |  0.0 |       0.12 |      N/A |   0.57 |  3.2 |    1458.04 |   750.68 |
gpuarrays/uniformscaling          (2) |     8.12 |   0.00 |  0.0 |       0.01 |      N/A |   0.26 |  3.2 |     762.20 |   737.97 |
array                             (4) |    73.13 |   0.20 |  0.3 |       5.20 |      N/A |   2.67 |  3.7 |    6785.38 |   784.63 |
cublas                            (7) |    75.23 |   0.35 |  0.5 |      11.12 |      N/A |   3.19 |  4.2 |    8831.23 |  1199.72 |
gpuarrays/iterator constructors  (14) |    19.32 |   0.00 |  0.0 |       0.02 |      N/A |   0.62 |  3.2 |    1829.22 |  1278.43 |
gpuarrays/base                    (8) |    15.86 |   0.00 |  0.0 |      17.61 |      N/A |   0.60 |  3.8 |    1872.71 |  1348.77 |
cusolver                         (10) |    83.64 |   0.42 |  0.5 |    1229.85 |      N/A |   3.46 |  4.1 |    8887.82 |  1574.50 |
gpuarrays/random                  (5) |    19.85 |   0.00 |  0.0 |       0.00 |      N/A |   0.53 |  2.7 |    1656.92 |  1505.24 |
device\intrinsics                 (3) |    67.58 |   0.00 |  0.0 |       0.01 |      N/A |   1.30 |  1.9 |    5025.86 |  1201.33 |
gpuarrays/linear algebra         (11) |    48.93 |   0.01 |  0.0 |       1.42 |      N/A |   1.53 |  3.1 |    5037.09 |  1463.40 |
examples                         (12) |   124.64 |   0.00 |  0.0 |       0.00 |      N/A |   0.08 |  0.1 |     132.75 |   562.43 |
gpuarrays/broadcasting            (2) |    60.48 |   0.00 |  0.0 |       1.19 |      N/A |   2.02 |  3.3 |    7448.78 |   853.78 |
execution                        (13) |         failed at 2020-07-24T20:14:35.877
gpuarrays/mapreduce essentials    (9) |    89.02 |   0.01 |  0.0 |       3.19 |      N/A |   2.43 |  2.7 |   10171.24 |   980.95 |
gpuarrays/mapreduce (old tests)   (7) |    92.07 |   0.01 |  0.0 |     130.20 |      N/A |   2.21 |  2.4 |   10276.32 |  1506.56 |
gpuarrays/mapreduce derivatives   (4) |   126.02 |   0.01 |  0.0 |       3.06 |      N/A |   3.02 |  2.4 |   13688.27 |  1117.89 |
Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19727(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1023020288 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 
= bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 370085648 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                         ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to 
i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
Worker 13 failed running test execution:
Some tests did not pass: 72 passed, 8 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:569
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:575
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:576
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:582
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:583
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:609
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:610
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

Test Summary:                     | Pass  Fail  Broken  Total
  Overall                         | 8167     9       2   8178
    initialization                |   11                   11
    apiutils                      |   15                   15
    iterator                      |   30                   30
    memory                        |   10                   10
    codegen                       |   17     1             18
    nnlib                         |    3                    3
    nvml                          |    7                    7
    nvtx                          |                     No tests
    pointer                       |   13                   13
    curand                        |  102                  102
    statistics                    |   12                   12
    utils                         |    5                    5
    cufft                         |  151                  151
    cudadrv\context               |   12                   12
    cudadrv\devices               |    5                    5
    broadcast                     |   29                   29
    cudadrv\errors                |    6                    6
    cudadrv\events                |    6                    6
    cudadrv\module                |   11                   11
    cudadrv\occupancy             |    1                    1
    cudadrv\profile               |    2                    2
    cudadrv\execution             |   15                   15
    cudadrv\version               |    3                    3
    cudadrv\stream                |    7                    7
    cudadrv\memory                |   50             1     51
    device\array                  |   20                   20
    texture                       |   26             1     27
    device\pointer                |   57                   57
    gpuarrays/math                |    8                    8
    gpuarrays/input output        |    5                    5
    cusparse                      |  468                  468
    threading                     |                     No tests
    forwarddiff                   |  106                  106
    gpuarrays/interface           |    7                    7
    cusolver\cusparse             |   84                   84
    gpuarrays/value constructors  |  120                  120
    gpuarrays/fft                 |   12                   12
    gpuarrays/conversions         |   72                   72
    gpuarrays/constructors        |  335                  335
    gpuarrays/indexing            |  113                  113
    gpuarrays/uniformscaling      |   56                   56
    array                         |  154                  154
    cublas                        | 1885                 1885
    gpuarrays/iterator constructors |   24                   24
    gpuarrays/base                |   38                   38
    cusolver                      | 1493                 1493
    gpuarrays/random              |   40                   40
    device\intrinsics             |  265                  265
    gpuarrays/linear algebra      |  393                  393
    examples                      |    7                    7
    gpuarrays/broadcasting        |  155                  155
    execution                     |   72     8             80
    gpuarrays/mapreduce essentials |  522                  522
    gpuarrays/mapreduce (old tests) |  297                  297
    gpuarrays/mapreduce derivatives |  810                  810
    FAILURE

Error in testset codegen:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19727(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1023020288 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 
= bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 370085648 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                         ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to 
i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:569
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:575
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:576
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:582
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:583
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:609
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:610
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
ERROR: LoadError: Test run finished with errors
in expression starting at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:475

Interesting failures. The assume one I hadn't seen before; it's odd that your LLVM optimizes differently, but that functionality isn't actually used, so the failure is harmless. The other failures are harmless too, although they might indicate a CUDA driver bug. I've created https://github.com/JuliaGPU/CUDA.jl/pull/319 to make those tests more robust, which should hopefully make them pass on your system.
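
For context, the gist of such a relaxation (a sketch only, not the actual diff in PR #319): accept the kernel exception whether it surfaces directly on stderr or, as in the logs above, only via a finalizer message. The helper below is hypothetical, not CUDA.jl API:

using Test

# Strict form the test currently uses:
#   @test occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
# Relaxed form: either message counts as the exception having been reported.
kernel_exception_seen(err) =
    occursin("exception thrown during kernel execution on device", err) ||
    occursin("error in running finalizer: CUDA.KernelException", err)

# With the output seen above, the relaxed check passes:
err = "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n"
@test kernel_exception_seen(err)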

I still see errors in the tests (this run is against a dev'd copy of CUDA):

Status `C:\Users\andre\AppData\Local\Temp\jl_hFTchO\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 [`C:\Users\andre\.julia\dev\CUDA`]
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.20
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.2
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.21
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.29
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA 
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using 1 device(s): 2. GeForce RTX 2080 Ti (UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
apiutils                          (3) |     0.56 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.6 |      85.74 |   541.16 |
initialization                    (2) |     3.57 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  1.8 |     199.58 |   566.14 |
iterator                          (2) |     2.69 |   0.29 | 10.7 |       1.25 |      N/A |   0.11 |  4.2 |     283.23 |   566.14 |
memory                            (2) |     1.67 |   0.00 |  0.0 |       0.00 |      N/A |   0.36 | 21.8 |     146.26 |   566.14 |
codegen                           (6) |         failed at 2020-07-27T14:09:49.998
nnlib                             (2) |    14.42 |   0.15 |  1.0 |       0.00 |      N/A |   0.56 |  3.9 |    1549.00 |  1073.51 |
nvml                              (2) |     0.51 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  5.6 |      58.10 |  1073.51 |
nvtx                              (2) |     0.48 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.3 |      74.69 |  1073.51 |
pointer                           (2) |     0.12 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       7.09 |  1073.51 |
curand                            (9) |    23.46 |   0.29 |  1.2 |       0.02 |      N/A |   1.25 |  5.3 |    2369.98 |   604.86 |
statistics                        (2) |    16.12 |   0.00 |  0.0 |       0.00 |      N/A |   0.84 |  5.2 |    1625.56 |  1090.46 |
broadcast                         (5) |    38.18 |   0.30 |  0.8 |       0.00 |      N/A |   1.58 |  4.1 |    3502.24 |   675.89 |
cufft                             (8) |    39.44 |   0.30 |  0.8 |     133.23 |      N/A |   1.98 |  5.0 |    4259.85 |  1181.87 |
utils                             (5) |     1.21 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  5.9 |     151.43 |   675.89 |
cudadrv\errors                    (5) |     0.16 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      22.49 |   675.89 |
cudadrv\devices                   (8) |     0.37 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      39.87 |  1181.87 |
cudadrv\events                    (5) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      14.25 |   675.89 |
cudadrv\execution                 (8) |     1.47 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  1.1 |      74.55 |  1183.48 |
threading                         (2) |     4.16 |   0.92 | 22.0 |      56.25 |      N/A |   0.11 |  2.6 |     469.27 |  1218.40 |
cudadrv\module                    (8) |     0.59 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  2.1 |      42.89 |  1185.80 |
cudadrv\occupancy                 (2) |     0.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       8.42 |  1218.86 |
cudadrv\profile                   (8) |     0.31 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  4.7 |      49.02 |  1185.80 |
cudadrv\stream                    (2) |     0.30 |   0.00 |  0.2 |       0.00 |      N/A |   0.05 | 17.8 |      29.84 |  1218.96 |
cudadrv\version                   (8) |     0.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       0.08 |  1185.80 |
cudadrv\memory                    (5) |     2.89 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.0 |     211.58 |   678.09 |
cudadrv\context                  (14) |     1.18 |   0.00 |  0.0 |       0.00 |      N/A |   0.04 |  3.3 |     121.68 |   616.20 |
device\array                      (8) |     3.60 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  2.4 |     336.02 |  1196.71 |
texture                           (9) |    25.67 |   0.00 |  0.0 |       0.08 |      N/A |   1.07 |  4.2 |    2776.62 |   734.56 |
gpuarrays/math                    (9) |     2.83 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.3 |     314.90 |   734.91 |
gpuarrays/input output            (9) |     1.29 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  1.7 |     127.70 |   734.91 |
cusparse                         (11) |    57.69 |   0.31 |  0.5 |       4.46 |      N/A |   2.54 |  4.4 |    6234.63 |   967.88 |
forwarddiff                       (3) |    59.99 |   0.38 |  0.6 |       0.00 |      N/A |   1.47 |  2.5 |    3475.23 |   719.55 |
cusolver\cusparse                 (2) |    17.83 |   0.01 |  0.0 |       0.19 |      N/A |   0.77 |  4.3 |    1926.42 |  1684.84 |
gpuarrays/interface              (11) |     2.81 |   0.00 |  0.0 |       0.00 |      N/A |   0.10 |  3.5 |     263.06 |   975.27 |
gpuarrays/value constructors      (9) |     7.52 |   0.00 |  0.0 |       0.00 |      N/A |   0.20 |  2.7 |     650.63 |   753.97 |
device\pointer                   (14) |    17.74 |   0.29 |  1.6 |       0.00 |      N/A |   0.82 |  4.6 |    1818.78 |   616.20 |
gpuarrays/conversions             (9) |     4.16 |   0.00 |  0.0 |       0.01 |      N/A |   0.21 |  5.0 |     581.19 |   767.15 |
gpuarrays/indexing                (8) |    19.63 |   0.00 |  0.0 |       0.13 |      N/A |   0.71 |  3.6 |    1936.64 |  1286.61 |
gpuarrays/constructors            (9) |     1.53 |   0.00 |  0.2 |       0.04 |      N/A |   0.03 |  1.7 |      86.16 |   777.34 |
gpuarrays/uniformscaling          (2) |     7.45 |   0.00 |  0.0 |       0.01 |      N/A |   0.20 |  2.7 |     630.02 |  1684.84 |
array                             (4) |    71.59 |   0.30 |  0.4 |       5.20 |      N/A |   2.88 |  4.0 |    6732.22 |   774.98 |
gpuarrays/fft                    (14) |     7.95 |   0.00 |  0.0 |       6.01 |      N/A |   0.33 |  4.2 |     872.91 |  1176.69 |
cublas                            (7) |    76.81 |   0.36 |  0.5 |      11.12 |      N/A |   3.34 |  4.3 |    8808.45 |  1240.90 |
gpuarrays/iterator constructors   (3) |    19.00 |   0.00 |  0.0 |       0.02 |      N/A |   0.73 |  3.8 |    2004.91 |   758.60 |
cusolver                         (10) |    82.26 |   0.40 |  0.5 |    1229.85 |      N/A |   3.64 |  4.4 |    8737.55 |  1571.16 |
gpuarrays/base                    (9) |    16.35 |   0.00 |  0.0 |      17.61 |      N/A |   0.75 |  4.6 |    1867.16 |   856.94 |
gpuarrays/random                  (8) |    20.07 |   0.00 |  0.0 |       0.00 |      N/A |   0.51 |  2.6 |    1659.26 |  1369.74 |
device\intrinsics                 (5) |    67.23 |   0.00 |  0.0 |       0.01 |      N/A |   1.40 |  2.1 |    4967.24 |   859.46 |
gpuarrays/linear algebra         (11) |    57.55 |   0.01 |  0.0 |       1.42 |      N/A |   1.49 |  2.6 |    5148.27 |  1519.13 |
examples                         (12) |   127.03 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  0.1 |     132.75 |   555.79 |
gpuarrays/broadcasting            (4) |    57.83 |   0.00 |  0.0 |       1.19 |      N/A |   1.97 |  3.4 |    7069.32 |   933.43 |
execution                        (13) |         failed at 2020-07-27T14:11:53.175
gpuarrays/mapreduce (old tests)   (7) |    84.84 |   0.01 |  0.0 |     130.20 |      N/A |   2.53 |  3.0 |   10279.09 |  1458.08 |
gpuarrays/mapreduce essentials    (2) |    93.39 |   0.01 |  0.0 |       3.19 |      N/A |   3.03 |  3.2 |   12281.01 |  1899.16 |
gpuarrays/mapreduce derivatives  (14) |   127.34 |   0.01 |  0.0 |       3.06 |      N/A |   3.44 |  2.7 |   15804.64 |  1610.81 |
Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1035379872 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = 
bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1033654928 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' 
@ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                        ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
Worker 13 failed running test execution:
Some tests did not pass: 78 passed, 2 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:617
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:618
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

Test Summary:                     | Pass  Fail  Broken  Total
  Overall                         | 8173     3       2   8178
    apiutils                      |   15                   15
    initialization                |   11                   11
    iterator                      |   30                   30
    memory                        |   10                   10
    codegen                       |   17     1             18
    nnlib                         |    3                    3
    nvml                          |    7                    7
    nvtx                          |                     No tests
    pointer                       |   13                   13
    curand                        |  102                  102
    statistics                    |   12                   12
    broadcast                     |   29                   29
    cufft                         |  151                  151
    utils                         |    5                    5
    cudadrv\errors                |    6                    6
    cudadrv\devices               |    5                    5
    cudadrv\events                |    6                    6
    cudadrv\execution             |   15                   15
    threading                     |                     No tests
    cudadrv\module                |   11                   11
    cudadrv\occupancy             |    1                    1
    cudadrv\profile               |    2                    2
    cudadrv\stream                |    7                    7
    cudadrv\version               |    3                    3
    cudadrv\memory                |   50             1     51
    cudadrv\context               |   12                   12
    device\array                  |   20                   20
    texture                       |   26             1     27
    gpuarrays/math                |    8                    8
    gpuarrays/input output        |    5                    5
    cusparse                      |  468                  468
    forwarddiff                   |  106                  106
    cusolver\cusparse             |   84                   84
    gpuarrays/interface           |    7                    7
    gpuarrays/value constructors  |  120                  120
    device\pointer                |   57                   57
    gpuarrays/conversions         |   72                   72
    gpuarrays/indexing            |  113                  113
    gpuarrays/constructors        |  335                  335
    gpuarrays/uniformscaling      |   56                   56
    array                         |  154                  154
    gpuarrays/fft                 |   12                   12
    cublas                        | 1885                 1885
    gpuarrays/iterator constructors |   24                   24
    cusolver                      | 1493                 1493
    gpuarrays/base                |   38                   38
    gpuarrays/random              |   40                   40
    device\intrinsics             |  265                  265
    gpuarrays/linear algebra      |  393                  393
    examples                      |    7                    7
    gpuarrays/broadcasting        |  155                  155
    execution                     |   78     2             80
    gpuarrays/mapreduce (old tests) |  297                  297
    gpuarrays/mapreduce essentials |  522                  522
    gpuarrays/mapreduce derivatives |  810                  810
    FAILURE

Error in testset codegen:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1035379872 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = 
bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1033654928 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' 
@ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                        ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Error in testset execution:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:617
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:618
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
ERROR: LoadError: Test run finished with errors
in expression starting at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:475

And I double-checked: I did get the needed code changes.
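To verify, I looked at the dev checkout's revision from the REPL, something like:

julia> run(`git -C C:/Users/andre/.julia/dev/CUDA log --oneline -1`)

(This assumes git is on PATH; the path is where Pkg dev'd CUDA, per the manifest above.)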

Yes, those are the llvm.assume differences I mentioned.
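
For anyone following along, here is a rough reconstruction of what that codegen test checks, pieced together from the IR in the logs (the actual test source may differ; the names bar, CUDA.assume, and the reflection call are taken from the traces above, not verified against the test file):

using CUDA, Test

# assume(x > 0) should let LLVM prove the divisor in cld(42, x) is nonzero
# and delete the division-by-zero branch, so the generated IR should contain
# no call to gpu_report_exception.
function bar(x)
    CUDA.assume(x > 0)
    return cld(42, x)   # cld lowers to a div, which normally carries a zero check
end

ir = sprint(io -> CUDA.code_llvm(io, bar, Tuple{Int64}))
@test !occursin("gpu_report_exception", ir)   # the failing assertion

Whether llvm.assume actually eliminates that branch evidently differs between LLVM builds/platforms, which would explain why the check only fails on this machine.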