CUDA.jl tests on Windows

I'm trying to migrate to CUDA.jl, but I can't get the tests to pass (with either the artifact download or a local toolkit). I'm not 100% sure whether this is a bug in CUDA.jl or something I missed. Are the tests meant to pass on Windows (I understand the main dev environment for this library is Linux)?

I created an empty environment with only the CUDA and Distributed packages.
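(Roughly like this; a sketch, with "cuda-test" as a placeholder environment name:)

using Pkg
Pkg.activate("cuda-test")         # fresh, empty environment
Pkg.add(["CUDA", "Distributed"])  # the only two direct dependencies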

Test with artifacts:

julia> Pkg.test("CUDA")
    Testing CUDA
Status `C:\Users\andre\AppData\Local\Temp\jl_OtYeSM\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.18
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.16
  [476501e8] SLEEFPirates v0.5.4
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.24
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets 
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Warning: NNPACK not available for your platform: Windows(x86_64-w64-mingw32-libgfortran5-cxx11)
│         You will be able to use only the default Julia NNlib backend
└ @ NNlib C:\Users\andre\.julia\packages\NNlib\sSn9M\src\NNlib.jl:14
┌ Debug: Initializing CUDA driver
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:91
┌ Debug: Trying to use artifacts...
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:121
┌ Debug: Selecting artifacts based on driver version 11.0.0
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:130
ERROR: LoadError: AssertionError: isfile(__libcupti[])
Stacktrace:
 [1] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:157
 [2] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:314
 [3] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:112
 [4] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:32
 [5] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
 [6] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:26
 [7] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:19
 [8] macro expansion at C:\Users\andre\.julia\packages\CUDA\h38pe\src\initialization.jl:50 [inlined]
 [9] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\h38pe\deps\bindeps.jl:32
 [10] top-level scope at C:\Users\andre\.julia\packages\CUDA\h38pe\test\runtests.jl:112
 [11] include(::String) at .\client.jl:439
 [12] top-level scope at none:6
in expression starting at C:\Users\andre\.julia\packages\CUDA\h38pe\test\runtests.jl:112
ERROR: Package CUDA errored during testing
Stacktrace:
 [1] pkgerror(::String, ::Vararg{String,N} where N) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Types.jl:53   
 [2] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Operations.jl:1510
 [3] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:316
 [4] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:303
 [5] #test#68 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:297 [inlined]
 [6] test at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:297 [inlined]
 [7] #test#67 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:296 [inlined]
 [8] test at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:296 [inlined]
 [9] test(::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:295
 [10] test(::String) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\API.jl:295
 [11] top-level scope at none:0


OK, maybe there is something going on with the artifact download. So I tried using the libraries that come with the local CUDA installation from NVIDIA instead.
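Roughly how I switched to the local toolkit (a sketch; the relevant switch is JULIA_CUDA_USE_BINARYBUILDER, which also shows up under Environment in the log below):

using Pkg
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"  # skip artifacts and discover the local NVIDIA toolkit
Pkg.test("CUDA")

Here is what I get: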

Status `C:\Users\andre\AppData\Local\Temp\jl_zbZb7L\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2     
  [b552c78f] DiffRules v1.0.1       
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.18
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.16
  [476501e8] SLEEFPirates v0.5.4
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.24
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown 
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, local installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: missing
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ - JULIA_CUDA_USE_BINARYBUILDER: false
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                    (2) |         failed at 2020-07-20T15:51:54.047
apiutils                          (3) |         failed at 2020-07-20T15:51:54.253
codegen                           (6) |         failed at 2020-07-20T15:52:10.367
curand                            (9) |         failed at 2020-07-20T15:52:14.689
iterator                         (15) |         failed at 2020-07-20T15:52:22.522
broadcast                         (5) |         failed at 2020-07-20T15:52:30.061

Enabling debug output gives a much larger printout than Discourse allows, so here it is: https://pastebin.com/embed_js/gdFEmLXZ
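(I enabled the debug lines via the JULIA_CUDA_VERBOSE flag that is listed under Environment in the log above, roughly:)

using Pkg
ENV["JULIA_CUDA_VERBOSE"] = "true"  # verbose CUDA.jl initialization/debug logging
Pkg.test("CUDA")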

https://github.com/JuliaGPU/CUDA.jl/issues/300

Should be fixed on CUDA master; it would be useful if you could verify that (this also requires GPUArrays#master).
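For example (a sketch using Pkg):

using Pkg
Pkg.add(PackageSpec(name="CUDA", rev="master"))
Pkg.add(PackageSpec(name="GPUArrays", rev="master"))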

With the local installation I now get:

┌ Error: Exception while generating log record in module CUDA at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96
│   exception =
│    MethodError: no method matching iterate(::typeof(names))
│    Closest candidates are:
│      iterate(::Core.SimpleVector) at essentials.jl:603
│      iterate(::Core.SimpleVector, ::Any) at essentials.jl:603
│      iterate(::ExponentialBackOff) at error.jl:253
│      ...
│    Stacktrace:
│     [1] join(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Function, ::String) at .\strings\io.jl:294
│     [2] sprint(::Function, ::Function, ::Vararg{Any,N} where N; context::Nothing, sizehint::Int64) at .\strings\io.jl:105
│     [3] sprint(::Function, ::Function, ::Vararg{Any,N} where N) at .\strings\io.jl:101
│     [4] join(::Function, ::String) at .\strings\io.jl:301
│     [5] macro expansion at .\logging.jl:322 [inlined]
│     [6] find_binary(::String; locations::Array{String,1}) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96
│     [7] find_cuda_binary at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:224 [inlined]
│     [8] use_local_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:189
│     [9] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:313
│     [10] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [11] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [12] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [13] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [14] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [15] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [16] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [17] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [18] include(::String) at .\client.jl:439
│     [19] top-level scope at none:6
│     [20] eval(::Module, ::Any) at .\boot.jl:331
│     [21] exec_options(::Base.JLOptions) at .\client.jl:264
│     [22] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:96

Double-checking that CUDA and GPUArrays are on #master:

  [621f4979] AbstractFFTs v0.5.0   
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 #master (https://github.com/JuliaGPU/CUDA.jl)
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25

With artifacts, I first get this:

┌ Debug: Selecting artifacts based on driver version 11.0.0
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:138
┌ Debug: CUDA toolkit identified as 11.0.194
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\discovery.jl:322
┌ Debug: Using CUDA 11.0.194 from an artifact at C:\Users\andre\.julia\artifacts\bbabcb11fa2760ee54b821c074b8b77699c260b3
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:177
┌ Debug: Could not use CUDNN from artifacts
│   exception =
│    KeyError: key v"11.0.0" not found
│    Stacktrace:
│     [1] getindex at .\dict.jl:477 [inlined]
│     [2] use_artifact_cudnn(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:248
│     [3] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:179
│     [4] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [5] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [6] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [7] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [8] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [9] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [10] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [11] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [12] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [13] include(::String) at .\client.jl:439
│     [14] top-level scope at none:6
│     [15] eval(::Module, ::Any) at .\boot.jl:331
│     [16] exec_options(::Base.JLOptions) at .\client.jl:264
│     [17] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:250
┌ Debug: Could not use CUTENSOR from artifacts
│   exception =
│    Cannot locate artifact 'CUTENSOR_CUDA110' in 'C:\Users\andre\.julia\packages\CUDA\B3eVM\Artifacts.toml'
│    Stacktrace:
│     [1] error(::String) at .\error.jl:33
│     [2] do_artifact_str(::String, ::Dict{String,Any}, ::String, ::Module) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Artifacts.jl:1019
│     [3] #invokelatest#1 at .\essentials.jl:712 [inlined]
│     [4] invokelatest at .\essentials.jl:711 [inlined]
│     [5] macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Artifacts.jl:1058 [inlined]
│     [6] (::CUDA.var"#672#675")() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:272
│     [7] use_artifact_cutensor(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:279
│     [8] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:180
│     [9] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [10] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [11] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [12] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [13] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [14] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [15] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [16] toolkit_release() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:32
│     [17] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\runtests.jl:112
│     [18] include(::String) at .\client.jl:439
│     [19] top-level scope at none:6
│     [20] eval(::Module, ::Any) at .\boot.jl:331
│     [21] exec_options(::Base.JLOptions) at .\client.jl:264
│     [22] _start() at .\client.jl:484
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:281
┌ Debug: Toolchain with LLVM 8.0.1, CUDA driver 11.0.0 and toolkit 11.0.194 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2 and 7.5; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1 and 6.3
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\compatibility.jl:239
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma

and then more of these:

┌ Debug: Could not use CUDNN from artifacts
│   exception =
│    KeyError: key v"11.0.0" not found
│    Stacktrace:
│     [1] getindex at .\dict.jl:477 [inlined]
│     [2] use_artifact_cudnn(::VersionNumber) at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:248
│     [3] use_artifact_cuda() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:179
│     [4] __init_dependencies__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:309
│     [5] __runtime_init__() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:112
│     [6] (::CUDA.var"#609#610"{Bool})() at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:32
│     [7] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at .\lock.jl:161
│     [8] _functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:26
│     [9] functional(::Bool) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:19
│     [10] libcuda at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\initialization.jl:50 [inlined]
│     [11] (::CUDA.var"#714#cache_fptr!#47")() at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:31
│     [12] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:39 [inlined]
│     [13] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\libcuda.jl:149 [inlined]
│     [14] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\error.jl:108 [inlined]
│     [15] cuCtxGetCurrent(::Base.RefValue{Ptr{Nothing}}) at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\utils\call.jl:93
│     [16] CuCurrentContext at C:\Users\andre\.julia\packages\CUDA\B3eVM\lib\cudadrv\context.jl:81 [inlined]
│     [17] initialize_thread(::Int64) at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\state.jl:51
│     [18] prepare_cuda_call at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\state.jl:34 [inlined]
│     [19] macro expansion at C:\Users\andre\.julia\packages\CUDA\B3eVM\src\pool.jl:432 [inlined]
│     [20] top-level scope at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\setup.jl:46
│     [21] eval at .\boot.jl:331 [inlined]
│     [22] runtests(::Function, ::String, ::Bool, ::Nothing) at C:\Users\andre\.julia\packages\CUDA\B3eVM\test\setup.jl:56
│     [23] (::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}})() at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294
│     [24] run_work_thunk(::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}}, ::Bool) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:79
│     [25] macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294 [inlined]
│     [26] (::Distributed.var"#103#105"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at .\task.jl:358
└ @ CUDA C:\Users\andre\.julia\packages\CUDA\B3eVM\deps\bindeps.jl:250

Ah yes, a typo in the debug printing. Fixed!

This is expected, and only a debug message. There wasn’t a CUDNN for CUDA 11 last I checked (only a release candidate).

Now the local installation fails for no reason that is obvious to me:
https://pastebin.com/dn5QRyqr

With downloaded artifacts:
https://pastebin.com/fHf7eQuw

There are no errors in those logs, and your toolchain is correctly discovered. Try calling CUDA.versioninfo(). At the end of the logs there are test failures, but without a stack trace, so it's unclear what the cause is (did you press Ctrl+C?).
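That is, in a fresh session:

julia> using CUDA

julia> CUDA.versioninfo()  # prints the "System information" block seen in the test logs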

Sorry about that. Yes, I pressed Ctrl+C after the first test failure messages. So now I have run it fully.

I get a lot of these:

Status `C:\Users\andre\AppData\Local\Temp\jl_7lNIMd\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 #master (https://github.com/JuliaGPU/CUDA.jl)
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test 
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using device 2 (GeForce RTX 2080 Ti, UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |  
initialization                    (2) |         failed at 2020-07-24T13:27:34.893
apiutils                          (3) |         failed at 2020-07-24T13:27:35.226
codegen                           (6) |         failed at 2020-07-24T13:27:53.291
curand                            (9) |         failed at 2020-07-24T13:27:57.349
iterator                         (15) |         failed at 2020-07-24T13:28:04.485
cufft                             (8) |         failed at 2020-07-24T13:28:13.621
broadcast                         (5) |         failed at 2020-07-24T13:28:14.394
memory                           (16) |         failed at 2020-07-24T13:28:20.196
nvml                             (18) |         failed at 2020-07-24T13:28:30.279
cusparse                         (11) |         failed at 2020-07-24T13:28:31.791
nnlib                            (17) |         failed at 2020-07-24T13:28:37.649
nvtx                             (19) |         failed at 2020-07-24T13:28:39.256
pointer                          (20) |         failed at 2020-07-24T13:28:40.567
array                             (4) |         failed at 2020-07-24T13:28:47.409
cublas                            (7) |         failed at 2020-07-24T13:28:50.406
forwarddiff                      (14) |         failed at 2020-07-24T13:28:57.716
cusolver                         (10) |         failed at 2020-07-24T13:28:59.912
utils                            (24) |         failed at 2020-07-24T13:29:05.855
cudadrv\context                  (25) |         failed at 2020-07-24T13:29:06.031
cudadrv\devices                  (26) |         failed at 2020-07-24T13:29:06.953
threading                        (23) |         failed at 2020-07-24T13:29:14.078
cudadrv\errors                   (27) |         failed at 2020-07-24T13:29:14.339
statistics                       (21) |         failed at 2020-07-24T13:29:15.874
cudadrv\events                   (28) |         failed at 2020-07-24T13:29:16.084
cudadrv\execution                (29) |         failed at 2020-07-24T13:29:26.568
cudadrv\memory                   (30) |         failed at 2020-07-24T13:29:27.926
cudadrv\occupancy                (32) |         failed at 2020-07-24T13:29:32.272
cudadrv\module                   (31) |         failed at 2020-07-24T13:29:32.563
cudadrv\profile                  (33) |         failed at 2020-07-24T13:29:33.742
texture                          (22) |         failed at 2020-07-24T13:29:35.591
cudadrv\stream                   (34) |         failed at 2020-07-24T13:29:39.586
cudadrv\version                  (35) |         failed at 2020-07-24T13:29:40.345
device\array                     (37) |         failed at 2020-07-24T13:29:58.933
examples                         (12) |         failed at 2020-07-24T13:29:59.271
gpuarrays/input output           (42) |         failed at 2020-07-24T13:30:01.632
device\pointer                   (39) |         failed at 2020-07-24T13:30:11.444
cusolver\cusparse                (36) |         failed at 2020-07-24T13:30:13.575
gpuarrays/math                   (41) |         failed at 2020-07-24T13:30:14.664
execution                        (13) |         failed at 2020-07-24T13:30:17.007
gpuarrays/interface              (44) |         failed at 2020-07-24T13:30:19.197
gpuarrays/value constructors     (43) |         failed at 2020-07-24T13:30:23.082
gpuarrays/indexing               (40) |         failed at 2020-07-24T13:30:30.843
gpuarrays/conversions            (48) |         failed at 2020-07-24T13:30:33.976
gpuarrays/iterator constructors  (45) |         failed at 2020-07-24T13:30:38.571
gpuarrays/constructors           (50) |         failed at 2020-07-24T13:30:41.724
gpuarrays/uniformscaling         (46) |         failed at 2020-07-24T13:30:45.063
gpuarrays/fft                    (49) |         failed at 2020-07-24T13:30:45.665
gpuarrays/base                   (52) |         failed at 2020-07-24T13:31:12.59
gpuarrays/random                 (51) |         failed at 2020-07-24T13:31:14.822
device\intrinsics                (38) |         failed at 2020-07-24T13:31:16.072
gpuarrays/linear algebra         (47) |         failed at 2020-07-24T13:31:33.447
gpuarrays/broadcasting           (54) |         failed at 2020-07-24T13:31:57.442
gpuarrays/mapreduce essentials   (53) |         failed at 2020-07-24T13:32:22.903
gpuarrays/mapreduce (old tests)  (56) |         failed at 2020-07-24T13:32:40.311
gpuarrays/mapreduce derivatives  (55) |         failed at 2020-07-24T13:33:17.178
initialization: Error During Test at none:1
  Test threw exception
  Expression: initialization
  On worker 2:
  InexactError: check_top_bit(Int64, 18446744073709551615)
  throw_inexacterror at .\boot.jl:557
  check_top_bit at .\boot.jl:571 [inlined]
  toInt64 at .\boot.jl:632 [inlined]
  Int64 at .\boot.jl:707 [inlined]
  #448 at .\none:0 [inlined]
  iterate at .\generator.jl:47 [inlined]
  _all at .\reduce.jl:819 [inlined]
  all at .\reduce.jl:815 [inlined]
  Dict at .\dict.jl:130
  compute_processes at C:\Users\andre\.julia\packages\CUDA\Wfh4d\lib\nvml\device.jl:123 [inlined]
  runtests at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\setup.jl:64
  #104 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294
  run_work_thunk at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:79       
  macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:294 [inlined]
  #103 at .\task.jl:358

I also get failures of this type:

Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\packages\\CUDA\\Wfh4d\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\packages\\CUDA\\Wfh4d\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 381362496 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1031050640 to %jl_value_t*) 
to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ 
div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:                                             ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; 
││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\runtests.jl:442
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

Then a lot of these:

Worker 13 failed running test execution:
Some tests did not pass: 72 passed, 8 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\packages\CUDA\Wfh4d\test\runtests.jl:442
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

The final result:

Test Summary:                     | Pass  Fail  Error  Total
  Overall                         |   89     9     53    151
    initialization                |                 1      1
    apiutils                      |                 1      1
    codegen                       |   17     1            18
    curand                        |                 1      1
    iterator                      |                 1      1
    cufft                         |                 1      1
    broadcast                     |                 1      1
    memory                        |                 1      1
    nvml                          |                 1      1
    cusparse                      |                 1      1
    nnlib                         |                 1      1
    nvtx                          |                 1      1
    pointer                       |                 1      1
    array                         |                 1      1
    cublas                        |                 1      1
    forwarddiff                   |                 1      1
    cusolver                      |                 1      1
    utils                         |                 1      1
    cudadrv\context               |                 1      1
    cudadrv\devices               |                 1      1
    threading                     |                 1      1
    cudadrv\errors                |                 1      1
    statistics                    |                 1      1
    cudadrv\events                |                 1      1
    cudadrv\execution             |                 1      1
    cudadrv\memory                |                 1      1
    cudadrv\occupancy             |                 1      1
    cudadrv\module                |                 1      1
    cudadrv\profile               |                 1      1
    texture                       |                 1      1
    cudadrv\stream                |                 1      1
    cudadrv\version               |                 1      1
    device\array                  |                 1      1
    examples                      |                 1      1
    gpuarrays/input output        |                 1      1
    device\pointer                |                 1      1
    cusolver\cusparse             |                 1      1
    gpuarrays/math                |                 1      1
    execution                     |   72     8            80
    gpuarrays/interface           |                 1      1
    gpuarrays/value constructors  |                 1      1
    gpuarrays/indexing            |                 1      1
    gpuarrays/conversions         |                 1      1
    gpuarrays/iterator constructors |                 1      1
    gpuarrays/constructors        |                 1      1
    gpuarrays/uniformscaling      |                 1      1
    gpuarrays/fft                 |                 1      1
    gpuarrays/base                |                 1      1
    gpuarrays/random              |                 1      1
    device\intrinsics             |                 1      1
    gpuarrays/linear algebra      |                 1      1
    gpuarrays/broadcasting        |                 1      1
    gpuarrays/mapreduce essentials |                 1      1
    gpuarrays/mapreduce (old tests) |                 1      1
    gpuarrays/mapreduce derivatives |                 1      1

In case I missed something important, here is the whole thing:

https://pastebin.com/hvszZz9i

Weirdly, the process IDs returned there aren’t valid. Could you try the following code:

julia> using CUDA

julia> CuArray([1]); # to initialize CUDA

julia> using .NVML

julia> count_ref = Ref{Cuint}(0);

julia> dev = first(NVML.devices())
CUDA.NVML.Device(Ptr{Nothing} @0x00007fde5d7aae38)

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, C_NULL)
ERROR: NVMLError: Insufficient Size (code 7)

julia> # NVML_ERROR_INSUFFICIENT_SIZE is expected here

julia> count_ref[]
0x00000003

julia> infos = Vector{NVML.nvmlProcessInfo_t}(undef, count_ref[]);

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, infos)

julia> infos
3-element Array{CUDA.NVML.nvmlProcessInfo_st,1}:
 CUDA.NVML.nvmlProcessInfo_st(0x0000cb97, 0x000000000cd00000)
 CUDA.NVML.nvmlProcessInfo_st(0x0000fe64, 0x0000000008f00000)
 CUDA.NVML.nvmlProcessInfo_st(0x00010c92, 0x0000000007500000)

Note that the failures are in the code that reports test output; the actual functionality is likely working perfectly.

julia> using CUDA

julia> CuArray([1]); # to initialize CUDA

julia> using .NVML

julia> count_ref = Ref{Cuint}(0);

julia> dev = first(NVML.devices())
CUDA.NVML.Device(Ptr{Nothing} @0x00007ffd42659c20)

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, C_NULL)

julia> count_ref[]
0x00000000

julia> infos = Vector{NVML.nvmlProcessInfo_t}(undef, count_ref[]);

julia> NVML.nvmlDeviceGetComputeRunningProcesses(dev, count_ref, infos)

julia> infos
0-element Array{CUDA.NVML.nvmlProcessInfo_st,1}

Looks like CUDA didn’t initialize. Can you add a @show infos in compute_processes before it errors on line 123 (and then run the tests)?

All workers seem to report this:

From worker 2:    infos = CUDA.NVML.nvmlProcessInfo_st[CUDA.NVML.nvmlProcessInfo_st(0x00002fb0, 0xffffffffffffffff), 
CUDA.NVML.nvmlProcessInfo_st(0x00004e94, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x000084ec, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00002d28, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x0000808c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00009114, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x0000971c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00000fe8, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00003b30, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00004e98, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00007a1c, 0xffffffffffffffff), CUDA.NVML.nvmlProcessInfo_st(0x00002084, 0xffffffffffffffff)]

Aha, it’s not the PID but the usedGpuMemory that’s unavailable here. I’ll push a fix.
https://github.com/JuliaGPU/CUDA.jl/pull/314

       unsigned long long usedGpuMemory
           Amount of used GPU memory in bytes. Under WDDM, NVML_VALUE_NOT_AVAILABLE is always
           reported because Windows KMD manages all the memory and not the NVIDIA driver.
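That sentinel also explains the earlier InexactError: check_top_bit(Int64, 18446744073709551615): that value is 0xffffffffffffffff, i.e. NVML_VALUE_NOT_AVAILABLE, which does not fit in an Int64. A minimal sketch of the kind of guard needed (not the actual code from the PR; used_gpu_memory is a hypothetical helper):

# NVML reports usedGpuMemory as NVML_VALUE_NOT_AVAILABLE ((unsigned long long)-1)
# under WDDM, so treat that sentinel as `missing` instead of converting to Int64.
const NVML_VALUE_NOT_AVAILABLE = typemax(Culonglong)

used_gpu_memory(raw::Culonglong) =
    raw == NVML_VALUE_NOT_AVAILABLE ? missing : Int64(raw)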

Yes, the above is helping. Only some tests are failing now:

Status `C:\Users\andre\AppData\Local\Temp\jl_zJKTY5\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 [`C:\Src\cuda\cuda\dev\CUDA`]
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.19
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.1
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.17
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.25
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test 
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using 1 device(s): 2. GeForce RTX 2080 Ti (UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                    (2) |     3.65 |   0.00 |  0.0 |       0.00 |      N/A |   0.05 |  1.4 |     199.20 |   562.26 |
apiutils                          (3) |     0.81 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  3.5 |      85.70 |   552.87 |
iterator                          (3) |     2.71 |   0.27 | 10.0 |       1.25 |      N/A |   0.09 |  3.4 |     281.28 |   552.87 |
memory                            (3) |     1.53 |   0.00 |  0.0 |       0.00 |      N/A |   0.33 | 21.2 |     142.82 |   552.87 |
codegen                           (6) |         failed at 2020-07-24T20:12:39.053
nnlib                             (3) |    13.92 |   0.15 |  1.1 |       0.00 |      N/A |   0.59 |  4.2 |    1550.51 |   980.92 |
nvml                              (3) |     0.51 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.4 |      58.52 |   986.84 |
nvtx                              (3) |     0.49 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  6.8 |      63.11 |   990.04 |
pointer                           (3) |     0.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       7.06 |   990.66 |
curand                            (9) |    23.48 |   0.26 |  1.1 |       0.02 |      N/A |   1.18 |  5.0 |    2416.18 |   634.97 |
statistics                        (3) |    16.18 |   0.00 |  0.0 |       0.00 |      N/A |   0.81 |  5.0 |    1582.73 |  1046.86 |
utils                             (3) |     0.85 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.7 |     118.93 |  1046.86 |
cufft                             (8) |    38.56 |   0.27 |  0.7 |     133.23 |      N/A |   1.97 |  5.1 |    4257.42 |  1189.55 |
cudadrv\context                   (3) |     0.58 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.8 |      32.43 |  1099.50 |
cudadrv\devices                   (8) |     0.34 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      40.12 |  1189.55 |
broadcast                         (5) |    39.32 |   0.27 |  0.7 |       0.00 |      N/A |   1.56 |  4.0 |    3605.98 |   677.96 |
cudadrv\errors                    (3) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      21.96 |  1099.50 |
cudadrv\events                    (8) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      14.93 |  1189.55 |
cudadrv\module                    (8) |     0.37 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  3.2 |      46.07 |  1189.55 |
cudadrv\occupancy                 (8) |     0.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       8.55 |  1189.55 |
cudadrv\profile                   (8) |     0.30 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  4.8 |      49.13 |  1189.55 |
cudadrv\execution                 (5) |     1.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.05 |  5.4 |     108.96 |   677.96 |
cudadrv\version                   (5) |     0.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       0.08 |   677.96 |
cudadrv\stream                    (8) |     0.21 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  9.0 |      20.06 |  1189.55 |
cudadrv\memory                    (3) |     2.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  3.5 |     206.39 |  1099.50 |
device\array                      (8) |     3.58 |   0.00 |  0.0 |       0.00 |      N/A |   0.10 |  2.8 |     339.27 |  1194.57 |
texture                           (9) |    26.26 |   0.00 |  0.0 |       0.08 |      N/A |   0.92 |  3.5 |    2530.68 |   720.41 |
device\pointer                    (8) |     7.09 |   0.00 |  0.0 |       0.00 |      N/A |   0.20 |  2.9 |     568.32 |  1208.79 |
gpuarrays/math                    (8) |     2.93 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.2 |     306.95 |  1217.26 |
gpuarrays/input output            (8) |     1.19 |   0.00 |  0.0 |       0.00 |      N/A |   0.04 |  3.7 |     126.49 |  1217.26 |
cusparse                         (11) |    56.99 |   0.32 |  0.6 |       4.46 |      N/A |   2.41 |  4.2 |    6253.25 |   967.00 |
threading                        (14) |    16.95 |   4.44 | 26.2 |      56.25 |      N/A |   0.83 |  4.9 |    1801.75 |  1274.68 |
forwarddiff                       (2) |    58.55 |   0.30 |  0.5 |       0.00 |      N/A |   1.32 |  2.3 |    3334.10 |   737.97 |
gpuarrays/interface              (11) |     3.34 |   0.00 |  0.0 |       0.00 |      N/A |   0.11 |  3.4 |     253.39 |   975.28 |
cusolver\cusparse                 (5) |    19.97 |   0.00 |  0.0 |       0.19 |      N/A |   0.84 |  4.2 |    2063.14 |  1441.06 |
gpuarrays/value constructors      (8) |     7.89 |   0.00 |  0.0 |       0.00 |      N/A |   0.17 |  2.2 |     573.97 |  1238.47 |
gpuarrays/fft                     (8) |     1.24 |   0.00 |  0.1 |       6.01 |      N/A |   0.04 |  3.3 |     133.59 |  1247.52 |
gpuarrays/conversions             (5) |     4.46 |   0.00 |  0.0 |       0.01 |      N/A |   0.19 |  4.3 |     626.51 |  1448.38 |
gpuarrays/constructors            (8) |     1.42 |   0.00 |  0.2 |       0.04 |      N/A |   0.01 |  0.9 |      76.52 |  1249.82 |
gpuarrays/indexing                (9) |    17.71 |   0.00 |  0.0 |       0.12 |      N/A |   0.57 |  3.2 |    1458.04 |   750.68 |
gpuarrays/uniformscaling          (2) |     8.12 |   0.00 |  0.0 |       0.01 |      N/A |   0.26 |  3.2 |     762.20 |   737.97 |
array                             (4) |    73.13 |   0.20 |  0.3 |       5.20 |      N/A |   2.67 |  3.7 |    6785.38 |   784.63 |
cublas                            (7) |    75.23 |   0.35 |  0.5 |      11.12 |      N/A |   3.19 |  4.2 |    8831.23 |  1199.72 |
gpuarrays/iterator constructors  (14) |    19.32 |   0.00 |  0.0 |       0.02 |      N/A |   0.62 |  3.2 |    1829.22 |  1278.43 |
gpuarrays/base                    (8) |    15.86 |   0.00 |  0.0 |      17.61 |      N/A |   0.60 |  3.8 |    1872.71 |  1348.77 |
cusolver                         (10) |    83.64 |   0.42 |  0.5 |    1229.85 |      N/A |   3.46 |  4.1 |    8887.82 |  1574.50 |
gpuarrays/random                  (5) |    19.85 |   0.00 |  0.0 |       0.00 |      N/A |   0.53 |  2.7 |    1656.92 |  1505.24 |
device\intrinsics                 (3) |    67.58 |   0.00 |  0.0 |       0.01 |      N/A |   1.30 |  1.9 |    5025.86 |  1201.33 |
gpuarrays/linear algebra         (11) |    48.93 |   0.01 |  0.0 |       1.42 |      N/A |   1.53 |  3.1 |    5037.09 |  1463.40 |
examples                         (12) |   124.64 |   0.00 |  0.0 |       0.00 |      N/A |   0.08 |  0.1 |     132.75 |   562.43 |
gpuarrays/broadcasting            (2) |    60.48 |   0.00 |  0.0 |       1.19 |      N/A |   2.02 |  3.3 |    7448.78 |   853.78 |
execution                        (13) |         failed at 2020-07-24T20:14:35.877
gpuarrays/mapreduce essentials    (9) |    89.02 |   0.01 |  0.0 |       3.19 |      N/A |   2.43 |  2.7 |   10171.24 |   980.95 |
gpuarrays/mapreduce (old tests)   (7) |    92.07 |   0.01 |  0.0 |     130.20 |      N/A |   2.21 |  2.4 |   10276.32 |  1506.56 |
gpuarrays/mapreduce derivatives   (4) |   126.02 |   0.01 |  0.0 |       3.06 |      N/A |   3.02 |  2.4 |   13688.27 |  1117.89 |
Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19727(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Src\\\cuda\\cuda\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1023020288 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 
= bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 370085648 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                         ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to 
i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
Worker 13 failed running test execution:
Some tests did not pass: 72 passed, 8 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:569
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:575
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:576
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:582
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:583
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:609
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:610
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

Test Summary:                     | Pass  Fail  Broken  Total
  Overall                         | 8167     9       2   8178
    initialization                |   11                   11
    apiutils                      |   15                   15
    iterator                      |   30                   30
    memory                        |   10                   10
    codegen                       |   17     1             18
    nnlib                         |    3                    3
    nvml                          |    7                    7
    nvtx                          |                     No tests
    pointer                       |   13                   13
    curand                        |  102                  102
    statistics                    |   12                   12
    utils                         |    5                    5
    cufft                         |  151                  151
    cudadrv\context               |   12                   12
    cudadrv\devices               |    5                    5
    broadcast                     |   29                   29
    cudadrv\errors                |    6                    6
    cudadrv\events                |    6                    6
    cudadrv\module                |   11                   11
    cudadrv\occupancy             |    1                    1
    cudadrv\profile               |    2                    2
    cudadrv\execution             |   15                   15
    cudadrv\version               |    3                    3
    cudadrv\stream                |    7                    7
    cudadrv\memory                |   50             1     51
    device\array                  |   20                   20
    texture                       |   26             1     27
    device\pointer                |   57                   57
    gpuarrays/math                |    8                    8
    gpuarrays/input output        |    5                    5
    cusparse                      |  468                  468
    threading                     |                     No tests
    forwarddiff                   |  106                  106
    gpuarrays/interface           |    7                    7
    cusolver\cusparse             |   84                   84
    gpuarrays/value constructors  |  120                  120
    gpuarrays/fft                 |   12                   12
    gpuarrays/conversions         |   72                   72
    gpuarrays/constructors        |  335                  335
    gpuarrays/indexing            |  113                  113
    gpuarrays/uniformscaling      |   56                   56
    array                         |  154                  154
    cublas                        | 1885                 1885
    gpuarrays/iterator constructors |   24                   24
    gpuarrays/base                |   38                   38
    cusolver                      | 1493                 1493
    gpuarrays/random              |   40                   40
    device\intrinsics             |  265                  265
    gpuarrays/linear algebra      |  393                  393
    examples                      |    7                    7
    gpuarrays/broadcasting        |  155                  155
    execution                     |   72     8             80
    gpuarrays/mapreduce essentials |  522                  522
    gpuarrays/mapreduce (old tests) |  297                  297
    gpuarrays/mapreduce derivatives |  810                  810
    FAILURE

Error in testset codegen:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Src\\cuda\\cuda\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19727(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Src\\cuda\\cuda\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1023020288 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 
= bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 370085648 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' @ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                         ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to 
i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:568
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:569
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:575
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:576
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:582
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:583
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:609
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Src\cuda\cuda\dev\CUDA\test\execution.jl:610
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
ERROR: LoadError: Test run finished with errors
in expression starting at C:\Src\cuda\cuda\dev\CUDA\test\runtests.jl:475

Interesting failures. The `assume` one I haven't seen before; it's odd that your LLVM optimizes differently, but that functionality isn't used anywhere, so the failure is harmless. The other failures are harmless too, although they might indicate a CUDA driver bug. I've created https://github.com/JuliaGPU/CUDA.jl/pull/319 to make the test more robust, which should hopefully let it pass on your system.
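For anyone following along, the failing execution tests follow roughly this pattern: launch a child Julia process that triggers a device-side exception, then assert on its exit code and stderr. This is a minimal sketch only; the script payload and harness details are illustrative, not the actual code in test/execution.jl:

```julia
# Hedged sketch of the execution-test pattern; the real tests differ in detail.
script = """
using CUDA
function kernel(x)
    x[1] = 42 ÷ x[1]   # DivideError on device when x[1] == 0
    return
end
@cuda kernel(CuArray([0]))
synchronize()          # the device exception surfaces here
"""

out = Pipe()
proc = run(pipeline(`$(Base.julia_cmd()) -e $script`, stderr=out), wait=false)
close(out.in)
err = read(out, String)   # blocks until the child exits and closes stderr
wait(proc)

@assert proc.exitcode == 1                # child should abort...
@assert occursin("KernelException", err)  # ...and report the device exception
```

On your machine the child exits with code 0 and only a finalizer message reaches stderr, which is exactly what the `0 == 1` and `occursin` failures above show.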

I still see errors in the tests:

Status `C:\Users\andre\AppData\Local\Temp\jl_hFTchO\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.0 [`C:\Users\andre\.julia\dev\CUDA`]
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [adafc99b] CpuId v0.2.2
  [864edb3b] DataStructures v0.17.19
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [ffbed154] DocStringExtensions v0.8.2
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.8.13
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v4.0.1 #master (https://github.com/JuliaGPU/GPUArrays.jl)
  [61eb1bfa] GPUCompiler v0.5.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [bdcacae8] LoopVectorization v0.8.20
  [856f044c] MKL_jll v2020.1.216+0
  [1914dd2f] MacroTools v0.5.5
  [a6bfbf70] NNPACK_jll v2018.6.22+0
  [872c559c] NNlib v0.7.3
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.1.2
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [21efa798] SIMDPirates v0.8.21
  [476501e8] SLEEFPirates v0.5.5
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [3a884ed6] UnPack v1.0.1
  [3d5dd08c] VectorizationBase v0.12.29
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA 
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.0.0
│ NVIDIA driver 451.67.0
│ 
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: 11.0.0+451.67
│ - CUDNN: missing
│ - CUTENSOR: missing
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ Environment:
│ - JULIA_CUDA_VERBOSE: true
│ 
│ 2 device(s):
│ - GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)
└ - GeForce RTX 2080 Ti (sm_75, 10.790 GiB / 11.000 GiB available)
[ Info: Testing using 1 device(s): 2. GeForce RTX 2080 Ti (UUID 4d228dab-1287-5f37-f4db-e53f82e2bc60)
[ Info: Skipping the following tests: cudnn, cutensor, device\wmma
                                      |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                         (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
apiutils                          (3) |     0.56 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.6 |      85.74 |   541.16 |
initialization                    (2) |     3.57 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  1.8 |     199.58 |   566.14 |
iterator                          (2) |     2.69 |   0.29 | 10.7 |       1.25 |      N/A |   0.11 |  4.2 |     283.23 |   566.14 |
memory                            (2) |     1.67 |   0.00 |  0.0 |       0.00 |      N/A |   0.36 | 21.8 |     146.26 |   566.14 |
codegen                           (6) |         failed at 2020-07-27T14:09:49.998
nnlib                             (2) |    14.42 |   0.15 |  1.0 |       0.00 |      N/A |   0.56 |  3.9 |    1549.00 |  1073.51 |
nvml                              (2) |     0.51 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  5.6 |      58.10 |  1073.51 |
nvtx                              (2) |     0.48 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  3.3 |      74.69 |  1073.51 |
pointer                           (2) |     0.12 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       7.09 |  1073.51 |
curand                            (9) |    23.46 |   0.29 |  1.2 |       0.02 |      N/A |   1.25 |  5.3 |    2369.98 |   604.86 |
statistics                        (2) |    16.12 |   0.00 |  0.0 |       0.00 |      N/A |   0.84 |  5.2 |    1625.56 |  1090.46 |
broadcast                         (5) |    38.18 |   0.30 |  0.8 |       0.00 |      N/A |   1.58 |  4.1 |    3502.24 |   675.89 |
cufft                             (8) |    39.44 |   0.30 |  0.8 |     133.23 |      N/A |   1.98 |  5.0 |    4259.85 |  1181.87 |
utils                             (5) |     1.21 |   0.00 |  0.0 |       0.00 |      N/A |   0.07 |  5.9 |     151.43 |   675.89 |
cudadrv\errors                    (5) |     0.16 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      22.49 |   675.89 |
cudadrv\devices                   (8) |     0.37 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      39.87 |  1181.87 |
cudadrv\events                    (5) |     0.15 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      14.25 |   675.89 |
cudadrv\execution                 (8) |     1.47 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  1.1 |      74.55 |  1183.48 |
threading                         (2) |     4.16 |   0.92 | 22.0 |      56.25 |      N/A |   0.11 |  2.6 |     469.27 |  1218.40 |
cudadrv\module                    (8) |     0.59 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  2.1 |      42.89 |  1185.80 |
cudadrv\occupancy                 (2) |     0.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       8.42 |  1218.86 |
cudadrv\profile                   (8) |     0.31 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  4.7 |      49.02 |  1185.80 |
cudadrv\stream                    (2) |     0.30 |   0.00 |  0.2 |       0.00 |      N/A |   0.05 | 17.8 |      29.84 |  1218.96 |
cudadrv\version                   (8) |     0.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       0.08 |  1185.80 |
cudadrv\memory                    (5) |     2.89 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.0 |     211.58 |   678.09 |
cudadrv\context                  (14) |     1.18 |   0.00 |  0.0 |       0.00 |      N/A |   0.04 |  3.3 |     121.68 |   616.20 |
device\array                      (8) |     3.60 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  2.4 |     336.02 |  1196.71 |
texture                           (9) |    25.67 |   0.00 |  0.0 |       0.08 |      N/A |   1.07 |  4.2 |    2776.62 |   734.56 |
gpuarrays/math                    (9) |     2.83 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  3.3 |     314.90 |   734.91 |
gpuarrays/input output            (9) |     1.29 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  1.7 |     127.70 |   734.91 |
cusparse                         (11) |    57.69 |   0.31 |  0.5 |       4.46 |      N/A |   2.54 |  4.4 |    6234.63 |   967.88 |
forwarddiff                       (3) |    59.99 |   0.38 |  0.6 |       0.00 |      N/A |   1.47 |  2.5 |    3475.23 |   719.55 |
cusolver\cusparse                 (2) |    17.83 |   0.01 |  0.0 |       0.19 |      N/A |   0.77 |  4.3 |    1926.42 |  1684.84 |
gpuarrays/interface              (11) |     2.81 |   0.00 |  0.0 |       0.00 |      N/A |   0.10 |  3.5 |     263.06 |   975.27 |
gpuarrays/value constructors      (9) |     7.52 |   0.00 |  0.0 |       0.00 |      N/A |   0.20 |  2.7 |     650.63 |   753.97 |
device\pointer                   (14) |    17.74 |   0.29 |  1.6 |       0.00 |      N/A |   0.82 |  4.6 |    1818.78 |   616.20 |
gpuarrays/conversions             (9) |     4.16 |   0.00 |  0.0 |       0.01 |      N/A |   0.21 |  5.0 |     581.19 |   767.15 |
gpuarrays/indexing                (8) |    19.63 |   0.00 |  0.0 |       0.13 |      N/A |   0.71 |  3.6 |    1936.64 |  1286.61 |
gpuarrays/constructors            (9) |     1.53 |   0.00 |  0.2 |       0.04 |      N/A |   0.03 |  1.7 |      86.16 |   777.34 |
gpuarrays/uniformscaling          (2) |     7.45 |   0.00 |  0.0 |       0.01 |      N/A |   0.20 |  2.7 |     630.02 |  1684.84 |
array                             (4) |    71.59 |   0.30 |  0.4 |       5.20 |      N/A |   2.88 |  4.0 |    6732.22 |   774.98 |
gpuarrays/fft                    (14) |     7.95 |   0.00 |  0.0 |       6.01 |      N/A |   0.33 |  4.2 |     872.91 |  1176.69 |
cublas                            (7) |    76.81 |   0.36 |  0.5 |      11.12 |      N/A |   3.34 |  4.3 |    8808.45 |  1240.90 |
gpuarrays/iterator constructors   (3) |    19.00 |   0.00 |  0.0 |       0.02 |      N/A |   0.73 |  3.8 |    2004.91 |   758.60 |
cusolver                         (10) |    82.26 |   0.40 |  0.5 |    1229.85 |      N/A |   3.64 |  4.4 |    8737.55 |  1571.16 |
gpuarrays/base                    (9) |    16.35 |   0.00 |  0.0 |      17.61 |      N/A |   0.75 |  4.6 |    1867.16 |   856.94 |
gpuarrays/random                  (8) |    20.07 |   0.00 |  0.0 |       0.00 |      N/A |   0.51 |  2.6 |    1659.26 |  1369.74 |
device\intrinsics                 (5) |    67.23 |   0.00 |  0.0 |       0.01 |      N/A |   1.40 |  2.1 |    4967.24 |   859.46 |
gpuarrays/linear algebra         (11) |    57.55 |   0.01 |  0.0 |       1.42 |      N/A |   1.49 |  2.6 |    5148.27 |  1519.13 |
examples                         (12) |   127.03 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  0.1 |     132.75 |   555.79 |
gpuarrays/broadcasting            (4) |    57.83 |   0.00 |  0.0 |       1.19 |      N/A |   1.97 |  3.4 |    7069.32 |   933.43 |
execution                        (13) |         failed at 2020-07-27T14:11:53.175
gpuarrays/mapreduce (old tests)   (7) |    84.84 |   0.01 |  0.0 |     130.20 |      N/A |   2.53 |  3.0 |   10279.09 |  1458.08 |
gpuarrays/mapreduce essentials    (2) |    93.39 |   0.01 |  0.0 |       3.19 |      N/A |   3.03 |  3.2 |   12281.01 |  1899.16 |
gpuarrays/mapreduce derivatives  (14) |   127.34 |   0.01 |  0.0 |       3.06 |      N/A |   3.44 |  2.7 |   15804.64 |  1610.81 |
Worker 6 failed running test codegen:
Some tests did not pass: 17 passed, 1 failed, 0 errored, 0 broken.
codegen: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1035379872 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = 
bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1033654928 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' 
@ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                        ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
Worker 13 failed running test execution:
Some tests did not pass: 78 passed, 2 failed, 0 errored, 0 broken.
execution: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:617
  Expression: code == 1
   Evaluated: 0 == 1
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484
execution: Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:618
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:444
 [3] include(::String) at .\client.jl:439
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:264
 [7] _start() at .\client.jl:484

Test Summary:                     | Pass  Fail  Broken  Total
  Overall                         | 8173     3       2   8178
    apiutils                      |   15                   15
    initialization                |   11                   11
    iterator                      |   30                   30
    memory                        |   10                   10
    codegen                       |   17     1             18
    nnlib                         |    3                    3
    nvml                          |    7                    7
    nvtx                          |                     No tests
    pointer                       |   13                   13
    curand                        |  102                  102
    statistics                    |   12                   12
    broadcast                     |   29                   29
    cufft                         |  151                  151
    utils                         |    5                    5
    cudadrv\errors                |    6                    6
    cudadrv\devices               |    5                    5
    cudadrv\events                |    6                    6
    cudadrv\execution             |   15                   15
    threading                     |                     No tests
    cudadrv\module                |   11                   11
    cudadrv\occupancy             |    1                    1
    cudadrv\profile               |    2                    2
    cudadrv\stream                |    7                    7
    cudadrv\version               |    3                    3
    cudadrv\memory                |   50             1     51
    cudadrv\context               |   12                   12
    device\array                  |   20                   20
    texture                       |   26             1     27
    gpuarrays/math                |    8                    8
    gpuarrays/input output        |    5                    5
    cusparse                      |  468                  468
    forwarddiff                   |  106                  106
    cusolver\cusparse             |   84                   84
    gpuarrays/interface           |    7                    7
    gpuarrays/value constructors  |  120                  120
    device\pointer                |   57                   57
    gpuarrays/conversions         |   72                   72
    gpuarrays/indexing            |  113                  113
    gpuarrays/constructors        |  335                  335
    gpuarrays/uniformscaling      |   56                   56
    array                         |  154                  154
    gpuarrays/fft                 |   12                   12
    cublas                        | 1885                 1885
    gpuarrays/iterator constructors |   24                   24
    cusolver                      | 1493                 1493
    gpuarrays/base                |   38                   38
    gpuarrays/random              |   40                   40
    device\intrinsics             |  265                  265
    gpuarrays/linear algebra      |  393                  393
    examples                      |    7                    7
    gpuarrays/broadcasting        |  155                  155
    execution                     |   78     2             80
    gpuarrays/mapreduce (old tests) |  297                  297
    gpuarrays/mapreduce essentials |  522                  522
    gpuarrays/mapreduce derivatives |  810                  810
    FAILURE

Error in testset codegen:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\codegen.jl:78
  Expression: !(occursin("gpu_report_exception", ir))
   Evaluated: !(occursin("gpu_report_exception", "\n;  @ C:\\Users\\andre\\.julia\\dev\\CUDA\\test\\codegen.jl:76 within `bar'\n; Function Attrs: uwtable\ndefine i64 @julia_bar_19729(i64) local_unnamed_addr #0 {\ntop:\n; ┌ @ operators.jl:294 within `>'\n; │┌ @ int.jl:49 within `<'\n    %1 = icmp sgt i64 %0, 0\n; └└\n; ┌ @ C:\\Users\\andre\\.julia\\dev\\CUDA\\src\\device\\llvm.jl:5 within `assume'\n   %2 = call fastcc %jl_value_t addrspace(10)* @gpu_gc_pool_alloc(i64 16)\n   %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*\n   %4 = bitcast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(10)* addrspace(10)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1035379872 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %4, align 8\n   %5 = bitcast %jl_value_t addrspace(11)* %3 to i8 addrspace(11)*\n   %6 = getelementptr inbounds i8, i8 addrspace(11)* %5, i64 8\n   %7 = 
bitcast i8 addrspace(11)* %6 to %jl_value_t addrspace(10)* addrspace(11)*\n   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 1033654928 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(11)* %7, align 8\n   call void @llvm.assume(i1 %1)\n; └\n; ┌ @ div.jl:227 within `cld'\n; │┌ @ div.jl:270 within `div' 
@ div.jl:215 @ int.jl:230\n    %8 = icmp eq i64 %0, 0\n    br i1 %8, label %fail, label %pass\n\nfail:
                        ; preds = %top\n    call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception to i64))\n    call fastcc void @gpu_signal_exception()\n    call void asm sideeffect \"exit;\", \"\"() #1\n    br label %pass\n\npass:                                             ; preds = %top, %fail\n    %9 = udiv i64 42, %0\n; ││ @ div.jl:271 within `div'\n; ││┌ @ int.jl:54 within `*'\n     %10 = mul i64 %9, %0\n; ││└\n; ││┌ @ operators.jl:193 within `!='\n; │││┌ @ promotion.jl:398 within `=='\n      %11 = icmp ne i64 %10, 42\n; ││└└\n; ││┌ @ int.jl:858 within `+'\n; │││┌ @ int.jl:442 within `rem'\n; ││││┌ @ number.jl:7 within `convert'\n; │││││┌ @ boot.jl:707 within `Int64'\n; ││││││┌ @ boot.jl:634 within `toInt64'\n         %12 = zext i1 %11 to i64\n; │││└└└└\n; │││ @ int.jl:860 within `+' @ int.jl:53\n     %13 = add nuw nsw i64 %9, %12\n; └└└\n  ret i64 %13\n}\n"))
Error in testset execution:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:617
  Expression: code == 1
   Evaluated: 0 == 1
Error in testset execution:
Test Failed at C:\Users\andre\.julia\dev\CUDA\test\execution.jl:618
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "error in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
ERROR: LoadError: Test run finished with errors
in expression starting at C:\Users\andre\.julia\dev\CUDA\test\runtests.jl:475

And I double-checked: I do have the updated code from the PR.

Yes, those are the `llvm.assume` differences I mentioned.
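For reference, the codegen test that keeps failing boils down to something like this. The sketch is reconstructed from the IR in the failure output; `assume` is CUDA.jl's internal hint from `src/device/llvm.jl`, and `CUDA.code_llvm` is the device reflection entry point:

```julia
using CUDA

# `assume` tells LLVM the divisor is positive, which should make the
# DivideError branch of `cld` dead code and drop `gpu_report_exception`.
function bar(i)
    CUDA.assume(i > 0)
    cld(42, i)
end

# The test asserts that "gpu_report_exception" does not appear in this IR;
# on the machine above, the exception branch survives optimization.
CUDA.code_llvm(bar, Tuple{Int})
```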