That’s weird. What happens if you do CUDAnative.device!(0) first? What’s the output of CUDAdrv.CuDevice(0) (after importing the relevant packages of course)?
Thanks for your immediate reply @maleadt! Calling CUDAnative.device!(0) gives the same error:
julia> using CUDAnative, CUDAdrv, CuArrays
julia> CUDAnative.device!(0)
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
Stacktrace:
[1] macro expansion at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/base.jl:147 [inlined]
[2] CuContext(::CuDevice, ::CUDAdrv.CUctx_flags) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:112
[3] Type at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:111 [inlined]
[4] device!(::CuDevice) at /users/omlins/.julia/packages/CUDAnative/AGfq2/src/init.jl:68
[5] device!(::Int64) at /users/omlins/.julia/packages/CUDAnative/AGfq2/src/init.jl:75
[6] top-level scope at none:0
However, for some reason, calling CUDAdrv.CuDevice(0) solves the problem for the call to CuArrays.CuArray() (but not for the call to CUDAnative.device!()):
julia> using CUDAnative, CUDAdrv, CuArrays
julia> CUDAdrv.CuDevice(0)
CuDevice(0): Tesla P100-PCIE-16GB
julia> CUDAnative.device!(0)
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
Stacktrace:
[1] macro expansion at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/base.jl:147 [inlined]
[2] CuContext(::CuDevice, ::CUDAdrv.CUctx_flags) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:112
[3] Type at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:111 [inlined]
[4] device!(::CuDevice) at /users/omlins/.julia/packages/CUDAnative/AGfq2/src/init.jl:68
[5] device!(::Int64) at /users/omlins/.julia/packages/CUDAnative/AGfq2/src/init.jl:75
[6] top-level scope at none:0
julia> B = ones(10_000) ./ 2;
julia> cuB = CuArrays.CuArray(B);
julia>
Why do you think that? It’s the initialization itself that is failing, so I’m not sure what’s up.
Well, since everything seems to work as it should after a first failing call to CuArray(), I thought that the required initialization is triggered and succeeds, but somehow happens too late. As I don't know how it is implemented, though, I don't think I can be of much help in reasoning about why the error occurs. I can, however, run small tests that you suggest in order to track down the source of the error.
Here is the test you asked for:
julia> using CUDAdrv
julia> CuContext(CuDevice(0))
CuContext(Ptr{Nothing} @0x0000000000e75c90, true, true)
julia> CuContext(CuDevice(0))
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
Stacktrace:
[1] macro expansion at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/base.jl:147 [inlined]
[2] CuContext(::CuDevice, ::CUDAdrv.CUctx_flags) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:112
[3] CuContext(::CuDevice) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:111
[4] top-level scope at none:0
julia> CuContext(CuDevice(0))
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
Stacktrace:
[1] macro expansion at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/base.jl:147 [inlined]
[2] CuContext(::CuDevice, ::CUDAdrv.CUctx_flags) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:112
[3] CuContext(::CuDevice) at /users/omlins/.julia/packages/CUDAdrv/LC5XS/src/context.jl:111
[4] top-level scope at none:0
That’s really weird; something seems off with your GPU/set-up. Could you try running that code under cuda-memcheck and see if the error comes from another, hidden API call (context creation can return error codes from other API calls)? It really should not behave like that, and if it does, I’m not sure I can do anything about it.
Nothing new there; the memcheck trace corresponds to our stack trace.
I see your libcuda is in /opt/cray, are you by any chance using some exotic system? Virtualized GPU? Are you sure libcuda matches the loaded kernel driver? Normally libcuda is in /lib because it is tied very strongly to your kernel and its modules.
Again, this API behavior is highly unusual and not documented, so I’m suspecting a deeper issue here.
I see your libcuda is in /opt/cray, are you by any chance using some exotic system? Virtualized GPU?
It is a CRAY system; the programming environment including CUDA is provided by CRAY. No, the GPUs are not virtualized.
Are you sure libcuda matches the loaded kernel driver?
It looks right: libcuda.so links against libnvidia-fatbinaryloader.so.396.44, where 396.44 matches the CUDA driver version (see the nvidia-smi output in the topic description above):
Here is a little correction to my post from Dec 14, 12:28 above:
I noted that
calling CUDAdrv.CuDevice(0) solves the problem for the call to CuArrays.CuArray() (but not for the call to CUDAnative.device!())
Unfortunately this is not true: it was the preceding (failing!) call to CUDAnative.device!(0) that made the call to CuArrays.CuArray() succeed. As you can see, CuArrays.CuArray() fails if only CUDAdrv.CuDevice(0) is called first:
julia> using CUDAnative, CUDAdrv, CuArrays
julia> CUDAdrv.CuDevice(0)
CuDevice(0): Tesla P100-PCIE-16GB
julia> B = ones(10_000) ./ 2;
julia> cuB = CuArrays.CuArray(B);
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
(...)
However, CuArrays.CuArray() succeeds if CUDAnative.device!(0) is called first, even though that call itself fails persistently:
julia> using CUDAnative, CUDAdrv, CuArrays
julia> CUDAnative.device!(0)
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
(...)
julia> B = ones(10_000) ./ 2;
julia> cuB = CuArrays.CuArray(B);
julia> CUDAnative.device!(0)
ERROR: CUDA error: invalid device ordinal (code #101, ERROR_INVALID_DEVICE)
(...)
To sum up, so far there are two ways to make a call to CuArrays.CuArray() succeed:
call first CUDAnative.device!(0) once
call first CuArrays.CuArray() itself once
Unfortunately, neither of these is a good workaround, as both calls raise an error, which makes a script abort. @maleadt: would you have an idea for a different workaround that would not cause an error?
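For now, the only stopgap I can think of is to swallow the error from that first call so the script keeps running. Here is a sketch of that idea (this only assumes the behavior observed above, namely that the first CUDAnative.device!(0) call fails but later CuArrays.CuArray() calls then succeed; it requires a GPU to actually run):

```julia
using CUDAnative, CuArrays

# Workaround sketch: on this system the first device!(0) call is expected to
# throw ERROR_INVALID_DEVICE, but it apparently still triggers the
# initialization that later CuArray() calls need, so ignore the error once.
try
    CUDAnative.device!(0)
catch err
    @warn "Ignoring expected first-call failure of device!(0)" exception=err
end

B = ones(10_000) ./ 2
cuB = CuArrays.CuArray(B)   # succeeds after the failed device! call, as observed above
```

Of course this would only hide the symptom, not fix the cause.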
Just a short additional remark: I have been experimenting with CUDAnative lately and have not run into any problems with that so far. So I believe that the problem is specific to CuArrays.
After looking into this a little, I’m thinking you guys are using your GPUs in process-exclusive compute mode, where only a single context can be created per device. Please verify with:
That means the second call to CuContext(CuDevice(0)) is expected to fail, so nothing is wrong there.
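For reference, the compute mode can also be queried from within Julia by shelling out to nvidia-smi, e.g. with something like the following (the exact nvidia-smi invocation here is my suggestion, not a command from this thread, and it assumes nvidia-smi is on the PATH of the compute node):

```julia
# Query each GPU's compute mode; "Exclusive_Process" means only a single
# context can be created per device, which would explain why the second
# CuContext(CuDevice(0)) call fails.
run(`nvidia-smi --query-gpu=name,compute_mode --format=csv`)
```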
However, going back to the original problem of @samo, CuArrays should only ever create a single context, and cache it afterwards. I can’t seem to reproduce the issue, although I found a redundant context initialization that might cause some issues. Could you guys try https://github.com/JuliaGPU/CUDAnative.jl/pull/312?
EDIT: if that doesn’t work, please re-run failing code with the following patch applied:
CUDAdrv$ git diff
diff --git a/src/context.jl b/src/context.jl
index 3d7be17..a8f69ef 100644
--- a/src/context.jl
+++ b/src/context.jl
@@ -108,6 +108,9 @@ Base.deepcopy_internal(::CuContext, ::IdDict) =
error("CuContext cannot be copied")
function CuContext(dev::CuDevice, flags::CUctx_flags=SCHED_AUTO)
+ @info "Creating a CuContext" dev flags
+ Base.show_backtrace(stderr, stacktrace())
+ println(stderr)
handle_ref = Ref{CuContext_t}()
@apicall(:cuCtxCreate, (Ptr{CuContext_t}, Cuint, Cint),
handle_ref, flags, dev)
You can pkg> add CUDAnative#tb/version_init, but that branch might be incompatible with the rest of your set-up (specifically, if you are working with Flux); you’d need to upgrade CuArrays to master as well: pkg> add CuArrays#master. Alternatively, just apply the fix; it’s a single-line change.
@maleadt: Sorry for the basic Julia-newcomer question: to apply the single-line change, do I modify .julia/packages/CUDAnative/AGfq2/src/init.jl as in your PR and then run Pkg.build("CUDAnative")?