Passing mutable struct to kernel

Hi,

I am looking for the best way to use structs inside kernels, and I am having a hard time understanding the whole mutable/immutable/isbits/inline thing. My goal is to be able to pass a struct as an argument to a kernel and be able to modify it.

At first my struct looked like this:

mutable struct MyStruct
  a :: CuArray{Float64}
  b :: Int64
end

and when I wanted to modify it inside the kernel I just did:

s = MyStruct(CUDA.rand(1,100), 20)
@cuda threads=10 kernel(s.a, s.b)
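For reference, the kernel I call here is not shown above; a minimal sketch of what it looks like (the name and body are just illustrative, my real kernel is more complex) would be:

```julia
using CUDA

# hypothetical kernel matching the call above: fields passed as separate arguments
function kernel(a, b)
    i = threadIdx().x
    a[i] = b          # mutate the CuArray field; b is a plain Int
    return nothing
end
```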

As my original structure is much more complicated and has lots of fields, I started wondering if I could pass the whole structure to a kernel without unpacking it into separate arguments.
So I created an adaptor and modified the struct definition accordingly:

Adapt.adapt_structure(to, s::MyStruct) = MyStruct(adapt(to, s.a), s.b)

mutable struct MyStruct{T}
  a :: T
  b :: Int64
end

I incorrectly assumed that MyStruct would be a bits type once all its fields are (after conversion of CuArray to CuDeviceArray). But then I found out that a mutable struct can't be a bits type even in such a situation.
This wouldn't be a big problem if I only needed to modify field a (the CuArray) inside the kernel, since that is possible. Then I would just change the structure to immutable:

struct MyStruct{T}
  a :: T
  b :: Int64
end

but I also need to modify the integer value, and assigning to a field of an immutable struct throws an error.
I had some ideas for how to work around it:

  • replace all the immutable types with one-element CuArrays - it works, but from my perspective it looks horrible, and accessing it like s.b[1] requires scalar indexing…
  • use b :: Base.RefValue{Int} in the struct definition (and then s.b[] = 5 to change the value), which works for replacing values in an immutable struct, but Base.RefValue is not a bits type, so it makes the structure non-isbits.
  • use Setfield.jl, which does not seem to work inside a kernel (although it doesn't throw an error).
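To illustrate the first workaround, here is a sketch (the struct and kernel names are made up for the example):

```julia
using CUDA, Adapt

# one-element CuArray standing in for the Int field
struct MyStructArr{A,B}
  a :: A
  b :: B
end
Adapt.@adapt_structure MyStructArr

function set_b!(s)               # hypothetical kernel
    s.b[1] = 5                   # device-side mutation works
    return nothing
end

s = MyStructArr(CUDA.rand(1, 100), CuArray([20]))
@cuda threads=1 set_b!(s)
CUDA.@allowscalar s.b[1]         # reading on the host requires scalar indexing
```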

I read this topic and I think I understand why things are this way with mutable/immutable structs. But from my perspective and current knowledge I think I am missing something, especially in the context of CUDA.jl. For now I don't see any way to pass a struct to a kernel and be able to modify all its fields. The structure passed cannot be mutable, as that would make it non-isbits (which is understandable given the arrangement of its data - and by design we could not change it anyway), so it has to be immutable (to force a proper arrangement of the data in memory). But then all its immutable fields are protected (while at the same time we can modify fields which are CuArrays, because those are mutable). So basically I have to sacrifice the ability to change my integer field just to be able to pass the structure to the kernel at all, which seems unnecessary.
Or maybe there is some solution to this? I would be glad to find out, or at least to understand why it has to be like that :slight_smile:


It really has to be immutable, so you'll have to find a way to conveniently mutate the b field. Why doesn't a Ref work? You have to make the field parametric and be sure to adapt it, but if you do so you should get something GPU-compatible (although you won't be able to mutate the field from the GPU; if you need that, you'll have to use a CuArray field):

julia> cudaconvert(Ref(1))
CUDA.CuRefValue{Int64}(1)

Thank you for your help. I don't know why, but I assumed that Base.RefValue doesn't have a GPU equivalent, so I didn't even try to cudaconvert it; I just checked that Base.RefValue is not isbitstype and abandoned the idea :man_facepalming:

So it works now:

struct MyStruct{A,B}
  a :: A
  b :: B
end

Adapt.@adapt_structure MyStruct

s = MyStruct(CUDA.rand(1,100),Ref(20))

isbits(cudaconvert(s)) # true

@cuda threads=10 kernel(s) # works
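For completeness, a sketch of a kernel that works with this struct (hypothetical body; only s.a can be mutated from the device, s.b[] is effectively read-only there):

```julia
function kernel(s)            # hypothetical kernel for the struct above
    i = threadIdx().x
    s.a[i] = s.b[]            # read the Ref's value, write into the CuArray
    return nothing
end
```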

In my case I can live without being able to modify s.b inside the kernel, but I still think it should be possible somehow, and one-element CuArrays do not seem like an elegant way. Could you maybe explain a little more why it is like that? To me it just seems like it would be a great feature to have. I am wondering whether it is only me missing something and mutable structs inside kernels are not really necessary, or whether this is the effect of the convention in Julia of leaving all allocation decisions for mutable structs to the compiler, with the result that users cannot force a bitstype layout?

I also have a follow-up question. I wanted a little more descriptive definition of MyStruct and I was wondering if it is possible to avoid using parametric types. As I mentioned before, I have many fields and also many different types in the original struct, so I need to use a lot of parametric types in the declaration…
Since CuArray and CuDeviceArray are of course subtypes of AbstractArray, and Base.RefValue and CUDA.CuRefValue of Ref, I tried something like this:

struct MyStruct
  a :: AbstractArray{Float32}
  b :: Ref{Int64}
end

Adapt.@adapt_structure MyStruct

which at first seemed to work fine:

s = MyStruct(CUDA.rand(1,100),Ref(20))  # MyStruct(Float32[0.024462976 0.6974687 … 0.39147946 0.9107673], Base.RefValue{Int64}(20))
cudaconvert(s)                          # MyStruct(1×100 device array at 0x0000000203a00000, CUDA.CuRefValue{Int64}(20))
typeof(cudaconvert(s).a)                # CuDeviceMatrix{Float32, 1} (alias for CuDeviceArray{Float32, 2, 1}) 
typeof(cudaconvert(s).b)                # CUDA.CuRefValue{Int64}   
isbits(cudaconvert(s).a)                # true                                
isbits(cudaconvert(s).b)                # true                                                                          

but:

julia> isbits(cudaconvert(s))                                                                                           
false   

and as expected it can't be passed to a kernel; the error says the arguments are of types AbstractArray and Ref, so not isbits.
I was also thinking about something like a :: Union{CuArray{Float32}, CuDeviceArray{Float32}}, but that does not work either.
Is there any solution to this, or do I have to stick to parametric types?

A RefValue is by definition a value boxed in CPU memory, so it's not possible to modify it from the GPU. Well, it is possible to expose the CPU memory to the GPU by registering the page, but that's too magical for what it's worth. The functionality is only supported because broadcast uses Ref to indicate which arguments to treat as scalars; otherwise I'd just have CUDA.jl refuse to pass Ref values (again, by definition CPU memory) to GPU kernels.

That doesn’t work because your types aren’t fully specified; CuArray has more typevars. You need to specify them all if you want the type to be concrete.

IIRC there are packages that avoid having to specify type parameters manually, by wrapping the definition in a macro, but I can't seem to find them in my history.

So it should work when fully specified?

I did:

struct MyStruct
  a :: Union{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuDeviceArray{Float32, 1, 1}}
end

Adapt.@adapt_structure MyStruct

s = MyStruct(CUDA.rand(100))

function kernel(f)                                                                                         
  i = threadIdx().x                                                                                        
  f.a[i] = 5                                                                                               
  return nothing                                                                                           
end   

and it crashes my REPL (sic!):

julia> @cuda threads=100 kernel(s)      
Assertion failed: isSpecialPtr(V->getType()), file /cygdrive/c/buildbot/worker/package_win64/build/src/llvm-late-g
c-lowering.cpp, line 802                                                                                          
                                                                                                                  
signal (22): SIGABRT                                                                                              
in expression starting at REPL[11]:1                                                                              
crt_sig_handler at /cygdrive/c/buildbot/worker/package_win64/build/src\signals-win.c:93                           
raise at C:\Windows\System32\msvcrt.dll (unknown line)                                                            
abort at C:\Windows\System32\msvcrt.dll (unknown line)                                                            
assert at C:\Windows\System32\msvcrt.dll (unknown line)                                                           
Number at /cygdrive/c/buildbot/worker/package_win64/build/src\llvm-late-gc-lowering.cpp:802                       
GetPHIRefinements at /cygdrive/c/buildbot/worker/package_win64/build/src\llvm-late-gc-lowering.cpp:1228           
LocalScan at /cygdrive/c/buildbot/worker/package_win64/build/src\llvm-late-gc-lowering.cpp:1623                   
runOnFunction at /cygdrive/c/buildbot/worker/package_win64/build/src\llvm-late-gc-lowering.cpp:2661               
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\bin\L
LVM.dll (unknown line)                                                                                            
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\bin\LLVM.
dll (unknown line)                                                                                                
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\bin\LLVM.
dll (unknown line)                                                                                                
LLVMRunPassManager at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\bin\LLVM.dll (unknown line)             
LLVMRunPassManager at C:\Users\lodygaw\.julia\packages\LLVM\wnejv\lib\11\libLLVM_h.jl:4437 [inlined]              
run! at C:\Users\lodygaw\.julia\packages\LLVM\wnejv\src\passmanager.jl:39 [inlined]                               
#80 at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\optim.jl:210                                        
#ModulePassManager#51 at C:\Users\lodygaw\.julia\packages\LLVM\wnejv\src\passmanager.jl:33                        
unknown function (ip: 000000000227fcc1)                                                                           
ModulePassManager at C:\Users\lodygaw\.julia\packages\LLVM\wnejv\src\passmanager.jl:31                            
optimize! at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\optim.jl:180                                  
macro expansion at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\driver.jl:266 [inlined]                 
macro expansion at C:\Users\lodygaw\.julia\packages\TimerOutputs\SSeq1\src\TimerOutput.jl:252 [inlined]           
macro expansion at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\driver.jl:265 [inlined]                 
macro expansion at C:\Users\lodygaw\.julia\packages\TimerOutputs\SSeq1\src\TimerOutput.jl:252 [inlined]           
macro expansion at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\driver.jl:263 [inlined]                 
#emit_llvm#109 at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\utils.jl:62                              
unknown function (ip: 000000005595b19b)                                                                           
emit_llvm at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\utils.jl:60 [inlined]                         
cufunction_compile at C:\Users\lodygaw\.julia\packages\CUDA\YpW0k\src\compiler\execution.jl:325                   
cached_compilation at C:\Users\lodygaw\.julia\packages\GPUCompiler\AJD5L\src\cache.jl:89                          
#cufunction#207 at C:\Users\lodygaw\.julia\packages\CUDA\YpW0k\src\compiler\execution.jl:297                      
cufunction at C:\Users\lodygaw\.julia\packages\CUDA\YpW0k\src\compiler\execution.jl:291                           
unknown function (ip: 00000000559beaa2)                                                                           
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]                            
do_call at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:115                                  
eval_value at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:204                               
eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:435                                
jl_interpret_toplevel_thunk at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:670              
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:877                       
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:825                       
eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:525                                
eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:490                                
jl_interpret_toplevel_thunk at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:670              
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:877                       
jl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:886 [inlined]                  
jl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:929                         
eval at .\boot.jl:360 [inlined]                                                                                   
eval_user_input at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\REPL\src\REPL.jl:139        
repl_backend_loop at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\REPL\src\REPL.jl:200      
start_repl_backend at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\REPL\src\REPL.jl:185     
#run_repl#42 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\REPL\src\REPL.jl:317           
run_repl at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\REPL\src\REPL.jl:305               
#875 at .\client.jl:387                                                                                           
jfptr_YY.875_27301.clone_1 at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\lib\julia\sys.dll (unknown line)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]                            
jl_f__call_latest at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:714                           
#invokelatest#2 at .\essentials.jl:708 [inlined]                                                                  
invokelatest at .\essentials.jl:706 [inlined]                                                                     
run_main_repl at .\client.jl:372                                                                                  
exec_options at .\client.jl:302                                                                                   
_start at .\client.jl:485                                                                                         
jfptr__start_28854.clone_1 at C:\Users\lodygaw\AppData\Local\Programs\Julia-1.6.4\lib\julia\sys.dll (unknown line)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]                            
true_main at /cygdrive/c/buildbot/worker/package_win64/build/src\jlapi.c:560                                      
repl_entrypoint at /cygdrive/c/buildbot/worker/package_win64/build/src\jlapi.c:702                                
mainCRTStartup at /cygdrive/c/buildbot/worker/package_win64/build/cli\loader_exe.c:51                             
BaseThreadInitThunk at C:\Windows\System32\KERNEL32.DLL (unknown line)                                            
RtlUserThreadStart at C:\Windows\SYSTEM32\ntdll.dll (unknown line)                                                
Allocations: 60802736 (Pool: 60784167; Big: 18569); GC: 62          

Of course struct MyStruct{T} a::T end works as expected.

Sorry for the false hope; that won’t work. Even a fully-typed CuArray isn’t a bitstype (it needs finalizers, so is marked mutable). I also haven’t looked into how code is generated for isbits union structs, but since isbitstype still returns false for those the GPU compiler won’t allow them. The segfault is unexpected though.

That’s a pity!

If somebody were really determined to avoid parametric types, or to have specifically typed fields, I was able to find a workaround by defining two structures. Although the quality of this solution should be evaluated by someone more knowledgeable than me.

struct MyStruct
  a :: CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
end

struct MyDeviceStruct
  a :: CuDeviceArray{Float32, 1, 1}
end

Adapt.adapt_structure(to, s::MyStruct) = MyDeviceStruct(adapt(to, s.a))

s = MyStruct(CUDA.rand(100))

isbits(cudaconvert(s)) # true

So, to sum up for other people who might be interested:
As of now, if you want to pass a struct to a kernel:

  • it has to be immutable (you can use Base.RefValue to be able to change the values of immutable fields (like Int64) on the CPU),
  • its field types have to be fully parametric (or you have to use something like the two-struct workaround above),
  • if you want to change the value of a field of your struct inside the kernel, the field has to be a CuArray (mutating a CUDA.CuRefValue won't work).
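A minimal end-to-end sketch putting these points together (the kernel name is made up for the example; assumes a CUDA-capable device):

```julia
using CUDA, Adapt

struct MyStruct{A,B}          # immutable, fully parametric
  a :: A                      # CuArray - mutable from the kernel
  b :: B                      # Ref - mutable on the CPU only
end
Adapt.@adapt_structure MyStruct

function fill_kernel!(s)      # hypothetical kernel
    i = threadIdx().x
    s.a[i] = s.b[]            # write the CuArray field, read the Ref field
    return nothing
end

s = MyStruct(CUDA.zeros(Float32, 100), Ref(5.0f0))
@cuda threads=100 fill_kernel!(s)
s.b[] = 7                     # CPU-side mutation of the Ref field works
```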

Thank you @maleadt! I really appreciate your help. You guys are doing awesome work with CUDA.jl, day after day it gets only better and better.


How would you do this if you had multiple fields in MyStruct, for example b of the same type?

And if you have for example a field c with a third type?

Sorry for reviving this thread, but this is exactly what I want to do, I just can't figure out how to expand it.

EDIT: As usual just after asking, it became clear to me :slight_smile:

using CUDA
using StaticArrays
using Adapt

struct MyStruct
    a :: CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
    b :: CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
    c :: CuArray{SVector{3,Float32}, 1, CUDA.Mem.DeviceBuffer}
end
  
struct MyDeviceStruct
  a :: CuDeviceArray{Float32, 1, 1}
  b :: CuDeviceArray{Float32, 1, 1}
  c :: CuDeviceArray{SVector{3,Float32}, 1, 1}
end
  
Adapt.adapt_structure(to, s::MyStruct) = MyDeviceStruct(adapt(to, s.a),adapt(to, s.b),adapt(to, s.c))
  
s = MyStruct(CUDA.rand(100),CUDA.rand(100),CUDA.rand(SVector{3,Float32},100))
  
isbits(cudaconvert(s)) # true
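And a hypothetical kernel using all three fields might look like this (just a sketch to show the access patterns):

```julia
function kernel(s)                     # hypothetical kernel
    i = threadIdx().x
    s.a[i] = s.b[i] + s.c[i][1]        # SVector elements are read by index
    return nothing
end

@cuda threads=100 kernel(s)
```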

Kind regards
