I am trying to pass a vector of Union type to a kernel. It seems to work if I limit the number of types in the Union to 3, but with 4 it fails with unsupported call to an unknown function (call to gpu_gc_pool_alloc). This is the code to reproduce:
using CUDA
struct Box{T<:AbstractFloat}
x::T
y::T
z::T
end
struct Sphere{T<:AbstractFloat}
r::T
end
struct Tube{T<:AbstractFloat}
r::T
z::T
end
struct Cone{T<:AbstractFloat}
r::T
z::T
end
volume(b::Box{T}) where T = b.x * b.y * b.z
volume(s::Sphere{T}) where T = T(4)/3 * π * s.r^3
volume(t::Tube{T}) where T = T(π) * t.r^2 * t.z
volume(c::Cone{T}) where T = T(1)/3 * π * c.r^2 * c.z
const T = Float32
shapes = Vector{Union{Box{T}, Sphere{T}, Tube{T}, Cone{T}}}()
#shapes = Vector{Union{Box{T}, Sphere{T}, Tube{T}}}()
push!(shapes, Box{T}(1,2,3))
push!(shapes, Sphere{T}(1))
function kernel(shapes)
for s in shapes
@cuprintln volume(s)
end
return nothing
end
cu_shapes = CuVector(shapes)
@cuda kernel(cu_shapes)
Is there a built-in limitation? Is there a way to achieve what I would like to do with more than 3 shape types?
Many thanks in advance.
No, I am not using abstract types. I am using a Union as suggested, and it worked nicely for at least 6-7 types on the CPU. Now it fails with just 4 when moving to the GPU.
Union splitting only works for up to 3 element types; this is probably controlled by https://github.com/JuliaLang/julia/blob/master/base/compiler/typelimits.jl. You can test this out with f(x) = x + one(eltype(x)) and code_warntype(f, Tuple{Union{Int}}) (or code_llvm); try adding more types to that Union and observe the generated code change at 4 element types.
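A minimal sketch of that experiment (the exact cutoff may differ between Julia versions, so treat the boundary as indicative):

f(x) = x + one(eltype(x))
# Small Union: the output shows type-stable, union-split branches.
code_warntype(f, Tuple{Union{Int32, Int64, Float32}})
# Larger Union: at some size the output degrades to dynamic dispatch.
code_warntype(f, Tuple{Union{Int32, Int64, Float32, Float64, UInt64}})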
On the GPU, that kind of dynamic code is badly supported. It actually should work, since we have an implementation of that gc_pool_alloc intrinsic (so you're probably using old packages which had a bug in lowering it), but you want to avoid it anyway since allocating in kernels is extremely expensive.
Thanks very much for the explanation. Strange, I am using CUDA v3.8.0 and Julia 1.7.1.
In this case, is the only efficient solution to have separate vectors for each shape type?
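Something like this sketch, with one concretely-typed vector per shape (hypothetical names, just to illustrate what I mean):

boxes   = CuVector([Box{T}(1, 2, 3)])
spheres = CuVector([Sphere{T}(1)])
# ... one concretely-typed CuVector per shape, each processed by its own
# type-stable kernel (or by one kernel taking them as separate arguments).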
Someone correct me if I'm wrong (probably the case, I don't know GPU stuff that well). I was under the impression that dispatching a method on a splittable Union is a branch to method calls on each type, and that branch would cause warp divergence on the GPU because GPU threads have to do the same instructions in parallel. Consequently, an N-sized Union element will have N-1 discarded method calls, which is less efficient with more expensive methods or larger N (though the N in Union-splitting is already limited to 3, maybe 4 in some cases).
You are in principle right, but this is not exactly the case. "Thread divergence" is a dynamic issue; here the issue is the impossibility of providing the GPU with an array of "polymorphic types". In my case the arguments to the kernel are an array of uniform "points" or "rays", together with a static description of a scene (made of different shapes). In principle the processing for each point/ray could be completely in sync, using/interpreting each shape in the array. So the two issues are somehow uncorrelated.
Right, what I commented isn't the issue you're running into here; I'm speculating about an inefficiency under the Union-splitting limit where the code still works. The main issue is, as mentioned earlier, that exceeding the Union-splitting limit will be compiled to a dynamic dispatch, so the kernel doesn't work.
There is a way to force static dispatch on the CPU beyond the Union-splitting limit: a conditional check whether a type-unstable variable is a particular concrete type will cause static dispatch on that variable in the branch. ManualDispatch.jl's @unionsplit macro is an easy way to convert a function call to an if-elseif statement checking its arguments for some concrete types before running the same function call in each branch (a hand-rolled version is sketched below). Bear in mind: without assuming anything about the distribution of concrete types in order to optimize the order of type checks, checking N types can take longer than dynamic dispatch on average when N gets large enough.
Thing is, some type-unstable variables must exist for the type check (in your case, s or shapes[i]), and I'm not sure if that's okay for CUDA.jl kernels.
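For illustration, a hand-rolled CPU version of that pattern, using the shape types from this thread (roughly what such a macro expands to; the function name is mine):

function total_volume(shapes)
    vtot = 0.0f0
    for s in shapes
        # Each isa check narrows s to a concrete type inside its branch,
        # so the volume(s) call there is statically dispatched.
        if s isa Box{Float32}
            vtot += volume(s)
        elseif s isa Sphere{Float32}
            vtot += volume(s)
        elseif s isa Tube{Float32}
            vtot += volume(s)
        elseif s isa Cone{Float32}
            vtot += volume(s)
        end
    end
    return vtot
end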
Just to summarize and see if I'm getting this: fundamentally, because the kernel has to do the same thing in each GPU thread, if you pass a vector of union types the "same thing" is just a dynamic if type == check, N times. Which can and should be a fallback in any case (or can be manually enabled, like scalar indexing).
But the performance at the end of the day will take a hit, especially if N is bigger. And in this case, Julia Base has a heuristic that stops union splitting, which currently causes this code to not work because the fallback is not working?
btw, @maleadt, I have tested various combinations, including nightly Julia and #master of CUDA.jl; they all error with:
InvalidIRError: compiling kernel kernel(CuDeviceVector{Union{Box{Float32}, Cone{Float32}, Sphere{Float32}, Tube{Float32}}, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_gc_pool_alloc)
For the issue from this thread, it is possible to overcome the Julia codegen limitation by doing manual union splitting:
function kernel(::Type{T}, shapes) where {T}
for s in shapes
if s isa Box{T}
@cuprintln "Box: $(volume(s))"
elseif s isa Sphere{T}
@cuprintln "Sphere: $(volume(s))"
elseif s isa Tube{T}
@cuprintln "Tube: $(volume(s))"
elseif s isa Cone{T}
@cuprintln "Cone: $(volume(s))"
else
@cuprintln "Unknown shape"
end
end
return nothing
end
@cuda kernel(T, cu_shapes)
However, (1) that seems to expose a bug in CUDA.jl (Manual union splitting past limits leads to incorrect codegen · Issue #1385 · JuliaGPU/CUDA.jl · GitHub) and (2) the code generated by Julia still performs a dynamic allocation (which is very bad, since we don't have an on-device GC so can't free this memory). I'm not entirely sure where that allocation comes from; I could look into it, but it seems like a programming pattern that doesn't heavily rely on Unions (which are hard on the compiler) would be a better choice.
I noticed that if you index the shapes array instead of iterating it (sketched below), the manual union splitting case doesn't allocate on the CPU, but does with CUDA.jl. That means this pattern could be viable to use in kernels, so I've opened an issue about it (Generated code issue with manual union splitting · Issue #1386 · JuliaGPU/CUDA.jl · GitHub), but the fact that simple changes like iterating instead of indexing cause allocations on the CPU too shows that this is a fragile approach, and it may be better to use something that's easier on the compiler.
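For reference, a sketch of the indexed variant I mean (the kernel name is mine; the body matches the manual splitting kernel above):

function kernel_indexed(::Type{T}, shapes) where {T}
    for i in eachindex(shapes)
        s = shapes[i]  # indexing instead of iterating
        if s isa Box{T}
            @cuprintln "Box: $(volume(s))"
        elseif s isa Sphere{T}
            @cuprintln "Sphere: $(volume(s))"
        elseif s isa Tube{T}
            @cuprintln "Tube: $(volume(s))"
        elseif s isa Cone{T}
            @cuprintln "Cone: $(volume(s))"
        end
    end
    return nothing
end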
Thank you for your interest and the efforts to improve CUDA.jl. I am new to Julia. I am just exploring the limitations that we could encounter when migrating our High Energy Physics code to this language. One example is geometry modelling software based on a hierarchical solid model. I understand the concerns about the performance, but my current goal is to see the feasibility of moving the code and how easy it would be to use a single language (Julia) for GPU and CPU instead of the tandem C++/CUDA.
Concerning your manual union splitting, there is something I do not understand. At the end, for each type you always call volume(s), in which s is always an element of the vector of Unions. How is this different from just calling volume(s) without the if? How does the compiler know that you have checked the type before calling it? Probably this shows my ignorance, but in C/C++ you would instruct the compiler with a type cast like volume(static_cast<Type>(s)) or similar.
I don't know C/C++, but IIRC static dispatch vs. virtual/dynamic/runtime dispatch are separate mechanisms, and you have to manually implement the latter. In Julia, the compiler takes the input arguments' concrete types, infers types of variables throughout the method, and handles the dispatch of inner method calls. If it can infer variables' concrete types at compile time, then inner method calls can be statically dispatched; when it can't, it does dynamic dispatch (a relatively slow table lookup). There is an in-between optimization called Union-splitting: if the compiler can figure out that a variable must be one of 3-4 concrete types, it will generate something like an if-statement checking for the types; within each branch, the checked variable must be a specific concrete type, so the compiler can do static dispatch again.
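To make that concrete, an illustrative sketch (the names are mine, not from anyone's code):

g(x::Int64) = x + 1
g(x::Float64) = 2x
xs = Union{Int64, Float64}[1, 2.5]
# The element type of xs is a small Union, so the compiler can union-split:
# each g(x) below compiles to an isa branch with a static dispatch inside.
total(xs) = sum(g, xs)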
That example kernel writes such an if-statement manually because the compiler won't go beyond the 3-4 limit. As I mentioned before, this is partially because an if-statement over too many types can become slower than even dynamic dispatch. It's also because compiling so many branches generates lots of code and takes time. Long compile times and high memory usage aren't something you want to happen automatically.