Global optimization using `IntervalArithmetic.jl` and `ForwardDiff.jl` on the GPU with `CUDA.jl`

I'm trying to solve a global optimization problem concerning the parameters of the "Two Compartment Model". Given an initial parameter search space that is split into smaller boxes of a predefined maximum width, I'm interested in those boxes that contain zero in the partial derivatives of the error function, which in my case is a simple sum of squared errors. The hull of those boxes – which should be significantly smaller than the initial search space – should then contain the optimal parameters for the problem, given measurement data and a known exact solution for y_2 only.

I already implemented this problem in C++, but now I wanted to try it in Julia, because the search for suitable boxes (the step that should be executed on the GPU) depends on automatic differentiation, which is not that easy to do in CUDA when working with intervals, at least to my knowledge.

The combination of IntervalArithmetic.jl, ForwardDiff.jl and CUDA.jl looked very promising, and in theory it should have been as easy as writing code that works on the CPU, wrapping my data in a CuArray, and letting Julia handle the rest. But I must have been mistaken, because I can't get it to work. This is the code:

#!/usr/bin/env julia

using ForwardDiff
using IntervalArithmetic
using CUDA
using CRlibm

setrounding(Interval, :accurate)

SPLIT_MAX = 0.05
TOL = 1e-10

# measurement data for y2
y2m = [0.0532, 0.0478, 0.0410, 0.0328,
       0.0323, 0.0148, 0.0216, 0.0127,
       0.0099, 0.0081, 0.0065, 0.0043,
       0.0013, 0.0015, 0.0060, 0.0126]

# y2 of the ODE
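# (analytic solution for the second compartment; l1 and l2 are the two decay rates)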
function y2(t, p1, p2, p3)
    D  = sqrt((p1-p2+p3)^2 + 4*p2*p3)
    a  = p3 / D
    ps = p1 + p2 + p3
    l1 = 0.5 * (ps-D)
    l2 = 0.5 * (ps+D)
    return a * (exp(-l1*t)-exp(-l2*t))
end

# same as above, but with a box as argument
Ob(p) = sum((y2(t, p[1], p[2], p[3]) - y2m[t])^2 for t in 1:16)

# typecast from box to vector so that gradient() works
Vector(B::IntervalBox) = [X for X in B]
dO = p -> IntervalBox(ForwardDiff.gradient(Ob, Vector(p)))

# monotonicity test
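# (a box is only kept if zero lies in its gradient box, a necessary condition for an interior optimum)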
monotest(p) = contains_zero(dO(p))

# bisect initial search space until boxes are no wider than SPLIT_MAX
function split(domain)
    to_split = [domain]
    splits = []

    while !isempty(to_split)
        current_box = pop!(to_split)
        if diam(current_box) < SPLIT_MAX
            push!(splits, current_box)
        else
            append!(to_split, bisect(current_box))
        end
    end

    return splits
end

# search space
P1 = 0.01..2.0
P2 = 0.05..3.0
P3 = 0.05..3.0

P = IntervalBox(P1, P2, P3)

println("Search space:\n\t", P)

splits = split(P)

println("Searching in ", length(splits), " boxes ",
        "with a maximum width of ", SPLIT_MAX, " ...")

# find boxes that contain possible optimum
# XXX on the CPU, this works ...
# mask = monotest.(splits)
# ... but this does not work on the GPU
d_splits = CuArray(splits)
mask = monotest.(d_splits)

# only keep those boxes
interesting = splits[mask]

# get the error for every suitable box
errors = Ob.(interesting)

# get boxes with the smallest acceptable error
best_idx = findall(x->abs(x.lo*mid(x))<=TOL, errors)
best_boxes = [(errors[i], interesting[i]) for i in best_idx]

println("Best boxes:")
for best in best_boxes
    println(best[1], best[2])
end

println("Hull:\n\t", reduce(∪, [b[2] for b in best_boxes]))

… which results in the following error:

ERROR: LoadError: CuArray only supports element types that are stored inline
Stacktrace:
[1] CuArray{Any, 1, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64})
@ CUDA ~/.julia/packages/CUDA/0IDh2/src/array.jl:36
[2] CuArray
@ ~/.julia/packages/CUDA/0IDh2/src/array.jl:287 [inlined]
[3] CuArray
@ ~/.julia/packages/CUDA/0IDh2/src/array.jl:292 [inlined]
[4] CuArray(A::Base.Vector{Any})
@ CUDA ~/.julia/packages/CUDA/0IDh2/src/array.jl:301
[5] top-level scope
@ ~/twocomp.jl:79
in expression starting at ~/twocomp.jl:79

It appears that CUDA doesn't like the list of IntervalBoxes. I already tried splitting my single list of 3-d boxes into three lists of 1-d boxes and changing monotest.(d_splits) into broadcast(monotest, d_s1, d_s2, d_s3) (where d_sn are those lists), but still without success. Instead of the CuArray error I then get errors about type instability.

What am I missing here?

My Julia version is 1.6.5 (that’s what Fedora 34 Server Edition provides atm) with CUDA v3.6.2, IntervalArithmetic v0.20.2 and ForwardDiff v0.10.24.

You're creating a fully untyped array here (splits = [] gives a Vector{Any}), and that's GPU incompatible. At first glance, IntervalBox itself should be GPU compatible, as it's a plain isbits struct.
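For example, something along these lines (untested, but it just mirrors your split function) gives both work lists a concrete element type and avoids the Vector{Any}:

# give the work lists a concrete element type instead of Vector{Any}
function split(domain::T) where {T<:IntervalBox}
    to_split = T[domain]
    splits   = T[]

    while !isempty(to_split)
        current_box = pop!(to_split)
        if diam(current_box) < SPLIT_MAX
            push!(splits, current_box)
        else
            append!(to_split, bisect(current_box))
        end
    end

    return splits
end

# sanity check: an isbits element type is what CuArray needs for inline storage
@assert isbitstype(IntervalBox{3,Float64})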

It's generally not recommended to use your distro's build of Julia, especially not with the GPU back-ends (because of how distros generally want to use a system LLVM, which doesn't work). I recommend you use the binary from the Julia home page instead.


Okay, thanks! I will try it out with an official Julia build from the website.

Meanwhile, I changed the array initialization and mask creation to:

d_splits = CuArray{IntervalBox{3, Float64}}(splits)
mask = CuArray{Bool}(undef, length(splits))
mask = monotest.(d_splits)

I no longer get the prior error, but instead a type instability error about the monotest function (line 81 is where it’s called):

ERROR: LoadError: GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
[1] copy
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:44 [inlined]
[2] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(monotest), Tuple{CuArray{IntervalBox{3, Float64}, 1, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:883
[3] top-level scope
@ ~/twocomp.jl:81
in expression starting at ~/twocomp.jl:81

So I gave monotest an explicit return type of Bool, since on the CPU the broadcast returns a BitVector, but now this message appears:

ERROR: LoadError: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Bool, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(monotest), Tuple{Base.Broadcast.Extruded{CuDeviceVector{IntervalBox{3, Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation
Stacktrace:
[1] monotest
@ ~/twocomp.jl:34
[2] _broadcast_getindex_evalf
@ ./broadcast.jl:648
[3] _broadcast_getindex
@ ./broadcast.jl:621
[4] getindex
@ ./broadcast.jl:575
[5] broadcast_kernel
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:59
Reason: unsupported dynamic function invocation (call to contains_zero)
Stacktrace:
[1] monotest
@ ~/twocomp.jl:34
[2] _broadcast_getindex_evalf
@ ./broadcast.jl:648
[3] _broadcast_getindex
@ ./broadcast.jl:621
[4] getindex
@ ./broadcast.jl:575
[5] broadcast_kernel
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:59
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
[1] monotest
@ ~/twocomp.jl:34
[2] _broadcast_getindex_evalf
@ ./broadcast.jl:648
[3] _broadcast_getindex
@ ./broadcast.jl:621
[4] getindex
@ ./broadcast.jl:575
[5] broadcast_kernel
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:59
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CuDeviceVector{Bool, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(monotest), Tuple{Base.Broadcast.Extruded{CuDeviceVector{IntervalBox{3, Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/HeCT6/src/validation.jl:111
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/HeCT6/src/driver.jl:326 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/HeCT6/src/driver.jl:324 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/HeCT6/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/0IDh2/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/HeCT6/src/cache.jl:90
[8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Bool, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(monotest), Tuple{Base.Broadcast.Extruded{CuDeviceVector{IntervalBox{3, Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/0IDh2/src/compiler/execution.jl:297
[9] cufunction
@ ~/.julia/packages/CUDA/0IDh2/src/compiler/execution.jl:291 [inlined]
[10] macro expansion
@ ~/.julia/packages/CUDA/0IDh2/src/compiler/execution.jl:102 [inlined]
[11] #launch_heuristic#240
@ ~/.julia/packages/CUDA/0IDh2/src/gpuarrays.jl:17 [inlined]
[12] copyto!
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:65 [inlined]
[13] copyto!
@ ./broadcast.jl:936 [inlined]
[14] copy
@ ~/.julia/packages/GPUArrays/gkF6S/src/host/broadcast.jl:47 [inlined]
[15] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(monotest), Tuple{CuArray{IntervalBox{3, Float64}, 1, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:883
[16] top-level scope
@ ~/twocomp.jl:81
in expression starting at ~/twocomp.jl:81

But like I said, I will set up a new environment and try again.

You generally can't use broadcast within a kernel, so not within monotest either.

Okay, so is there some implicit broadcast happening inside the body of monotest that I’m not aware of?

Yes, contains_zero on a static vector or an IntervalBox leads to a broadcast:

https://github.com/JuliaIntervals/IntervalArithmetic.jl/blob/75f53c0c30d484fc79707e35503416bc7c23d9db/src/multidim/intervalbox.jl#L102
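
One way around it (just a sketch, untested on the GPU) is to apply the scalar contains_zero to each component of the gradient box yourself, so that no broadcast happens inside the kernel:

# check each partial derivative explicitly instead of broadcasting over the box;
# swap all for any if a single zero-containing partial should be enough
monotest(p)::Bool = all(contains_zero, dO(p))

Note that dO itself still builds a plain Vector for ForwardDiff.gradient, which is its own problem on the GPU (more on that further down).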

I see …

I rewrote the monotest function to not use contains_zero or boxes:

# O(p1, p2, p3) is the objective from above, written with three scalar parameters instead of a box
dOdp1(p1, p2, p3) = ForwardDiff.derivative(p1 -> O(p1, p2, p3), p1)
dOdp2(p1, p2, p3) = ForwardDiff.derivative(p2 -> O(p1, p2, p3), p2)
dOdp3(p1, p2, p3) = ForwardDiff.derivative(p3 -> O(p1, p2, p3), p3)

function monotest(p)::Bool
    p1 = p[1]
    p2 = p[2]
    p3 = p[3]
    return 0 in dOdp1(p1, p2, p3) || 0 in dOdp2(p1, p2, p3) || 0 in dOdp3(p1, p2, p3)
end

… and then this huge stacktrace happened (too big to paste here, so here's a link to it):
https://gist.github.com/lorenzgillner/11b421a35264215649280cfe3416d4b2

Is it safe to assume that at some point one of the libraries I’m using calls a function that’s incompatible with CUDA, and that I would have to familiarize myself with the source code of those libraries first? For example, the internal reduce function seems to cause some errors in the current context … but as someone who is fairly new to the language, I find this stacktrace quite overwhelming.

Btw, I am using the most up to date version of Julia now.

One problem is that you need to use SVectors (from StaticArrays.jl), not normal Julia vectors. See e.g. here. ForwardDiff can handle SVectors just fine.

Probably that definition of contains_zero should be contains_zero(X::IntervalBox) = any(contains_zero, X); this didn't use to work, but it should now.
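
Putting the two suggestions together, a GPU-friendly version of the test might look roughly like this. It is only a sketch: the names y2m_s and Obs are made up, the measurement data is moved into an SVector so the kernel doesn't capture a plain CPU array, and any mirrors the || in the rewritten monotest above (use all if every partial derivative has to contain zero).

using StaticArrays   # in addition to the packages already loaded above

# measurement data as an isbits container, so the broadcast kernel can capture it
const y2m_s = SVector{16,Float64}(y2m...)

# the objective, written for an SVector of three (interval) parameters
Obs(p) = sum((y2(t, p[1], p[2], p[3]) - y2m_s[t])^2 for t in 1:16)

# gradient as an SVector (no heap allocation); keep the box if any partial contains zero
function monotest(p::IntervalBox{3})::Bool
    g = ForwardDiff.gradient(Obs, SVector(p...))
    return any(contains_zero, g)
end

d_splits = CuArray{IntervalBox{3,Float64}}(splits)
mask = Array(monotest.(d_splits))   # compute on the GPU, copy the Bool mask back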