I am trying to write a program that solves a set of partial differential equations. This includes one space dimension and one time dimension, so I would like to pass a spatial vector for each of my variables to the ODE solver. For this I defined a custom struct to hold all of the information. I also would like my my program to be fast, so I studied up on the general methods to improve performance in DiffEq (Optimizing DiffEq Code), one of which is to reduce allocation by using an in-place form for the function at calculates you derivatives. By using the in-place form, the solver (presumably) uses broadcasting behind the scenes to reduce allocation. I did this, but my performance was way worse, and allocations increased. I tore my hair out for a long time until I realized that the problem stems from how broadcasting is achieved.
Here is a minimal working example that demonstrates this. I define my struct, and then define all of the interfaces as defined in (Interfaces · The Julia Language) for DiffEq. I then benchmark in-place broadcasting of my custom struct:
module temp
using BenchmarkTools
struct vars{T} <: AbstractArray{T,2}
vec1::Vector{T}
vec2::Vector{T}
vec3::Vector{T}
vec4::Vector{T}
vec5::Vector{T}
vec6::Vector{T}
vec7::Vector{T}
vec8::Vector{T}
vec9::Vector{T}
vec10::Vector{T}
vec11::Vector{T}
vec12::Vector{T}
end
cont(x::vars) = (x.vec1,x.vec2,x.vec3,x.vec4,x.vec5,x.vec6,x.vec7,x.vec8,x.vec9,x.vec10,x.vec11,x.vec12)
numvar=12
# Iteration
Base.IteratorSize(::Type{<:vars}) = Iterators.HasShape{2}()
Base.eltype(::Type{vars{T}}) where T = T
Base.isempty(x::vars) = isempty(x.vec1)
function Base.iterate(x::vars, state...)
return iterate(Iterators.flatten(cont(x)), state...)
end
Base.size(x::vars) = (length(x.vec1), numvar)
Base.size(x::vars, d) = size(x)[d]
# Indexing
function lin2cart(x::vars, i::Number)
n = length(x.vec1)
return (i - 1) % n + 1, (i - 1) ÷ n + 1
end
Base.getindex(x::vars, i) = getindex(x, i.I...)
Base.getindex(x::vars, i::Number) = getindex(x, lin2cart(x, i)...)
Base.getindex(x::vars, i, j) = getindex(cont(x)[j], i)
Base.setindex!(x::vars, v, i) = setindex!(x, v, i.I...)
Base.setindex!(x::vars, v, i::Number) = setindex!(x, v, lin2cart(x, i))
Base.setindex!(x::vars, v, i, j) = setindex!(cont(x)[j], v, i)
# Abstract Array
Base.IndexStyle(::vars) = IndexCartesian()
Base.similar(x::vars) = vars(map(similar, cont(x))...)
function Base.similar(x::vars, ::Type{T}) where {T}
return vars(map(y -> similar(y,T), cont(x))...)
end
Base.similar(x::vars, ::Dims) = similar(x)
Base.similar(x::vars, ::Dims, ::Type{T}) where {T} = similar(x, T)
#Broadcasting
Base.BroadcastStyle(::Type{<:vars}) = Broadcast.ArrayStyle{vars}()
function Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{vars}},
::Type{T}) where {T}
return similar(find_vars(bc), T)
end
find_vars(bc::Base.Broadcast.Broadcasted) = find_vars(bc.args)
find_vars(args::Tuple) = find_vars(find_vars(args[1]), Base.tail(args))
find_vars(x) = x
find_vars(::Tuple{}) = nothing
find_vars(a::vars, rest) = a
find_vars(::Any, rest) = find_vars(rest)
function Base.similar(::Type{vars{T}},size::Int) where T
return vars{T}(
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size)),
similar(Vector{T}(undef,size))
)
end
function main()
T = Float64
n = 4000
vars1 = similar(vars{T},n)
vars2 = similar(vars{T},n)
@btime $vars1.vec1 .= $vars2.vec2
@btime $vars1 .= $vars2
return
end
end
This results in
julia> temp.main()
2.189 μs (0 allocations: 0 bytes)
40.746 ms (288040 allocations: 16.11 MiB)
The broadcasting here should not allocate at all, and certainly shouldn’t take milliseconds to accomplish (I wouldn’t think). If I decrease the number of variables in my struct to 3 and set n=1000 I get instead
julia> temp.main()
1.300 μs (0 allocations: 0 bytes)
3.975 μs (0 allocations: 0 bytes)
which is the intended outcome, so somehow the number of variables or the number of bytes to move around plays a role. The end goal here is to pass this struct to an ODESolver, so I need some way defining Broadcasting so that I get the desired outcome, but I do not understand the custom broadcasting outlined in the docs, as all I have done here is essentially copy and paste the example from the docs for defining the broadcasting interface for the custom struct.
Thanks in advance.