Help - not so zero-cost abstraction

My problem is about an interface, and how to get the compiler to optimise it away.

Say I have a Matrix with 6 rows and N columns - rows 1-3 represent position and rows 4-6 are velocity; each column is a body in a simulation.

Written by hand, I’d like this simple code:

# Copy positional components into velocity (ignore physics for the moment)
function direct_access(bodies_state::BlockState)
    state = bodies_state.state      # This is the raw `Matrix`
    for i in axes(state, 2)
        state[4, i] = state[1, i]
        state[5, i] = state[2, i]
        state[6, i] = state[3, i]
    end
    nothing
end

To be what the compiler would optimise to from this interface:

function property_access(bodies_state::BlockState)
    for i in axes(bodies_state.state, 2)
        bodies_state.velocity[1, i] = bodies_state.position[1, i]
        bodies_state.velocity[2, i] = bodies_state.position[2, i]
        bodies_state.velocity[3, i] = bodies_state.position[3, i]
    end
    nothing
end

In short, I want to think in terms of logical substates (position, velocity, other ones), but optimise down to direct indices (i.e. remap .velocity[1, icol] to .state[4, icol] at compile time).

Pre-emptively answering XY-problem questions:

  • I need this interface to be general enough to support different substates depending on the simulation, so I cannot hardcode just the names position and velocity.
  • I want to avoid making views, as parts of this code will run on a GPU, where views do have some cost associated with them; they also incur some extra pointer arithmetic.
  • This format is closer to an array of structures than a structure of arrays, but there is complicated logic I need to do on each body, so it’s better to keep the per-body components close in memory.

That said, I think this is a general enough interface style that its implementation would be useful to more people.

What I tried to create is below; I thought the compiler would have enough information to optimise the interface away, but benchmarking and Cthulhu show otherwise. If anyone could help me 1) understand where optimisations fail and why, and 2) how to fix this - I’d really appreciate it.

First, we can encode the range of rows spanned by a substate at the type-level:

struct ValRange{START, STOP}
    start::Val{START}
    stop::Val{STOP}
end

ValRange(start::Integer, stop::Integer) = ValRange(Val(start), Val(stop))

And we can create a BlockSubstate containing a pointer to the raw full matrix and the associated ValRange for a given substate, and overload getindex and setindex to remap row indices:

struct BlockSubstate{S, START, STOP}
    state::S
    rows::ValRange{START, STOP}
end

@inline function Base.getindex(
    s::BlockSubstate{S, START, STOP},
    irow::Integer,
    icol::Integer,
) where {S, START, STOP}
    s.state[START - 1 + irow, icol]
end

@inline function Base.setindex!(
    s::BlockSubstate{S, START, STOP},
    x,
    irow::Integer,
    icol::Integer,
) where {S, START, STOP}
    s.state[START - 1 + irow, icol] = x
end

And we can save the whole BlockState with the raw matrix and a NamedTuple mapping the substate names to a given ValRange of rows. Then we can overload getproperty to create a BlockSubstate when accessing a substate:

struct BlockState{M <: AbstractMatrix, D <: NamedTuple}
    state::M
    substates::D
end

@inline function Base.getproperty(s::BlockState, name::Symbol)
    if name === :state
        getfield(s, :state)
    elseif name === :substates
        getfield(s, :substates)
    else
        # Access a given ValRange from the .substates NamedTuple
        irange = getfield(s, :substates)[name]
        state = getfield(s, :state)
        return BlockSubstate(state, irange)
    end
end

So now we can create a state Matrix of, say, 1000 bodies, with 4 substates and wrap them up:

state = rand(Float32, 13, 1000)
substates = (
    position=ValRange(1, 3),
    velocity=ValRange(4, 6),
    orientation=ValRange(7, 10),
    angular_velocity=ValRange(11, 13),
)

bodies_state = BlockState(state, substates)

There it is, interface implemented; when we write bodies_state.velocity[1, icol], all required information should be known at compile-time: .velocity is translated to getproperty(bodies_state, :velocity), which we use to access the substates::NamedTuple, which should have the fields known at compile-time, create a temporary BlockSubstate which we then have the overloaded getindex on, and use the type-level values from the ValRange{4, 6} to remap 1 to START - 1 + 1, and ideally compile all that away into a simple state[4, icol].

That is not what happens:

using BenchmarkTools

println("Benchmarking direct access:")
display(@benchmark direct_access($bodies_state))

println("Benchmarking property access:")
display(@benchmark property_access($bodies_state))

# output
Benchmarking direct access:
BenchmarkTools.Trial: 10000 samples with 269 evaluations per sample.
 Range (min … max):  293.059 ns … 508.364 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     318.773 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   318.341 ns Β±  17.491 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      ▂▃▁      β–β–…β–ˆβ–ˆβ–†β–‚                                            
  β–β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–ƒβ–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  293 ns           Histogram: frequency by time          397 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Benchmarking property access:
BenchmarkTools.Trial: 5998 samples with 1 evaluation per sample.
 Range (min … max):  787.292 ΞΌs …  1.928 ms  β”Š GC (min … max): 0.00% … 57.71%
 Time  (median):     818.062 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   832.980 ΞΌs Β± 47.899 ΞΌs  β”Š GC (mean Β± Οƒ):  0.23% Β±  1.98%

   β–ˆβ–…               β–ƒ                                           
  β–…β–ˆβ–ˆβ–…β–„β–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–β–‚β–‚β–‚β–β–β–…β–ˆβ–‡β–„β–ƒβ–ƒβ–‚β–‚β–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  787 ΞΌs          Histogram: frequency by time         1.01 ms <

 Memory estimate: 93.75 KiB, allocs estimate: 6000.

The code above is self-contained; I also added it to a Google Colab Notebook you can run.

The odd part is that Cthulhu shows type inference to work well, for example:

...
105         bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.velocity::BlockSubstate{Matrix{Float32}, 4, 6}[1, i] = bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.position::BlockSubstate{Matrix{Float32}, 1, 3}[1, i]::Float32
...

Using AllocCheck on property_access I see:

Allocating runtime call to "jl_get_pgcstack_static" in unknown location

Which I don’t really know how to interpret.

The typed code shows:

...
β”‚   %12 =   dynamic Base.getproperty(bodies_state, :position)::BlockSubstate{Matrix{Float32}, 1, 3}
β”‚   %13 = i::Int64
β”‚   %14 =   dynamic Base.getindex(%12, 1, %13)::Float32
β”‚   %15 =   dynamic Base.getproperty(bodies_state, :velocity)::BlockSubstate{Matrix{Float32}, 4, 6}
β”‚   %16 = i::Int64
β”‚           dynamic Base.setindex!(%15, %14, 1, %16)::Any
...

But the assembly shows a lot more indirection and branching (and a lot more instructions) for property_access than direct_access.

That’s as far as I got. Anyone have a clue what’s happening here?

1 Like

There’s a lot going on up there and it would take a lot of looking for me to figure out a technical answer to your question. Personally, I hate trying to modify getproperty and would suggest you make a different function to do what you need instead of it.

Is it necessary to store the values in the columns of a matrix? While this is idiomatic in Python/MATLAB, I find that Julia enables much nicer representations. In particular, I would use an actual array-of-structs.

The rest of this post is providing some ideas for the structs you might put in such a Vector-of-structs.


What if you defined the states a little differently? Here I have used StaticArrays because the performance and abstractions it provides are extremely nice for 3D data, but you could use normal Vectors. Note that SVector is immutable so you must replace rather than update it. This sounds inconvenient initially, but with just a little practice I find immutables much easier to work with. The performance characteristics are a touch different, but for this sort of thing I find that in-net it can be much faster.

using StaticArrays

struct State{NT}
    substates::NT
end

State((; position=SVector(8e0,9,10), velocity=SVector(1e0,0,0), orientation=SVector(0e0,0,0)))

function update(x::State, props::NamedTuple)
    subst = x.substates
    foreach(keys(props)) do k
        k in keys(subst) || error("property ", k, " is not a part of this State")
    end
    State(merge(x.substates, props))
end
julia> s = State((; position=SVector(8e0,9,10), velocity=SVector(1e0,0,0), orientation=SVector(0e0,0,0)))
State{@NamedTuple{position::SVector{3, Float64}, velocity::SVector{3, Float64}, orientation::SVector{3, Float64}}}((position = [8.0, 9.0, 10.0], velocity = [1.0, 0.0, 0.0], orientation = [0.0, 0.0, 0.0]))

julia> update(s, (; velocity=s.substates.position))
State{@NamedTuple{position::SVector{3, Float64}, velocity::SVector{3, Float64}, orientation::SVector{3, Float64}}}((position = [8.0, 9.0, 10.0], velocity = [8.0, 9.0, 10.0], orientation = [0.0, 0.0, 0.0]))

and you can store these directly in a Vector{typeof(s)}.


Personally, I would go even further and try to build a more robust system using an overarching abstract type (which is irrelevant to the code below, but would become relevant when you defined different subtypes but wanted some methods to be shared by all of them):

using StaticArrays

abstract type AbstractState end

# make a custom <:AbstractState whenever you want a different set of fields
# alternatively, you can parameterize things on the number of derivatives using NTuples, but I won't go into that here
struct PVOVState{T} <: AbstractState
    pos::SVector{3,T}
    vel::SVector{3,T}
    orient::SVector{3,T}
    angvel::SVector{3,T}
end

# the relevant getters -- not all getters will be applicable to all AbstractStates
positionof(x::PVOVState) = x.pos
velocityof(x::PVOVState) = x.vel
orientationof(x::PVOVState) = x.orient
angularvelocityof(x::PVOVState) = x.angvel
asvector(x::PVOVState) = vcat(x.pos, x.vel, x.orient, x.angvel) # stack all values - maybe also make a PVOV constructor to reverse the process

# the relevant setters
setvelocity(x::PVOVState, v) = PVOVState(x.pos, v, x.orient, x.angvel)
# etc
julia> s = PVOVState(SVector(8e0,9,10), SVector(1e0,0,0), SVector(0e0,0,0), SVector(0e0,0,0))
PVOVState{Float64}([8.0, 9.0, 10.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> setvelocity(s, positionof(s))
PVOVState{Float64}([8.0, 9.0, 10.0], [8.0, 9.0, 10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

and (again) store multiple states in a Vector{PVOVState{Float64}}.

1 Like

I’m pretty sure it’s because union-splitting only works up to three types. For your original substates of four entries, you get

julia> @btime $bodies_state.position;
  201.724 ns (1 allocation: 16 bytes)

and in @code_warntype bodies_state.position there is

β”‚   %17 = irange::ValRange
β”‚   %18 = Main.BlockSubstate(%16, %17)::BlockSubstate{Matrix{Float32}}

On the other hand, with three entries, e.g.

substates = (
    position=ValRange(1, 3),
    velocity=ValRange(4, 6),
    orientation=ValRange(7, 13),
)

you get

julia> @btime $bodies_state.position;
  2.100 ns (0 allocations: 0 bytes)

and

β”‚   %17 = irange::Union{ValRange{1, 3}, ValRange{4, 6}, ValRange{7, 13}}
β”‚   %18 = Main.BlockSubstate(%16, %17)::Union{BlockSubstate{Matrix{Float32}, 1, 3}, BlockSubstate{Matrix{Float32}, 4, 6}, BlockSubstate{Matrix{Float32}, 7, 13}}

property_access is also as fast as direct_access.

As on how to fix this, I guess you need to force something like constant propagation, but I’m not knowledgeable enough on these things, so can’t really help. All I can tell you is that adding a Base.@constprop :aggressive does not seem to help :).

1 Like

A few more comments:

You don’t need fields here. The type parameters are enough:

struct ValRange{START, STOP}
end

get_start(::ValRange{Start, Stop}) = Start
get_stop(::ValRange{Start, Stop}) = Stop

I second @mikmoore that there are probably nicer primitive available in Julia. Have a look at StructArrays.jl or maybe even the rather new Ark.jl.

Alternatively, when you really really want to make sure that something gets fully inlined/compiled away then you could look toward @generated functions:

I second the suggestion to look at Ark.jl for this type of problem. Especially if you have different sets of state variables depending on the simulation, that library should make things easier for you.

Thanks @eldee for catching that the problem was union-splitting when accessing the NamedTuple field! I was able to fix this in two ways - here they are in case someone else has this problem:

First, modify the getproperty to get the keys and values in the NamedTuple as individual Tuples:

@inline function Base.getproperty(s::BlockState, name::Symbol)
    if name === :state
        getfield(s, :state)
    elseif name === :substates
        getfield(s, :substates)
    else
        state = getfield(s, :state)
        substates = getfield(s, :substates)
        return @inline _get_block_substate(keys(substates), values(substates), name, state)
end

To β€œforce” constant propagation we can do it recursively and inline the checks:

function _get_block_substate(names::Tuple{}, substates::Tuple{}, name::Symbol, state)
    throw(ArgumentError("No such substate: $name"))
end

function _get_block_substate(names::Tuple, substates::Tuple, name::Symbol, state)
    if names[1] === name
        return BlockSubstate(state, substates[1])
    end
    return @inline _get_block_substate(Base.tail(names), Base.tail(substates), name, state)
end

Or use Unrolled.jl to create a generated function with separate checks:

@unroll function _get_block_substate(names::Tuple, substates::Tuple, name::Symbol, state)
    @unroll for i in 1:length(names)
        if names[i] === name
            return BlockSubstate(state, substates[i])
        end
    end
    throw(ArgumentError("No such substate: $name"))
end

Creating the BlockSubstate at the point of finding the substate name avoids the union splitting from substates[fieldname] - and this is all done at compile time; now both property_access and direct_access take the same amount of time.


Replying to the other points in this thread: in this simulation I need to do a lot of mutation on the states, so modelling them as immutable structs that I keep recreating and overwriting is really 1) (for sure) not pleasant, and 2) (I think) not as performant, especially when talking about millions or billions of bodies; it’s not just incrementing a position (which an integrator would do anyways), but thousands of lines of complex interactions between bodies. Then there are also types which cannot be directly modelled by this approach, such as triangular (2D) meshes, where you store vertices and connectivity separately, so a single struct Triangle doesn’t make sense as for a single struct Atom.

@mikmoore your solution still hardcodes the state names, either in the accessors or the fields in the struct; I need to be able to change them depending on the simulation without having to write new structs. @simsurace Ark.jl looks nice, but I’d still have to 1) define new structs for each simulation type, and 2) it is not threadsafe; I’ll need this to work on a GPU too. Backing this all by a single mutable matrix keeps writing into memory to a minimum, works directly with an integrator, and makes usage more natural for this specific usecase (I do think bodies_state.velocity[1, ibody] = ... is clearer than update(bodies_state, (; velocity=...))).


Now my question is why a compile-time known Symbol as in bodies_state.velocity doesn’t propagate into accessing a NamedTuple as in substates[name]? Why exactly does this trigger union-splitting?

Also, what exactly is jl_get_pgcstack_static and why does it pop up here? I didn’t really understand it just from the threading.c compiler code.