Help - not so zero-cost abstraction

My problem is about an interface, and how to get the compiler to optimise it away.

Say I have a Matrix with 6 rows and N columns - rows 1-3 represent position and rows 4-6 are velocity; each column is a body in a simulation.

Written by hand, I’d like this simple code:

# Copy positional components into velocity (ignore physics for the moment)
function direct_access(bodies_state::BlockState)
    state = bodies_state.state      # This is the raw `Matrix`
    for i in axes(state, 2)
        state[4, i] = state[1, i]
        state[5, i] = state[2, i]
        state[6, i] = state[3, i]
    end
    nothing
end

To be what the compiler ends up producing when optimising this interface-based version:

function property_access(bodies_state::BlockState)
    for i in axes(bodies_state.state, 2)
        bodies_state.velocity[1, i] = bodies_state.position[1, i]
        bodies_state.velocity[2, i] = bodies_state.position[2, i]
        bodies_state.velocity[3, i] = bodies_state.position[3, i]
    end
    nothing
end

In short, I want to think in terms of logical substates (position, velocity, and so on), but optimise down to direct indices (i.e. remap .velocity[1, icol] to .state[4, icol] at compile time).

Pre-emptively answering XY-problem questions:

  • I need this interface to be general enough to support different substates depending on the simulation, so I cannot hardcode just the names position and velocity.
  • I want to avoid making views, as parts of this code will run on a GPU, where views do have some cost associated with them; they also incur some extra pointer arithmetic.
  • This format is closer to an array of structures than a structure of arrays, but there is complicated logic I need to do on each body, so it’s better to keep the per-body components close in memory.

That said, I think this is a general enough interface style that its implementation would be useful to more people.

What I tried to create is below; I thought the compiler would have enough information to optimise the interface away, but benchmarking and Cthulhu show otherwise. If anyone could help me 1) understand where the optimisation fails and why, and 2) figure out how to fix it, I’d really appreciate it.

First, we can encode the range of rows spanned by a substate at the type-level:

struct ValRange{START, STOP}
    start::Val{START}
    stop::Val{STOP}
end

ValRange(start::Integer, stop::Integer) = ValRange(Val(start), Val(stop))
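
For example (a quick check of my own, not in the original post), the row bounds live purely in the type:

r = ValRange(4, 6)              # calls ValRange(Val(4), Val(6))
typeof(r) === ValRange{4, 6}    # true: START and STOP are type parameters, not runtime data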

And we can create a BlockSubstate containing a pointer to the raw full matrix and the associated ValRange for a given substate, and overload getindex and setindex! to remap row indices:

struct BlockSubstate{S, START, STOP}
    state::S
    rows::ValRange{START, STOP}
end

@inline function Base.getindex(
    s::BlockSubstate{S, START, STOP},
    irow::Integer,
    icol::Integer,
) where {S, START, STOP}
    s.state[START - 1 + irow, icol]
end

@inline function Base.setindex!(
    s::BlockSubstate{S, START, STOP},
    x,
    irow::Integer,
    icol::Integer,
) where {S, START, STOP}
    s.state[START - 1 + irow, icol] = x
end
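
As a quick sanity check (my example), a BlockSubstate over rows 4-6 forwards local row 1 to row 4 of the raw matrix:

m = zeros(Float32, 6, 2)
vel = BlockSubstate(m, ValRange(4, 6))   # wraps rows 4-6 of m
vel[1, 1] = 1.0f0                        # writes m[4, 1] via the overloaded setindex!
vel[1, 1] == m[4, 1]                     # true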

And we can define the whole BlockState holding the raw matrix and a NamedTuple mapping substate names to their ValRange of rows. Then we can overload getproperty to create a BlockSubstate when accessing a substate:

struct BlockState{M <: AbstractMatrix, D <: NamedTuple}
    state::M
    substates::D
end

@inline function Base.getproperty(s::BlockState, name::Symbol)
    if name === :state
        getfield(s, :state)
    elseif name === :substates
        getfield(s, :substates)
    else
        # Access a given ValRange from the .substates NamedTuple
        irange = getfield(s, :substates)[name]
        state = getfield(s, :state)
        return BlockSubstate(state, irange)
    end
end

So now we can create a state Matrix of, say, 1000 bodies, with 4 substates and wrap them up:

state = rand(Float32, 13, 1000)
substates = (
    position=ValRange(1, 3),
    velocity=ValRange(4, 6),
    orientation=ValRange(7, 10),
    angular_velocity=ValRange(11, 13),
)

bodies_state = BlockState(state, substates)

There it is, the interface is implemented. When we write bodies_state.velocity[1, icol], all the required information should be known at compile time: .velocity is translated to getproperty(bodies_state, :velocity), which looks up the substates::NamedTuple (whose fields are known at compile time), creates a temporary BlockSubstate with the overloaded getindex, and uses the type-level values from ValRange{4, 6} to remap row 1 to START - 1 + 1; ideally all of that compiles away into a simple state[4, icol].
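
Spelled out by hand (my annotation, not from the original post), the chain I expect the compiler to collapse is:

icol = 1                                        # any example column
vel  = bodies_state.velocity                    # getproperty -> BlockSubstate{Matrix{Float32}, 4, 6}
x    = vel[1, icol]                             # getindex remaps the row: START - 1 + 1 == 4
x == bodies_state.state[4, icol]                # true; ideally the whole chain collapses to this one load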

That is not what happens:

using BenchmarkTools

println("Benchmarking direct access:")
display(@benchmark direct_access($bodies_state))

println("Benchmarking property access:")
display(@benchmark property_access($bodies_state))

# output
Benchmarking direct access:
BenchmarkTools.Trial: 10000 samples with 269 evaluations per sample.
 Range (min … max):  293.059 ns … 508.364 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     318.773 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   318.341 ns Β±  17.491 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      ▂▃▁      β–β–…β–ˆβ–ˆβ–†β–‚                                            
  β–β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–ƒβ–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  293 ns           Histogram: frequency by time          397 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Benchmarking property access:
BenchmarkTools.Trial: 5998 samples with 1 evaluation per sample.
 Range (min … max):  787.292 ΞΌs …  1.928 ms  β”Š GC (min … max): 0.00% … 57.71%
 Time  (median):     818.062 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   832.980 ΞΌs Β± 47.899 ΞΌs  β”Š GC (mean Β± Οƒ):  0.23% Β±  1.98%

   β–ˆβ–…               β–ƒ                                           
  β–…β–ˆβ–ˆβ–…β–„β–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–β–‚β–‚β–‚β–β–β–…β–ˆβ–‡β–„β–ƒβ–ƒβ–‚β–‚β–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  787 ΞΌs          Histogram: frequency by time         1.01 ms <

 Memory estimate: 93.75 KiB, allocs estimate: 6000.

The code above is self-contained; I also added it to a Google Colab Notebook you can run.

The odd part is that Cthulhu shows type inference to work well, for example:

...
105         bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.velocity::BlockSubstate{Matrix{Float32}, 4, 6}[1, i] = bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.position::BlockSubstate{Matrix{Float32}, 1, 3}[1, i]::Float32
...

Using AllocCheck on property_access I see:

Allocating runtime call to "jl_get_pgcstack_static" in unknown location

Which I don’t really know how to interpret.

The typed code shows:

...
β”‚   %12 =   dynamic Base.getproperty(bodies_state, :position)::BlockSubstate{Matrix{Float32}, 1, 3}
β”‚   %13 = i::Int64
β”‚   %14 =   dynamic Base.getindex(%12, 1, %13)::Float32
β”‚   %15 =   dynamic Base.getproperty(bodies_state, :velocity)::BlockSubstate{Matrix{Float32}, 4, 6}
β”‚   %16 = i::Int64
β”‚           dynamic Base.setindex!(%15, %14, 1, %16)::Any
...

But the assembly shows a lot more indirection and branching (and a lot more instructions) for property_access than direct_access.

That’s as far as I got. Anyone have a clue what’s happening here?


There’s a lot going on up there and it would take a lot of looking for me to figure out a technical answer to your question. Personally, I hate trying to modify getproperty and would suggest you write a separate function to do what you need instead.

Is it necessary to store the values in the columns of a matrix? While this is idiomatic in Python/MATLAB, I find that Julia enables much nicer representations. In particular, I would use an actual array-of-structs.

The rest of this post provides some ideas for the structs you might put in such a Vector-of-structs.


What if you defined the states a little differently? Here I have used StaticArrays because the performance and abstractions it provides are extremely nice for 3D data, but you could use normal Vectors. Note that SVector is immutable, so you must replace rather than update it. This sounds inconvenient initially, but with just a little practice I find immutables much easier to work with. The performance characteristics are a touch different, but for this sort of thing I find that, on net, it can be much faster.

using StaticArrays

struct State{NT}
    substates::NT
end

State((; position=SVector(8e0,9,10), velocity=SVector(1e0,0,0), orientation=SVector(0e0,0,0)))

function update(x::State, props::NamedTuple)
    subst = x.substates
    foreach(keys(props)) do k
        k in keys(subst) || error("property ", k, " is not a part of this State")
    end
    State(merge(x.substates, props))
end
julia> s = State((; position=SVector(8e0,9,10), velocity=SVector(1e0,0,0), orientation=SVector(0e0,0,0)))
State{@NamedTuple{position::SVector{3, Float64}, velocity::SVector{3, Float64}, orientation::SVector{3, Float64}}}((position = [8.0, 9.0, 10.0], velocity = [1.0, 0.0, 0.0], orientation = [0.0, 0.0, 0.0]))

julia> update(s, (; velocity=s.substates.position))
State{@NamedTuple{position::SVector{3, Float64}, velocity::SVector{3, Float64}, orientation::SVector{3, Float64}}}((position = [8.0, 9.0, 10.0], velocity = [8.0, 9.0, 10.0], orientation = [0.0, 0.0, 0.0]))

and you can store these directly in a Vector{typeof(s)}.
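
For example (my sketch, not from the reply), updating then means replacing elements of that vector:

states = [s for _ in 1:1000]                                                # Vector{typeof(s)}, stored inline
states[1] = update(states[1], (; velocity = states[1].substates.position)) # replace rather than mutate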


Personally, I would go even further and try to build a more robust system using an overarching abstract type (which is irrelevant to the code below, but would become relevant when you defined different subtypes but wanted some methods to be shared by all of them):

using StaticArrays

abstract type AbstractState end

# make a custom <:AbstractState whenever you want a different set of fields
# alternatively, you can parameterize things on the number of derivatives using NTuples, but I won't go into that here
struct PVOVState{T} <: AbstractState
    pos::SVector{3,T}
    vel::SVector{3,T}
    orient::SVector{3,T}
    angvel::SVector{3,T}
end

# the relevant getters -- not all getters will be applicable to all AbstractStates
positionof(x::PVOVState) = x.pos
velocityof(x::PVOVState) = x.vel
orientationof(x::PVOVState) = x.orient
angularvelocityof(x::PVOVState) = x.angvel
asvector(x::PVOVState) = vcat(x.pos, x.vel, x.orient, x.angvel) # stack all values - maybe also make a PVOV constructor to reverse the process

# the relevant setters
setvelocity(x::PVOVState, v) = PVOVState(x.pos, v, x.orient, x.angvel)
# etc
julia> s = PVOVState(SVector(8e0,9,10), SVector(1e0,0,0), SVector(0e0,0,0), SVector(0e0,0,0))
PVOVState{Float64}([8.0, 9.0, 10.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> setvelocity(s, positionof(s))
PVOVState{Float64}([8.0, 9.0, 10.0], [8.0, 9.0, 10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

and (again) store multiple states in a Vector{PVOVState{Float64}}.


I’m pretty sure it’s because union-splitting only works up to three types. For your original substates of four entries, you get

julia> @btime $bodies_state.position;
  201.724 ns (1 allocation: 16 bytes)

and in @code_warntype bodies_state.position there is

β”‚   %17 = irange::ValRange
β”‚   %18 = Main.BlockSubstate(%16, %17)::BlockSubstate{Matrix{Float32}}

On the other hand, with three entries, e.g.

substates = (
    position=ValRange(1, 3),
    velocity=ValRange(4, 6),
    orientation=ValRange(7, 13),
)

you get

julia> @btime $bodies_state.position;
  2.100 ns (0 allocations: 0 bytes)

and

β”‚   %17 = irange::Union{ValRange{1, 3}, ValRange{4, 6}, ValRange{7, 13}}
β”‚   %18 = Main.BlockSubstate(%16, %17)::Union{BlockSubstate{Matrix{Float32}, 1, 3}, BlockSubstate{Matrix{Float32}, 4, 6}, BlockSubstate{Matrix{Float32}, 7, 13}}

property_access is also as fast as direct_access.

As for how to fix this, I guess you need to force something like constant propagation, but I’m not knowledgeable enough about these things, so I can’t really help. All I can tell you is that adding Base.@constprop :aggressive does not seem to help :).


A few more comments:

You don’t need fields here. The type parameters are enough:

struct ValRange{START, STOP}
end

get_start(::ValRange{Start, Stop}) where {Start, Stop} = Start
get_stop(::ValRange{Start, Stop}) where {Start, Stop} = Stop
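
A quick illustration (my addition), assuming the field-less definition above:

r = ValRange{4, 6}()          # a singleton: the whole range is encoded in the type
get_start(r), get_stop(r)     # (4, 6)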

I second @mikmoore that there are probably nicer primitives available in Julia. Have a look at StructArrays.jl or maybe even the rather new Ark.jl.

Alternatively, when you really, really want to make sure that something gets fully inlined/compiled away, you could look toward @generated functions.
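
For instance, here is a minimal sketch (my addition, not from this reply) of what that could look like, assuming the field-less ValRange above and a hypothetical get_substate helper that takes the substate name as a Val:

@generated function get_substate(s::BlockState{M, D}, ::Val{name}) where {M, D, name}
    R = fieldtype(D, name)   # e.g. ValRange{4, 6}; known from the NamedTuple type at generation time
    return :(BlockSubstate(getfield(s, :state), $R()))
end

# e.g. get_substate(bodies_state, Val(:velocity))[1, icol] should reduce to state[4, icol]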

I second the suggestion to look at Ark.jl for this type of problem. Especially if you have different sets of state variables depending on the simulation, that library should make things easier for you.

Thanks @eldee for catching that the problem was union-splitting when accessing the NamedTuple field! I was able to fix this in two ways - here they are in case someone else has this problem:

First, modify the getproperty to get the keys and values in the NamedTuple as individual Tuples:

@inline function Base.getproperty(s::BlockState, name::Symbol)
    if name === :state
        getfield(s, :state)
    elseif name === :substates
        getfield(s, :substates)
    else
        state = getfield(s, :state)
        substates = getfield(s, :substates)
        return @inline _get_block_substate(keys(substates), values(substates), name, state)
    end
end

To β€œforce” constant propagation we can do the lookup recursively and inline the checks:

function _get_block_substate(names::Tuple{}, substates::Tuple{}, name::Symbol, state)
    throw(ArgumentError("No such substate: $name"))
end

function _get_block_substate(names::Tuple, substates::Tuple, name::Symbol, state)
    if names[1] === name
        return BlockSubstate(state, substates[1])
    end
    return @inline _get_block_substate(Base.tail(names), Base.tail(substates), name, state)
end

Or use Unrolled.jl to create a generated function with separate checks:

using Unrolled

@unroll function _get_block_substate(names::Tuple, substates::Tuple, name::Symbol, state)
    @unroll for i in 1:length(names)
        if names[i] === name
            return BlockSubstate(state, substates[i])
        end
    end
    throw(ArgumentError("No such substate: $name"))
end

Creating the BlockSubstate at the point where the substate name is found avoids the union splitting from substates[name] - and this is all done at compile time; now both property_access and direct_access take the same amount of time.
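
A quick way to check the fix (my addition, assuming the definitions above have been reloaded):

property_access(bodies_state)                             # warm up so compilation is not counted
direct_access(bodies_state)
@assert (@allocated property_access(bodies_state)) == 0   # no more per-access boxing/dispatch
@assert (@allocated direct_access(bodies_state)) == 0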


Replying to the other points in this thread: in this simulation I need to do a lot of mutation on the states, so modelling them as immutable structs that I keep recreating and overwriting is 1) (for sure) not pleasant, and 2) (I think) not as performant, especially when talking about millions or billions of bodies; it’s not just incrementing a position (which an integrator would do anyway), but thousands of lines of complex interactions between bodies. There are also types which cannot be directly modelled by this approach, such as triangular (2D) meshes, where you store vertices and connectivity separately, so a single struct Triangle doesn’t make sense the way a single struct Atom does.

@mikmoore your solution still hardcodes the state names, either in the accessors or in the fields of the struct; I need to be able to change them depending on the simulation without having to write new structs. @simsurace Ark.jl looks nice, but 1) I’d still have to define new structs for each simulation type, and 2) it is not thread-safe, and I’ll need this to work on a GPU too. Backing this all by a single mutable matrix keeps memory writes to a minimum, works directly with an integrator, and makes usage more natural for this specific use case (I do think bodies_state.velocity[1, ibody] = ... is clearer than update(bodies_state, (; velocity=...))).


Now my question is: why doesn’t a compile-time-known Symbol, as in bodies_state.velocity, propagate into the NamedTuple access substates[name]? Why exactly does this trigger union-splitting?

Also, what exactly is jl_get_pgcstack_static and why does it pop up here? I didn’t really understand it just from the threading.c compiler code.


This report is deceptive because @code_ specializes over the concrete argument types even when the compiler doesn’t. In this case, it fails to account for getproperty specializing over literal values of name::Symbol. By making a minimal caller method with a literal value, we can see the proper inference that is consistent with the larger @code_warntype property_access(bodies_state):

julia> getposition(x) = x.position
getposition (generic function with 1 method)

julia> @code_warntype getposition(bodies_state)
MethodInstance for getposition(::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}})
  from getposition(x) @ Main REPL[55]:1
Arguments
  #self#::Core.Const(Main.getposition)
  x::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}
Body::BlockSubstate{Matrix{Float32}, 1, 3}
1 ─ %1 = Base.getproperty(x, :position)::BlockSubstate{Matrix{Float32}, 1, 3}
└──      return %1

That one allocation is just a dynamic dispatch. Not sure why that happens (and @code_ not specializing on the getproperty call accurately is part of the reason) but it seems related to how many elements substates has, as you demonstrated.

julia> @code_llvm getposition(bodies_state)
; Function Signature: getposition(Main.BlockState{Array{Float32, 2}, NamedTuple{(:position, :velocity, :orientation, :angular_velocity), Tuple{Main.ValRange{1, 3}, Main.ValRange{4, 6}, Main.ValRange{7, 10}, Main.ValRange{11, 13}}}})
;  @ REPL[46]:1 within `getposition`
; Function Attrs: uwtable
define { ptr } @julia_getposition_8820(ptr nocapture noundef nonnull readonly align 8 dereferenceable(8) %"x::BlockState") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
; β”Œ @ REPL[10]:9 within `getproperty`
   %"x::BlockState.state" = load atomic ptr, ptr %"x::BlockState" unordered, align 8
; β”‚ @ REPL[10]:10 within `getproperty`
   store ptr %"x::BlockState.state", ptr %jlcallframe1, align 8
   %0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
   store ptr @"jl_global#8824.jit", ptr %0, align 8
   %1 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#8823.jit", ptr nonnull %jlcallframe1,
 i32 2)
   %.unbox.unpack = load ptr, ptr %1, align 8
   %.unbox1 = insertvalue { ptr } zeroinitializer, ptr %.unbox.unpack, 0
   ret { ptr } %.unbox1
; β””
}

6 calls/column Γ— 1000 columns = 6000 calls, matching the 6000 allocations. 787 ΞΌs / 6000 calls β‰ˆ 131 ns/call, which is within the range of my dynamic dispatch benchmark attempt (10 ns looping over the same minimal method, 180 ns looping over randomized methods, though it seemed to scale with the number of methods involved) a year ago in v1.11.2.

For comparison, an absence of a dynamic dispatch is an absence of %jlcallframe and @ijl_apply_generic:

julia> @code_llvm getposition((a=1,b=2.1,position=3im,d=4//1,e=:five))
; Function Signature: getposition(NamedTuple{(:a, :b, :position, :d, :e), Tuple{Int64, Float64, Base.Complex{Int64}, Base.Rational{Int64}, Symbol}})
;  @ REPL[55]:1 within `getposition`
; Function Attrs: uwtable
define void @julia_getposition_9170(ptr noalias nocapture noundef nonnull sret([2 x i64]) align 8 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 8 dereferenceable(56) %"x::NamedTuple", ptr nocapture readonly %.roots.x) #0 {
top:
; β”Œ @ Base_compiler.jl:54 within `getproperty`
   %0 = getelementptr inbounds i8, ptr %"x::NamedTuple", i64 16
   call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(16) %sret_return, ptr noundef nonnull align 8 dereferenceable(16) %0, i64 16, i1 false)
   ret void
; β””
}

julia> @btime getposition($(a=1,b=2.1,position=3im,d=4//1,e=:five))
  2.100 ns (0 allocations: 0 bytes)
0 + 3im

I don’t really understand this complaint. You’re already hardcoding the property names into the program you want to write too:

function property_access(bodies_state::BlockState)
    for i in axes(bodies_state.state, 2)
        bodies_state.velocity[1, i] = bodies_state.position[1, i]
        bodies_state.velocity[2, i] = bodies_state.position[2, i]
        bodies_state.velocity[3, i] = bodies_state.position[3, i]
    end
    nothing
end

i.e. you’re explicitly writing .velocity and .position anyway

This is more a question about design - the code above is part of a larger codebase, but I’ll try to explain the key points below.

In terms of usage, I want the user-facing interface to be declarative with mix-and-match physical models, say:

particles = ...
models = (Gravity, SomeRepulsiveForce)
sim = Simulation(particles, models, ...)

Where Gravity and SomeRepulsiveForce register that they need position and velocity as substates. But you can also specify:

models = (Gravity, SomeRepulsiveForce, SomeRotationalForce)

Which will also need orientation and angular_velocity; add SomeHeatInteraction to include temperature as a substate too; add LiquidFilmInteraction (+:coating) and remove SomeRotationalForce (-:orientation, -:angular_velocity) - you get the idea: mix and match, and the simulation will be set up with just the substates that the models declare they need.

In their implementation, the models do something like:

struct SomeRepulsiveForce
    ...
end

# Register substates required and their numbers of degrees of freedom -
# that is, the number of rows in the `state` matrix we'll integrate
register_interaction_substates(::SomeRepulsiveForce) = (
    :position => 3,
    :velocity => 3,
)

# When implementing the interaction behaviour, we know we'll have
# state.position and state.velocity available to us
function evaluate_interaction(model::SomeRepulsiveForce, dstate::BlockState, ibody)
    dstate.velocity[1, ibody] += model.constant * ...    # add force
    ...
end
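
For illustration, a hypothetical sketch (mine, not part of the actual codebase) of how such registrations could be collected into the NamedTuple of ValRanges that BlockState expects - build_substates is a placeholder name:

function build_substates(models...)
    # Gather (name => ndof) pairs from every model, keeping first-registration order
    names, ndofs = Symbol[], Int[]
    for m in models, (name, n) in register_interaction_substates(m)
        if !(name in names)
            push!(names, name)
            push!(ndofs, n)
        end
    end
    # Assign consecutive row ranges; this runs once at setup, so type stability is not critical
    stops  = cumsum(ndofs)
    starts = stops .- ndofs .+ 1
    return NamedTuple{Tuple(names)}(Tuple(map(ValRange, starts, stops)))
end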

Overall, this architecture means that:

  1. Setting up a simulation case is declarative: just name the specific physical models you want to use.
  2. The simulation is set up with just the substates - i.e. rows in the state Matrix - it will need.
  3. The physical model implementations are imperative and work directly on the backing matrix; this minimises global memory writes and eliminates allocations*. They’re as efficient as what you’d write in C - and easy to read/write.
  4. You don’t hardcode the number of substates, their exact row indices and order, nor declare new structs ahead of time; when the simulation is created, it encodes into BlockState via that .substates::NamedTuple of ValRange what indices represent - so you can implement interactions where you do write bodies_state.velocity[1, i] but it gets converted at compile-time into bodies_state.raw_state_matrix[<VelocityRowIndex1>, i], regardless of whether you also have temperature or orientation or other substates in there - the compiled code is as if you manually wrote the right indices.
  5. The backing raw state Matrix works directly with DifferentialEquations.jl integrators; it can well be a CuMatrix or MtlMatrix or other GPU array types.

*Of course you could write non-allocating code with immutable structs depending on how you structure it, but if individual physical models work on different substates (Model1 only updates velocity, Model2 only updates temperature, etc.) you either have to keep overwriting the whole struct with just a few new values, or have a temporary mutable buffer that you modify the relevant parts of and do a batch write at the end; writing atomically in parallel contexts is more difficult too. You avoid all this by just having a single backing Matrix that you index into by the logical substates as above.

This sounds a bit more sales-y than I’d like, but these are the needs and constraints that led to the above design decisions - and why figuring out how to turn everything into compile-time known indices (i.e. the zero-cost abstraction) in this thread mattered; which we did with your help πŸ™‚

So here’s a weird thing. If I start from a fresh session on v1.12.2 and run most of the original post’s code plus the getposition caller (pastable begin block below),
       begin
       struct ValRange{START, STOP}
           start::Val{START}
           stop::Val{STOP}
       end

       ValRange(start::Integer, stop::Integer) = ValRange(Val(start), Val(stop))
       struct BlockSubstate{S, START, STOP}
           state::S
           rows::ValRange{START, STOP}
       end

       @inline function Base.getindex(
           s::BlockSubstate{S, START, STOP},
           irow::Integer,
           icol::Integer,
       ) where {S, START, STOP}
           s.state[START - 1 + irow, icol]
       end

       @inline function Base.setindex!(
           s::BlockSubstate{S, START, STOP},
           x,
           irow::Integer,
           icol::Integer,
       ) where {S, START, STOP}
           s.state[START - 1 + irow, icol] = x
       end
       struct BlockState{M <: AbstractMatrix, D <: NamedTuple}
           state::M
           substates::D
       end

       @inline function Base.getproperty(s::BlockState, name::Symbol)
           if name === :state
               getfield(s, :state)
           elseif name === :substates
               getfield(s, :substates)
           else
               # Access a given ValRange from the .substates NamedTuple
               irange = getfield(s, :substates)[name]
               state = getfield(s, :state)
               return BlockSubstate(state, irange)
           end
       end
       # Copy positional components into velocity (ignore physics for the moment)
       function direct_access(bodies_state::BlockState)
           state = bodies_state.state      # This is the raw `Matrix`
           for i in axes(state, 2)
               state[4, i] = state[1, i]
               state[5, i] = state[2, i]
               state[6, i] = state[3, i]
           end
           nothing
       end
       function property_access(bodies_state::BlockState)
           for i in axes(bodies_state.state, 2)
               bodies_state.velocity[1, i] = bodies_state.position[1, i]
               bodies_state.velocity[2, i] = bodies_state.position[2, i]
               bodies_state.velocity[3, i] = bodies_state.position[3, i]
           end
           nothing
       end
       state = rand(Float32, 13, 1000)
       substates = (
           position=ValRange(1, 3),
           velocity=ValRange(4, 6),
           orientation=ValRange(7, 10),
           angular_velocity=ValRange(11, 13),
       )

       bodies_state = BlockState(state, substates)
       getposition(x) = x.position
       using BenchmarkTools
       end

then run a benchmark, we see the extra allocations from getproperty:

julia> begin
       @btime getposition($bodies_state)
       @btime property_access($bodies_state)
       @btime direct_access($bodies_state)
       end
  212.037 ns (1 allocation: 16 bytes)
  1.441 ms (6000 allocations: 93.75 KiB)
  567.935 ns (0 allocations: 0 bytes)

But if I then redefine the getproperty method without @inline, I get 0 allocations and we don’t see the dynamic dispatch shown previously:

julia> function Base.getproperty(s::BlockState, name::Symbol)
           if name === :state
               getfield(s, :state)
           elseif name === :substates
               getfield(s, :substates)
           else
               # Access a given ValRange from the .substates NamedTuple
               irange = getfield(s, :substates)[name]
               state = getfield(s, :state)
               return BlockSubstate(state, irange)
           end
       end

julia> begin
       @btime getposition($bodies_state)
       @btime property_access($bodies_state)
       @btime direct_access($bodies_state)
       end
  1.800 ns (0 allocations: 0 bytes)
  567.391 ns (0 allocations: 0 bytes)
  566.848 ns (0 allocations: 0 bytes)

julia> @code_llvm getposition(bodies_state)
; Function Signature: getposition(Main.BlockState{Array{Float32, 2}, NamedTuple{(:position, :velocity, :orientation, :angular_velocity), Tuple{Main.ValRange{1, 3}, Main.ValRange{4, 6}, Main.ValRange{7, 10}, Main.ValRange{11, 13}}}})
;  @ REPL[1]:73 within `getposition`
; Function Attrs: uwtable
define { ptr } @julia_getposition_4459(ptr nocapture noundef nonnull readonly align 8 dereferenceable(8) %"x::BlockState") #0 {
top:
; β”Œ @ REPL[3]:9 within `getproperty`
   %"x::BlockState.state" = load atomic ptr, ptr %"x::BlockState" unordered, align 8
; β”‚ @ REPL[3]:10 within `getproperty`
; β”‚β”Œ @ REPL[1]:9 within `BlockSubstate`
    %.unbox.fca.0.insert = insertvalue { ptr } zeroinitializer, ptr %"x::BlockState.state", 0
    ret { ptr } %.unbox.fca.0.insert
; β””β””
}

I have zero idea why @inline would sabotage this optimization, but at least this fits the ability of the compiler to infer the output type given a literal :position. Might be worth an issue if it could be reproduced and reduced?


I can indeed reproduce this on Julia 1.12.1 (though not on 1.11.5, where I always get dynamic dispatch, regardless of @inline).


I’m not trying to convince you, but I would genuinely be interested to see how Ark.jl fares on this use case. I think what you created is very similar, and perhaps your knowledge of the structure of the problem makes it possible to outperform a generic framework, but I believe Ark is at least trying to be generic and performant. @mlange-42 may be able to give a better assessment than myself. I recently refactored a simulation by making it a wrapper around the Ark API; on the outside it looks almost identical to what you describe. It got more performant and much more flexible than what I had before, although I didn't go to such great lengths as you to optimize the previous design.
