My problem is about an interface, and how to get the compiler to optimise it away.
Say I have a Matrix with 6 rows and N columns - rows 1-3 represent position and rows 4-6 are velocity; each column is a body in a simulation.
Written by hand, Iβd like this simple code:
# Copy positional components into velocity (ignore physics for the moment)
function direct_access(bodies_state::BlockState)
state = bodies_state.state # This is the raw `Matrix`
for i in axes(state, 2)
state[4, i] = state[1, i]
state[5, i] = state[2, i]
state[6, i] = state[3, i]
end
nothing
end
To be what the compiler would optimise to from this interface:
function property_access(bodies_state::BlockState)
for i in axes(bodies_state.state, 2)
bodies_state.velocity[1, i] = bodies_state.position[1, i]
bodies_state.velocity[2, i] = bodies_state.position[2, i]
bodies_state.velocity[3, i] = bodies_state.position[3, i]
end
nothing
end
In short, I want to think in terms of logical substates (position, velocity, other ones), but optimise down to direct indices (i.e. remap .velocity[1, icol] to .state[4, icol] at compile time).
Pre-emptively answering XY-problem questions:
- I need this interface to be general enough to support different substates depending on the simulation, so I cannot hardcode just the names
positionandvelocity. - I want to avoid making views, as parts of this code will run on a GPU, where views do have some cost associated with them; they also incur some extra pointer arithmetic.
- This format is closer to an array of structures than a structure of arrays, but there is complicated logic I need to do on each body, so itβs better to keep the per-body components close in memory.
That said, I think this is a general enough interface style that its implementation would be useful to more people.
What I tried to create is below; I thought the compiler would have enough information to optimise the interface away, but benchmarking and Cthulhu show otherwise. If anyone could help me 1) understand where optimisations fail and why, and 2) how to fix this - Iβd really appreciate it.
First, we can encode the range of rows spanned by a substate at the type-level:
struct ValRange{START, STOP}
start::Val{START}
stop::Val{STOP}
end
ValRange(start::Integer, stop::Integer) = ValRange(Val(start), Val(stop))
And we can create a BlockSubstate containing a pointer to the raw full matrix and the associated ValRange for a given substate, and overload getindex and setindex to remap row indices:
struct BlockSubstate{S, START, STOP}
state::S
rows::ValRange{START, STOP}
end
@inline function Base.getindex(
s::BlockSubstate{S, START, STOP},
irow::Integer,
icol::Integer,
) where {S, START, STOP}
s.state[START - 1 + irow, icol]
end
@inline function Base.setindex!(
s::BlockSubstate{S, START, STOP},
x,
irow::Integer,
icol::Integer,
) where {S, START, STOP}
s.state[START - 1 + irow, icol] = x
end
And we can save the whole BlockState with the raw matrix and a NamedTuple mapping the substate names to a given ValRange of rows. Then we can overload getproperty to create a BlockSubstate when accessing a substate:
struct BlockState{M <: AbstractMatrix, D <: NamedTuple}
state::M
substates::D
end
@inline function Base.getproperty(s::BlockState, name::Symbol)
if name === :state
getfield(s, :state)
elseif name === :substates
getfield(s, :substates)
else
# Access a given ValRange from the .substates NamedTuple
irange = getfield(s, :substates)[name]
state = getfield(s, :state)
return BlockSubstate(state, irange)
end
end
So now we can create a state Matrix of, say, 1000 bodies, with 4 substates and wrap them up:
state = rand(Float32, 13, 1000)
substates = (
position=ValRange(1, 3),
velocity=ValRange(4, 6),
orientation=ValRange(7, 10),
angular_velocity=ValRange(11, 13),
)
bodies_state = BlockState(state, substates)
There it is, interface implemented; when we write bodies_state.velocity[1, icol], all required information should be known at compile-time: .velocity is translated to getproperty(bodies_state, :velocity), which we use to access the substates::NamedTuple, which should have the fields known at compile-time, create a temporary BlockSubstate which we then have the overloaded getindex on, and use the type-level values from the ValRange{4, 6} to remap 1 to START - 1 + 1, and ideally compile all that away into a simple state[4, icol].
That is not what happens:
using BenchmarkTools
println("Benchmarking direct access:")
display(@benchmark direct_access($bodies_state))
println("Benchmarking property access:")
display(@benchmark property_access($bodies_state))
# output
Benchmarking direct access:
BenchmarkTools.Trial: 10000 samples with 269 evaluations per sample.
Range (min β¦ max): 293.059 ns β¦ 508.364 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 318.773 ns β GC (median): 0.00%
Time (mean Β± Ο): 318.341 ns Β± 17.491 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
βββ ββ
ββββ
ββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββ β
293 ns Histogram: frequency by time 397 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Benchmarking property access:
BenchmarkTools.Trial: 5998 samples with 1 evaluation per sample.
Range (min β¦ max): 787.292 ΞΌs β¦ 1.928 ms β GC (min β¦ max): 0.00% β¦ 57.71%
Time (median): 818.062 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 832.980 ΞΌs Β± 47.899 ΞΌs β GC (mean Β± Ο): 0.23% Β± 1.98%
ββ
β
β
βββ
ββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββ β
787 ΞΌs Histogram: frequency by time 1.01 ms <
Memory estimate: 93.75 KiB, allocs estimate: 6000.
The code above is self-contained; I also added it to a Google Colab Notebook you can run.
The odd part is that Cthulhu shows type inference to work well, for example:
...
105 bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.velocity::BlockSubstate{Matrix{Float32}, 4, 6}[1, i] = bodies_state::BlockState{Matrix{Float32}, @NamedTuple{position::ValRange{1, 3}, velocity::ValRange{4, 6}, orientation::ValRange{7, 10}, angular_velocity::ValRange{11, 13}}}.position::BlockSubstate{Matrix{Float32}, 1, 3}[1, i]::Float32
...
Using AllocCheck on property_access I see:
Allocating runtime call to "jl_get_pgcstack_static" in unknown location
Which I donβt really know how to interpret.
The typed code shows:
...
β %12 = dynamic Base.getproperty(bodies_state, :position)::BlockSubstate{Matrix{Float32}, 1, 3}
β %13 = i::Int64
β %14 = dynamic Base.getindex(%12, 1, %13)::Float32
β %15 = dynamic Base.getproperty(bodies_state, :velocity)::BlockSubstate{Matrix{Float32}, 4, 6}
β %16 = i::Int64
β dynamic Base.setindex!(%15, %14, 1, %16)::Any
...
But the assembly shows a lot more indirection and branching (and a lot more instructions) for property_access than direct_access.
Thatβs as far as I got. Anyone have a clue whatβs happening here?