[Edit 21/03/07: the discussion in this thread is about correct @pure use, triggered by the @pure annotation in my initial question about NamedTuple type processing “with zero runtime costs”, which follows.]
I have a custom Base.getproperty implementation for a quite generic type PStruct, which uses reflection on the type parameters of a NamedTuple type.
It works, but it is slow (more than a factor of 1000 slower than hand-coding its methods without runtime reflection).
I am asking for help in coding the function in a way that runtime reflection “compiles away”.
My topic is somewhat similar to this discussion: efficient-reflection-on-structs. However, I could not successfully adapt the solution hints given there.
Code snippet with the function in question (the full code is available as a Julia package, see below):
primitive type PStruct{T<:NamedTuple} 64 end
Base.@pure function Base.getproperty(x::PStruct{T},s::Symbol) where T<:NamedTuple
    @inbounds begin
        shift = 0
        types = T.parameters[2].parameters
        syms = T.parameters[1]
        idx = 1
        while idx <= length(syms)
            type = types[idx]
            bits = bitsizeof(type)
            if syms[idx]===s
                v = _get(reinterpret(UInt64,x),Val(shift),Val(bits))
                return _convert(type,v)
            end
            shift += bits
            idx += 1
        end
        # unknown field name; ArgumentError takes a string message
        throw(ArgumentError("PStruct has no field $s"))
    end
end
# used helper methods look like:
@inline _get(pstruct::UInt64, shift, bits) = (pstruct>>>shift) & _mask(bits)
@inline _get(pstruct::UInt64, ::Val{shift},::Val{bits}) where {shift,bits} = (pstruct>>>shift) & _mask(bits)
@Base.pure bitsizeof(::Type{T}) where T = sizeof(T)*8
bitsizeof(::Type{Bool}) = 1
# more methods of bitsizeof exist, very similar structure
_convert(::Type{type},v::UInt64) where type = convert(type,v)
_convert(::Type{UInt64},v::UInt64) = v # to avoid ambiguity
_convert(::Type{type},v::UInt64) where type<:Signed = (v%Int64)<<(64-bitsizeof(type))>>(64-bitsizeof(type))
# more methods of _convert exist, very similar structure
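To make the helper arithmetic concrete, here is a small worked sketch of the shift-and-mask extraction and the sign extension. `_mask` is not shown in this post, so `_mask_sketch` below is an assumed stand-in (a UInt64 with the lowest `bits` bits set). All `_sketch` names are hypothetical and not part of the package.

```julia
# Assumed stand-in for the package's _mask: lowest `bits` bits set.
_mask_sketch(bits) = (UInt64(1) << bits) - UInt64(1)

# Same shape as _get above: shift the field down to bit 0, then mask it off.
_get_sketch(pstruct::UInt64, shift, bits) = (pstruct >>> shift) & _mask_sketch(bits)

# Sign extension as in the Signed method of _convert, with the bit count passed explicitly:
# move the field's top bit into bit 63, then shift back arithmetically.
_signed_sketch(v::UInt64, bits) = (v % Int64) << (64 - bits) >> (64 - bits)

# Pack three fields by hand: a 1-bit flag at shift 0, a 4-bit unsigned value (9) at shift 1,
# and a 4-bit signed value (-3, two's complement 0b1101) at shift 5.
packed = UInt64(1) | (UInt64(9) << 1) | (UInt64(0b1101) << 5)

flag  = _get_sketch(packed, 0, 1)                      # raw bits of the flag
small = _get_sketch(packed, 1, 4)                      # raw bits of the unsigned field
neg   = _signed_sketch(_get_sketch(packed, 5, 4), 4)   # sign-extended to Int64
```

With these inputs, `flag` is 1, `small` is 9 and `neg` is -3. When `shift` and `bits` are compile-time constants, each access reduces to one shift and one AND, which is what the hand-coded variants in the benchmarks below simulate.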
To improve this, I have put all the reflection work into a @pure function _fielddescr that takes only type parameters.
In theory, the compiler could recognize that any method of _fielddescr returns a constant value, and replace the call by that constant.
However, runtime performance improves only a little. Apparently, my logic is too complex for the compiler to replace _fielddescr calls by constants. Code:
@Base.pure function _fielddescr(::Type{PStruct{T}},::Val{s}) where {T<:NamedTuple,s} # s isa Symbol
    shift = 0
    types = T.parameters[2].parameters
    syms = T.parameters[1]
    idx = 1
    while idx <= length(syms)
        type = types[idx]
        bits = bitsizeof(type)
        if syms[idx]===s
            return type,shift, bits
        end
        shift += bits
        idx += 1
    end
    # unknown field name; ArgumentError takes a string message
    throw(ArgumentError("PStruct has no field $s"))
end
@inline Base.@pure function getpropertyV2(x::PStruct{T},s::Symbol) where T<:NamedTuple
    type,shift,bits = _fielddescr(PStruct{T},Val(s))
    return _convert(type,_get(reinterpret(UInt64,x),shift,bits))
end
I wrote some simple benchmarks that call both variants in a loop over a Vector{PStruct}. For comparison, I ran the same loop on a mostly equivalent ordinary struct, and I also wrote getproperty variants by hand for the concrete types and symbols, replacing the _fielddescr call by its return value.
Results:
@btime bench(sv): some work on an ordinary struct, in a loop on a Vector to get stable timings
80.144 ns (0 allocations: 0 bytes)
@btime bench(psv): same work on PStruct having same fields as struct in preceding benchmark
2.457 ms (412 allocations: 6.44 KiB)
@btime benchV2(psv): same work, but using getpropertyV2 instead of getproperty for PStruct field access
304.299 µs (400 allocations: 12.50 KiB)
@btime benchV3(psv): same work, but handcoded getpropertyV3 replacing _fielddescr call by its result (simulated constant propagation)
115.026 ns (0 allocations: 0 bytes)
@btime benchV4(psv): same work, but handcoded getpropertyV4 with resulting SHIFT and AND operation
113.758 ns (0 allocations: 0 bytes)
Ideal solution: a trick to code _fielddescr in such a way that the compiler replaces calls by their result, while keeping clear, maintainable Julia source code.
Acceptable alternative: a reformulation of getpropertyV2 (or _fielddescr) as a @generated function, where _fielddescr(PStruct{T},Val(s)) is called at compile time and its result is pasted into the generated code. However, writing that is beyond my current level of macro/Expr knowledge. Maybe someone can help?
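As a rough illustration of that alternative, here is a minimal sketch of a @generated accessor. It is not the package's code: `PS`, `bits_sketch` and `getfield_sketch` are hypothetical names, only Bool, unsigned and signed integer fields are handled, and the mask is computed inline instead of calling `_mask`. The generator walks the field names and types once per concrete `PS{T}` and field symbol, and returns an expression in which shift and mask are literal constants.

```julia
# Hypothetical stand-in for PStruct: 64 bits, layout described by a NamedTuple type.
primitive type PS{T<:NamedTuple} 64 end

# Bit widths; must be defined before the generated function that calls them.
bits_sketch(::Type{T}) where T = sizeof(T) * 8
bits_sketch(::Type{Bool}) = 1

@generated function getfield_sketch(x::PS{T}, ::Val{s}) where {T<:NamedTuple, s}
    # This body runs at compile time; here T is a type and s is a Symbol.
    shift = 0
    for (sym, ftype) in zip(fieldnames(T), fieldtypes(T))
        bits = bits_sketch(ftype)
        if sym === s
            mask = (UInt64(1) << bits) - UInt64(1)
            raw = :((reinterpret(UInt64, x) >>> $shift) & $mask)
            if ftype <: Signed
                # Sign-extend from `bits` bits, as in the Signed method of _convert.
                return :(convert($ftype, ($raw % Int64) << $(64 - bits) >> $(64 - bits)))
            end
            return :(convert($ftype, $raw))
        end
        shift += bits
    end
    msg = "no field $s"
    return :(throw(ArgumentError($msg)))
end

# Whether Val(s) is resolved at compile time here depends on constant
# propagation of the literal symbol in an expression like `x.small`.
Base.getproperty(x::PS, s::Symbol) = getfield_sketch(x, Val(s))

# Usage: Bool at bit 0, UInt8 at bits 1-8, Int8 (-3 is 0xfd) at bits 9-16.
NT = NamedTuple{(:flag, :small, :neg), Tuple{Bool, UInt8, Int8}}
packed = UInt64(1) | (UInt64(9) << 1) | (UInt64(0xfd) << 9)
x = reinterpret(PS{NT}, packed)
```

With this layout, `x.flag` is `true`, `x.small` is `0x09` and `x.neg` is `Int8(-3)`. The generated body contains no loop and no reflection, but the approach only pays off if the field symbol is a compile-time constant at the call site; if the symbol is a runtime value, `Val(s)` still has to dispatch dynamically.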
A surprising observation in the benchmark results is the allocations: their count roughly matches the number of getproperty calls (400 per benchmark run). I do not see any allocation in the code (it does not use heap objects). Did I overlook something?
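One possible source of such allocations, offered as an assumption rather than a diagnosis of the package: a `Val` built from a value that is only known at run time (like `Val(shift)`, `Val(bits)` or `Val(s)` above) has no concrete inferred type, so the following call has to be dispatched dynamically, which can box values. The sketch below uses two hypothetical helper functions and `Base.return_types` to show the difference in inference.

```julia
# Val built from a value known only at run time: the result type depends on the value.
runtime_val(n::Int) = Val(n)

# Val built from a literal: the result type is known when the method is compiled.
literal_val() = Val(3)

# Ask inference what each function returns for the given argument types.
inferred_runtime = first(Base.return_types(runtime_val, Tuple{Int}))
inferred_literal = first(Base.return_types(literal_val, Tuple{}))

println("runtime Val inferred as: ", inferred_runtime,
        " (concrete: ", isconcretetype(inferred_runtime), ")")
println("literal Val inferred as: ", inferred_literal,
        " (concrete: ", isconcretetype(inferred_literal), ")")
```

Running `@code_warntype` on the benchmark loop and on a single getproperty call would be one way to check whether this is what happens in the actual code.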
All code is available here: PackedStructs
PStruct and its functions are defined in src/PackedStructs.jl, test/basics.jl contains examples and benchmarks.