# Making an iterator that splats keys and values quickly into kwargs

This question is about how to improve iterator performance when splatting into keyword arguments, as in `(; x...)`.

I have information stored in a structure something like this:

``````julia
struct MyStruct{T1A, T1B, T2A, T2B}
    a::NamedTuple{T1A, T1B}
    b::NamedTuple{T2A, T2B}
end
``````

There can be some overlap in the values of `a` and `b`, so for this splat I’d like to prioritize one (but I don’t intend to throw away information or store a `merge`d `NamedTuple`). I create some iterator functions, something like this:

``````julia
Base.getindex(obj::MyStruct{T1A,T1B,T2A,T2B}, k) where {T1A,T1B,T2A,T2B} = getfield(k ∈ T2A ? obj.b : obj.a, k)
Base.keys(obj::MyStruct) = keys(merge(obj.a, obj.b))
Base.values(obj::MyStruct) = (obj[k] for k ∈ keys(obj))
Base.iterate(obj::MyStruct, itr=zip(keys(obj), values(obj))) = Iterators.peel(itr)
``````

Then I test it for speed:

``````julia
@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
# args splat     11.500 μs (2 allocations: 62.55 KiB)
@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
# kwargs splat   850.900 μs (22004 allocations: 1.07 MiB)
``````

It’s currently (much!) faster splatting into args `(x...,)` than kwargs `(; x...)`, but I want it to be fast splatting into kwargs instead (I don’t care about args splatting because it causes type instability in the subsequent steps).

It seems as if `NamedTuple`s have two different variants of iterators: one that’s called when splatting into args (lines 130 and 139 of namedtuple.jl), which yields values without keys, and one called when splatting into kwargs (???). For example:

``````julia
test = (a=1, b=4, c=5, d=6)
@show (test...,)    # (1, 4, 5, 6)
@show (; test...)   # (a = 1, b = 4, c = 5, d = 6)
``````

and their performance is comparable:

``````julia
@btime let; [(merge((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
# args splat     6.680 μs (2 allocations: 31.30 KiB)
@btime let; [(; merge((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
# kwargs splat   6.720 μs (2 allocations: 31.30 KiB)
``````

but AFAIK I can only define one variant of `Base.iterate`, and its performance when splatting into kwargs is horrible.

Am I missing something?

Splatting two named tuples into keyword arguments automatically merges them.
For instance:

``````julia
kw1 = (a=1, b=2)
kw2 = (b=20, c=3)
@assert (; kw1..., kw2...) == (a=1, b=20, c=3)
@btime (; kw1..., kw2...)   # 34.105 ns (1 allocation: 32 bytes)  [on my machine]
``````

Does this help?


If you want an iterator for `MyStruct`, here is a suggestion:

``````julia
function Base.iterate(x::MyStruct{T1A,<:Any,T2A,<:Any}, (allkeys, nextkey)=(keys(x), 1)) where {T1A,T2A}
    if nextkey > length(allkeys)
        return nothing
    end
    key, nextkey = iterate(allkeys, nextkey)
    val = key in T2A ? x.b[key] : x.a[key]
    return ((key, val), (allkeys, nextkey))
end

Base.keys(x::MyStruct{T1A,<:Any,T2A,<:Any}) where {T1A,T2A} = (union(T1A, T2A)...,)

Base.length(x::MyStruct) = length(keys(x))

kw1 = (a=1, b=2)
kw2 = (b=20, c=3)
x = MyStruct(kw1, kw2)

@btime (; x...)  # 1.750 μs (43 allocations: 2.27 KiB)  [on my machine]

(; x...)   # => (a=1, b=20, c=3)
``````

Although I believe there is probably a better way around it, this is a starting point based on my previous answer.

Now both splatting approaches are slow, ha.

``````julia
@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
#    1.781 ms (58004 allocations: 2.76 MiB)
@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
#    816.800 μs (34010 allocations: 1.60 MiB)
``````

The primary contributor to the runtime was `union`. This function has horrible performance here.
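To make that concrete (my own illustration, not from the thread): `union` over tuples returns a heap-allocated `Vector` whose length is only known at runtime, so splatting its result back into a tuple cannot be inferred.

```julia
# `union` over two symbol tuples returns a Vector, not a tuple,
# so its length is a runtime value rather than part of the type.
ks = union((:a, :b), (:b, :c))
println(typeof(ks))  # Vector{Symbol}
println(ks)          # [:a, :b, :c]
```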

Changing `Base.keys` back to using `merge` instead:

``````julia
Base.keys(x::MyStruct{T1A,<:Any,T2A,<:Any}) where {T1A,T2A} = keys(merge(x.a, x.b))
``````

brings them back to their original (still way too slow for kwargs) times:

``````julia
@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
# 847.200 μs (22004 allocations: 1.07 MiB)
@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
# 11.000 μs (2 allocations: 62.55 KiB)
``````

Admittedly, it’s new to me that optional arguments can be specified as tuples. That’s really cool!
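For anyone else to whom this is new, here is a minimal standalone sketch (the names are made up): a default argument written as a tuple is destructured directly in the signature.

```julia
# `(lo, hi)` destructures the second argument; its default is the tuple (0, 1).
clamp2(x, (lo, hi) = (0, 1)) = min(max(x, lo), hi)

clamp2(1.5)          # uses the default bounds → 1.0
clamp2(1.5, (0, 2))  # explicit bounds → 1.5
```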

Could be promising to combine both ideas:

``````julia
Base.iterate(x::MyStruct{T1A,<:Any,T2A,<:Any}, (allitems, i) = ((; x.a..., x.b...), 1)) where {T1A,T2A} = begin
    i > length(allitems) && return nothing
    (item, i_next) = iterate(allitems, i)
    (item, (allitems, i_next))
end
``````

This runs fast in args splatting, but unfortunately when trying to splat into kwargs it throws an error (since this iterator simply produces values without keys, like args-splatting a `NamedTuple`).

I don’t know how to tell Julia “Hey, when kwargs-splatting this thing, do it the way you’d kwargs-splat a `NamedTuple`. In fact, here just take this `nt::NamedTuple` and kwargs-splat it!”

`union` is doing something very similar to what I commented on for `Dict` in your last issue.

It splats the arguments into a vector, which has to be an allocated block of memory with a length known only at runtime, then splats that back into a tuple whose length needs to be known at compile time.

Better to keep everything in tuples, use functional transformations of the tuples, and let it all happen at compile time. This compiles away:

``````julia
function Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    reduce(K2; init=K1) do acc, k
        k in acc ? acc : (acc..., k)
    end
end

x = MyStruct((a=1, b=2), (b=3, c=4))
using BenchmarkTools

julia> @btime keys($x)
  1.750 ns (0 allocations: 0 bytes)
(:a, :b, :c)
``````

And instead of using `iterate` at all, we can just define `merge` directly for `MyStruct` (and `values`, which is also useful):

``````julia
x = MyStruct((a=1, b=2, d=4, e=5, f=6), (b=3, c=4))
function Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    reduce(K2; init=values(x.a)) do acc, k
        k in K1 ? acc : (acc..., getfield(x.b, k))
    end
end
function Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    (; nt..., NamedTuple{keys(x)}(values(x))...)
end

julia> @btime (; $x...)
  3.480 ns (0 allocations: 0 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)
``````

Edit: and kwargs benchmarks now match args

``````julia
julia> @btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
  13.198 μs (2 allocations: 31.30 KiB)

julia> @btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
  13.183 μs (2 allocations: 31.30 KiB)
``````

I didn’t know you could interpolate `$x` in `@btime`. I wish I had known about this sooner!

This solves my problems perfectly! Thank you! I never thought to define a custom `merge` method; this language really keeps surprising me to the upside.


Yeah, this is a thing I dislike about Julia: many methods in Base return arrays when a generator would be better and just one `collect` away from the same result…


Also one of my main peeves. I think it was a simplification in early design to just default to using `Array`, but it seems like that was a mistake. `map` and `union` over tuples and named tuples should have always returned some kind of tuple.


This describes what `merge` does, or at least what the code says it should do. It calls `merge_names`, which splats a `Tuple` into a `Vector` and `push!`es into it several times, before splatting back into a `Tuple` (from namedtuple.jl):

``````julia
@pure function merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end
``````

Yet, this function runs lickety-split:

``````julia
julia> @btime Base.merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)
``````

so apparently the compiler is doing something very intelligent.

So I tried exploring a little bit. I defined my own function `my_merge_names` like so:

``````julia
function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end
``````

Exactly the same, right? Let’s see its runtime:

``````julia
julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  248.529 ns (3 allocations: 224 bytes)
(:a, :b, :c, :d, :e, :f)
``````

Yikes! What if we add that `@pure` macro call back?

``````julia
Base.@pure function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end
``````

we get:

``````julia
julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)
``````

boom pow ouchie wowchie fast. Now I must investigate what black magic this `@pure` thing is adding.

But then, if I make it return a vector instead of a tuple, I get poor performance again:

``````julia
Base.@pure function my_merge_names_vec(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    names  # <- all I did is remove the tuple splat
end

julia> @btime my_merge_names_vec((:a,:b,:c),(:d,:e,:f))
  242.534 ns (2 allocations: 160 bytes)
6-element Vector{Symbol}:
 :a
 :b
 :c
 :d
 :e
 :f
``````

The `@pure` macro asserts function purity (it tells the compiler the function has zero external effects), so calls with constant arguments can be evaluated entirely at compile time.

But it’s *very* pure, and using it in your own code is a disaster: you can no longer extend those methods, or use Revise on them, and many other things no one understands except Jameson Nash. `@assume_effects` is the newer, slightly safer way of doing things like this; `@generated` functions are another way.

From my example you can see you don’t always need these macros. And if you don’t, don’t use them. Either this use of `@pure` in Base is unnecessary, or it used to be necessary and isn’t anymore, or `NamedTuple` `merge` is so important it can’t be left up to the whims of the compiler, which may stop compiling functional methods like mine away at a certain `Tuple` size (that would be my guess).

Don’t play with the dark arts

(the vector is slow because you allocated memory to get it; tuples don’t allocate memory or even leave the registers unless they have to)
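A tiny self-contained illustration of that point (my own example), using Base’s `@allocated` rather than BenchmarkTools:

```julia
as_vec() = Symbol[:a, :b, :c]  # heap-allocates a fresh Vector on every call
as_tup() = (:a, :b, :c)        # a tuple constant; no allocation needed

as_vec(); as_tup()             # warm up (compile) before measuring
println(@allocated as_vec())   # > 0 bytes
println(@allocated as_tup())   # 0 bytes
```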


IMO it is a historical thing.
Many functions in Base are much older than generators being easy in Julia.


Nice!

``````julia
julia> Base.@assume_effects :foldable function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           names = Symbol[an...]
           for n ∈ bn
               n ∈ an || push!(names, n)
           end
           (names...,)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)
``````

I need more practice figuring out how to translate a function like `merge_names` into a `@generated` form… the manual examples were basically just reimplementations of multiple dispatch.
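For what it’s worth, a `@generated` translation can’t take the bare `Symbol` tuples as arguments, because the symbol values aren’t part of a `Tuple{Symbol,Symbol}` type; it has to dispatch on something that carries the names in its type, such as a `NamedTuple`. A hedged sketch (my own naming, not from Base):

```julia
# Sketch: merge the key names of two NamedTuples at compile time.
# The key tuples `an` and `bn` are type parameters, so they are
# available inside the generated body.
@generated function gen_merge_names(::NamedTuple{an}, ::NamedTuple{bn}) where {an,bn}
    names = Symbol[an...]
    for n in bn
        n in an || push!(names, n)
    end
    # Splice the result in as a literal tuple of symbols
    return Expr(:tuple, map(QuoteNode, names)...)
end

gen_merge_names((a=1, b=2), (b=3, c=4))  # (:a, :b, :c)
```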

Yes absolutely! This is just as performant, and doesn’t rely on black magic:

``````julia
julia> function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           names = reduce(bn; init=an) do acc, k
               k ∈ an ? acc : (acc..., k)
           end
           (names...,)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)
``````

Trying a sugary generator approach…

``````julia
julia> function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           (an..., (k for k ∈ bn if k ∉ an)...)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  2.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)
``````

Almost as good. I wonder where the extra 1ns is coming from…

Numbers like 2ns and 1ns generally indicate the whole thing is being pushed to compile time.
Or at least they used to.

I suggest looking at the `@code_typed` and `@code_llvm` output for them.


Another version that runs in 1ns and does the job in one line:

``````julia
function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    (an..., filter(k -> k ∉ an, bn)...)
end
``````

Oh how strange, I came back to revisit this and the generator stopped adding 1ns. I might be imagining things, but it did seem like a consistent 1ns adder over numerous `@btime` runs. Oh well, it looks like generators are fast!

Oh wow.

So, trying this approach to make one-line solutions for the original problem, using a `Generator` becomes hugely slow! What the heck! The same approach, which previously was quite performant, is now painfully slow.

Setup:

``````julia
using BenchmarkTools
struct MyStruct{K1,V1,K2,V2}
    a::NamedTuple{K1,V1}
    b::NamedTuple{K2,V2}
end
x = MyStruct((a=1, b=2, d=4, e=5, f=6), (b=3, c=4))
``````

First, making one-line solutions using `map` and `filter` to set a speed baseline:

``````julia
Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (K1..., filter(k -> k ∉ K1, K2)...)
Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (values(x.a)..., map(Base.Fix1(getfield, x.b), filter(k -> k ∉ K1, K2))...)
Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (; nt..., NamedTuple{keys(x)}(values(x))...)
``````

Runtime with `map` and `filter`:

``````julia
julia> @btime keys($x)
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :d, :e, :f, :c)

julia> @btime values($x)
  2.300 ns (0 allocations: 0 bytes)
(1, 2, 4, 5, 6, 4)

julia> @btime (; $x...)
  2.600 ns (0 allocations: 0 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)
``````

Nice and fast.

Okay, now trying one-line solutions using generators, code is almost identical:

``````julia
Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (K1..., (k for k ∈ K2 if k ∉ K1)...)
Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (values(x.a)..., (getfield(x.b, k) for k ∈ K2 if k ∉ K1)...)
Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} =
    (; nt..., NamedTuple{keys(x)}(values(x))...)
``````

Runtime with Generators:

``````julia
julia> @btime keys($x)
  309.045 ns (2 allocations: 96 bytes)
(:a, :b, :d, :e, :f, :c)

julia> @btime values($x)
  410.326 ns (5 allocations: 256 bytes)
(1, 2, 4, 5, 6, 4)

julia> @btime (; $x...)
  1.480 μs (11 allocations: 560 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)
``````

Oof! Generators are slow af!

If I’m missing something, I don’t see it… it just seems like generators sometimes become slow without rhyme or reason.

Considering that generators are themselves just syntax sugar for `map` and `filter` over iterators, this shouldn’t be the case?

Generators are not slow! It’s the unknown-length splat into a tuple that is slow.

`map` and `reduce` use recursion, and at this complexity that compiles away completely. But a generator is an iterator, like a for loop, that the compiler isn’t unrolling for you. So the length is a runtime value, and all that time is type instability on the tuple length.

Essentially nothing in Julia is slow unless there is type instability; it’s LLVM doing compilation just like it would in C. When you see terrible times like this, assume type instability, and use `@code_warntype`, `Cthulhu.@descend`, and `ProfileView.@profview` to work out what is causing it.

And also, for things like this you should just use recursion; there is a reason functional languages use it for everything.
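To make that concrete, a minimal sketch of the recursive style (the helper name is my own): peel one element per call, so the compiler sees every intermediate tuple length.

```julia
# Recursively drop the keys in `exclude` from the tuple `t`; each call works
# on a tuple one element shorter, so all lengths are known at compile time.
drop_keys(::Tuple{}, exclude) = ()
function drop_keys(t::Tuple, exclude)
    rest = drop_keys(Base.tail(t), exclude)
    first(t) in exclude ? rest : (first(t), rest...)
end

drop_keys((:a, :b, :c), (:b,))  # (:a, :c)
```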


I guess my confusion arises because I mistook the operation of generators based on this comment, understanding them basically to provide a sugary lazy interface to `map` and `filter`, and because when writing `my_merge_names` with generators as above it was compiled away and ran fast. Yet here in `Base.keys(::MyStruct)` the generator does exactly the same thing but allocates at runtime and runs slow. It seems like a challenge to figure out when or whether a generator will be unrolled by the compiler.

Youre confusing unrolling with constant propagation. If all the values are known at compile time the compiler can just run the function and give you the answer instead of other instructions. That’s what seems to be happening in your example that was fast.