Making an iterator that splats keys and values quickly into kwargs

This question is about how to improve iterator performance when splatting into keyword arguments, as in (; x...).

I have information stored in a structure something like this:

struct MyStruct{T1A, T1B, T2A, T2B}
    a::NamedTuple{T1A, T1B}
    b::NamedTuple{T2A, T2B}
end

There can be some overlap in the keys of a and b, so for this splat I’d like to prioritize one of them (but I don’t intend to throw away information or store a merged NamedTuple). I create some iterator functions, something like this:

Base.getindex(obj::MyStruct{T1A,T1B,T2A,T2B}, k) where {T1A,T1B,T2A,T2B} = getfield(k ∈ T2A ? obj.b : obj.a, k)
Base.keys(obj::MyStruct) = keys(merge(obj.a, obj.b))
Base.values(obj::MyStruct) = (obj[k] for k ∈ keys(obj))
Base.iterate(obj::MyStruct, itr=zip(keys(obj), values(obj))) = Iterators.peel(itr)

Then I test it for speed:

@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
    # args splat     11.500 μs (2 allocations: 62.55 KiB)
@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
    # kwargs splat   850.900 μs (22004 allocations: 1.07 MiB)

It’s currently (much!) faster splatting into args (x...,) than into kwargs (; x...), but I want it to be fast splatting into kwargs instead (I don’t care about args splatting, because it causes type instability in the subsequent steps).

It seems as if NamedTuple has two different iteration paths: one used when splatting into args (lines 130 and 139 of namedtuple.jl, which yield values without keys), and one used when splatting into kwargs (???). For example:

test = (a=1, b=4, c=5, d=6)
@show (test...,)    # (1, 4, 5, 6)
@show (; test...)   # (a = 1, b = 4, c = 5, d = 6)

and their performance is comparable:

@btime let; [(merge((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
    # args splat     6.680 μs (2 allocations: 31.30 KiB)
@btime let; [(; merge((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
    # kwargs splat   6.720 μs (2 allocations: 31.30 KiB)

but afaik I can only make one variant of Base.iterate, and its performance splatting into kwargs is horrible.

Am I missing something? :pray:

Splatting two named tuples into keyword arguments automatically merges them.
For instance:

kw1 = (a=1,b=2)
kw2 = (b=20,c=3)
@assert (; kw1...,kw2...) == (a=1,b=20,c=3) 
@btime (; kw1...,kw2...)   # 34.105 ns (1 allocation: 32 bytes)  [on my machine]

Does this help?

1 Like

If you want an iterator for MyStruct, here is a suggestion:

function Base.iterate(x::MyStruct{T1A,<:Any,T2A,<:Any}, (allkeys, nextkey)=(keys(x), 1)) where {T1A,T2A}
    if nextkey > length(allkeys)
        return nothing
    end
    key, nextkey = iterate(allkeys, nextkey)
    val = key in T2A ? x.b[key] : x.a[key]
    return ((key, val), (allkeys, nextkey))
end

Base.keys(x::MyStruct{T1A,<:Any,T2A,<:Any}) where {T1A,T2A} = (union(T1A,T2A)...,)

Base.length(x::MyStruct) = length(keys(x))

kw1 = (a=1,b=2)
kw2 = (b=20,c=3)
x = MyStruct(kw1,kw2)

@btime (; x...)  # 1.750 μs (43 allocations: 2.27 KiB)  [on my machine]

(; x...)   # => (a=1,b=20,c=3)

Although I believe there is probably a better way to do it, building on my previous answer.

Now both splatting approaches are slow, ha.

@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
#    1.781 ms (58004 allocations: 2.76 MiB)
@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
#    816.800 μs (34010 allocations: 1.60 MiB)

The primary contributor to the runtime was union; this function has horrible performance here.

Changing Base.keys back to using merge instead:

Base.keys(x::MyStruct{T1A,<:Any,T2A,<:Any}) where {T1A,T2A} = keys(merge(x.a,x.b))

brings them back to their original times (still way too slow for kwargs):

@btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...) for i=1:1000] end;
    # 847.200 μs (22004 allocations: 1.07 MiB)
@btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
    # 11.000 μs (2 allocations: 62.55 KiB)

Admittedly, it’s new to me that optional arguments can be destructured as tuples. That’s really cool!

Could be promising to combine both ideas:

Base.iterate(x::MyStruct{T1A,<:Any,T2A,<:Any}, (allitems, i) = ((; x.a...,x.b...), 1)) where {T1A,T2A} = begin
    i > length(allitems) && return nothing
    (item, i_next) = iterate(allitems,i)
    (item, (allitems, i_next))
end

This runs fast in args splatting, but unfortunately when trying to splat into kwargs it throws an error (since this iterator simply produces values without keys, like args-splatting a NamedTuple).

I don’t know how to tell Julia “Hey, when kwargs-splatting this thing, do it the way you’d kwargs-splat a NamedTuple. In fact, here just take this nt::NamedTuple and kwargs-splat it!”
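The best workaround I can see so far is to materialize the merged NamedTuple explicitly before splatting, since kwargs-splatting a NamedTuple is already fast (sketch only; `as_namedtuple` is just an illustrative helper name, and it does build the merged tuple on every call):

```julia
# Sketch: materialize the merged NamedTuple up front and kwargs-splat that,
# since NamedTuple kwargs-splatting is already fast.
struct MyStruct{K1,V1,K2,V2}
    a::NamedTuple{K1,V1}
    b::NamedTuple{K2,V2}
end

as_namedtuple(x::MyStruct) = merge(x.a, x.b)   # `b` takes priority on key clashes

x = MyStruct((a=1, b=2), (b=20, c=3))
(; as_namedtuple(x)...)   # (a = 1, b = 20, c = 3)
```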

union is doing something very similar to what I commented on for Dict in your last issue.

It splats the arguments into a vector, which has to be an allocated bit of memory whose length is only known at runtime, and then splats that back into a tuple, whose length needs to be known at compile time.

Better to keep everything in tuples, use functional transformations of the tuples, and let it all happen at compile time. This compiles away:

function Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    reduce(K2; init=K1) do acc, k
        k in acc ? acc : (acc..., k)
    end
end

x = MyStruct((a=1, b=2), (b=3, c=4))
using BenchmarkTools
julia> @btime keys($x)
  1.750 ns (0 allocations: 0 bytes)
(:a, :b, :c)

And instead of using iterate at all, we can just define merge directly for MyStruct (and values, which is also useful):

x = MyStruct((a=1, b=2, d=4, e=5, f=6), (b=3, c=4))
function Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    reduce(K2; init=values(x.a)) do acc, k
        k in K1 ? acc : (acc..., getfield(x.b, k))
    end
end
function Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2}
    (; nt..., NamedTuple{keys(x)}(values(x))...)
end

@btime (; $x...)
  3.480 ns (0 allocations: 0 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)

Edit: and kwargs benchmarks now match args

julia> @btime let; [(MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
  13.198 μs (2 allocations: 31.30 KiB)
julia> @btime let; [(; MyStruct((a=rand(),b=rand()),(c=rand(),d=rand()))...,) for i=1:1000] end;
  13.183 μs (2 allocations: 31.30 KiB)
5 Likes

I didn’t know you could interpolate $x in @btime. I wish I knew about this sooner!

This solves my problems perfectly! Thank you! I never thought to make a custom merge function; this language really keeps surprising me to the upside.

2 Likes

yeah, this is a thing I dislike about Julia: many methods in Base return arrays when a generator would be better and just one collect away from the same result…

3 Likes

Also one of my main peeves. I think it was a simplification in early design to just default to using Array, but it seems like that was a mistake. map and union over tuples and named tuples should have always returned some kind of tuple.

3 Likes

This describes what merge does, or at least what the code says it should do. It calls merge_names, which splats a Tuple into a Vector and push!es into it several times, before splatting back into a Tuple (from namedtuple.jl):

@pure function merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end

Yet, this function runs lickety split:

julia> @btime Base.merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)

so apparently the compiler is doing something very intelligent.

So I tried exploring a little bit. I defined my own function my_merge_names like so:

function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end

Exactly the same, right? Let’s see its runtime:

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  248.529 ns (3 allocations: 224 bytes)
(:a, :b, :c, :d, :e, :f)

Yikes! What if we add that @pure macro call back?

Base.@pure function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    (names...,)
end

we get:

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)

boom pow ouchie wowchie fast. Now I must investigate what black magic this @pure thing is adding.

But if I make it return a vector instead of a tuple, I get poor performance again:

Base.@pure function my_merge_names_vec(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    names = Symbol[an...]
    for n in bn
        if !Base.sym_in(n, an)
            push!(names, n)
        end
    end
    names # <- all I did is remove a tuple splat
end


julia> @btime my_merge_names_vec((:a,:b,:c),(:d,:e,:f))
  242.534 ns (2 allocations: 160 bytes)
6-element Vector{Symbol}:
 :a
 :b
 :c
 :d
 :e
 :f

:thinking:

The @pure macro asserts function purity (it tells the compiler the function has no side effects and always returns the same result for the same inputs), which lets calls with constant arguments be evaluated entirely at compile time.

But it’s very pure, and using it in your own code is a disaster: you can no longer extend those methods, or use Revise on them, and many other things no-one understands except Jameson Nash. @assume_effects is the newer, slightly safer way of doing things like this; @generated functions are another way.

From my example you can see you don’t always need these macros. And if you don’t, don’t use them. Either this use of @pure in Base is unnecessary, or it used to be necessary and isn’t anymore, or NamedTuple merge is so important it can’t be left up to the whims of the compiler, which may stop compiling functional methods like mine away at a certain size of Tuple (that would be my guess).

Don’t play with the dark arts

(the vector is slow because you allocated memory to get it, tuples don’t allocate memory or even leave the registers unless they have to)
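A quick sketch of that difference, using @allocated (the helper names here are just for illustration):

```julia
# Sketch: the same three symbols as a tuple vs. as a Vector.
# The tuple can stay on the stack / in registers; the Vector is heap-allocated.
make_tuple() = (:a, :b, :c)
make_vector() = Symbol[:a, :b, :c]

make_tuple(); make_vector()     # warm up so compilation isn't measured

@allocated make_tuple()         # 0 bytes
@allocated make_vector()        # > 0 bytes (heap allocation)
```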

3 Likes

IMO it is a historical thing.
Many functions in Base are much older than generators being easy in Julia.

1 Like

Nice!

julia> Base.@assume_effects :foldable function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           names = Symbol[an...]
           for n ∈ bn
               n ∈ an || push!(names, n)
           end
           (names...,)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)

I need more practice figuring out how to translate a function like merge_names into a @generated form… the manual examples were basically just reimplementations of multiple dispatch.
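After some fiddling, here’s my rough attempt at a @generated sketch. The Val wrapping is just my own way to lift the names into the type domain (I’m not sure it’s idiomatic), and gen_merge_names is a made-up name:

```julia
# Rough sketch of a @generated merge_names. The name tuples must be type-level
# information for this to work, so they're wrapped in Val here. The merged
# tuple is computed while generating the body and spliced in as a constant.
@generated function gen_merge_names(::Val{an}, ::Val{bn}) where {an, bn}
    names = Symbol[an...]
    for n in bn
        n in an || push!(names, n)
    end
    return Expr(:tuple, map(QuoteNode, names)...)   # a literal tuple of Symbols
end

gen_merge_names(Val((:a, :b, :c)), Val((:b, :d)))   # (:a, :b, :c, :d)
```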

Yes absolutely! This is just as performant, and doesn’t rely on black magic:

julia> function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           names = reduce(bn; init=an) do acc, k
               k ∈ an ? acc : (acc..., k)
           end
           (names...,)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)

Trying a sugary generator approach…

julia> function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
           @nospecialize an bn
           (an..., (k for k ∈ bn if k ∉ an)...)
       end
my_merge_names (generic function with 1 method)

julia> @btime my_merge_names((:a,:b,:c),(:d,:e,:f))
  2.000 ns (0 allocations: 0 bytes)
(:a, :b, :c, :d, :e, :f)

Almost as good. I wonder where the extra 1ns is coming from…

Numbers like 2ns and 1ns generally indicate the whole thing is being pushed to compile time.
Or at least they used to.

I suggest looking at the @code_typed and @code_llvm output for them.

1 Like

Another version that runs in 1ns and does the job in one line:

function my_merge_names(an::Tuple{Vararg{Symbol}}, bn::Tuple{Vararg{Symbol}})
    @nospecialize an bn
    (an..., filter(k->k ∉ an, bn)...)
end

Oh how strange, I came back to revisit this and the generator stopped adding 1ns. I might be imagining things, but it did seem like a consistent 1ns adder over numerous @btime runs. Oh well, it looks like generators are fast!

Oh wow.

So trying this approach to make one-line solutions for the original problem, using a generator becomes hugely slow! What the heck! The same approach that was previously quite performant is now painfully slow.

Setup:

using BenchmarkTools
struct MyStruct{K1,V1,K2,V2}
    a::NamedTuple{K1,V1}
    b::NamedTuple{K2,V2}
end
x = MyStruct((a=1, b=2, d=4, e=5, f=6), (b=3, c=4))

First, making one-line solutions using map and filter to set a speed baseline:

Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (K1..., filter(k->k ∉ K1, K2)...)
Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (values(x.a)..., map(Base.Fix1(getfield, x.b), filter(k->k ∉ K1, K2))...)
Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (; nt..., NamedTuple{keys(x)}(values(x))...)

Runtime with map and filter:

julia> @btime keys($x)
  1.000 ns (0 allocations: 0 bytes)
(:a, :b, :d, :e, :f, :c)

julia> @btime values($x)
  2.300 ns (0 allocations: 0 bytes)
(1, 2, 4, 5, 6, 4)

julia> @btime (; $x...)
  2.600 ns (0 allocations: 0 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)

Nice and fast.

Okay, now trying one-line solutions using generators, code is almost identical:

Base.keys(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (K1..., (k for k ∈ K2 if k ∉ K1)...)
Base.values(x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (values(x.a)..., (getfield(x.b,k) for k ∈ K2 if k ∉ K1)...)
Base.merge(nt::NamedTuple, x::MyStruct{K1,<:Any,K2,<:Any}) where {K1,K2} = 
    (; nt..., NamedTuple{keys(x)}(values(x))...)

Runtime with Generators:

julia> @btime keys($x)
  309.045 ns (2 allocations: 96 bytes)
(:a, :b, :d, :e, :f, :c)

julia> @btime values($x)
  410.326 ns (5 allocations: 256 bytes)
(1, 2, 4, 5, 6, 4)

julia> @btime (; $x...)
  1.480 μs (11 allocations: 560 bytes)
(a = 1, b = 2, d = 4, e = 5, f = 6, c = 4)

Oof! Generators are slow af!

If I’m missing something, I don’t see it… it just seems like generators sometimes become slow without rhyme or reason.

Considering that generators are themselves just syntax sugar for map and filter over iterators, this shouldn’t be the case?

Generators are not slow! It’s the unknown-length splat into a tuple that is slow.

map and reduce over tuples use recursion, and at this complexity that compiles away completely. But a generator is an iterator, like a for loop, that the compiler isn’t unrolling for you. So the length is a run-time value, and all that time is type instability on the tuple length.

Essentially nothing in Julia is slow unless there is type instability; it’s LLVM doing compilation just like it would in C. When you see terrible times like this, assume type instability, and use @code_warntype, Cthulhu.@descend and ProfileView.@profview to work out what is causing it.

And also, for things like this you should just use recursion; there is a reason functional languages use it for everything.
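For example, here’s a hand-rolled recursive version of the key filtering (a sketch; filter_not_in is just an illustrative name, not something in Base):

```julia
# Sketch: recursive tuple filtering. Each call peels one element off the
# front, so the compiler can unroll the whole recursion and the result
# length is known at compile time -- no runtime splat of unknown length.
filter_not_in(::Tuple{}, exclude) = ()
filter_not_in(keep::Tuple, exclude) =
    first(keep) in exclude ?
        filter_not_in(Base.tail(keep), exclude) :
        (first(keep), filter_not_in(Base.tail(keep), exclude)...)

filter_not_in((:b, :c, :d), (:a, :b))   # (:c, :d)
```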

2 Likes

I guess my confusion arises because I misunderstood how generators operate: based on this comment I took them to be basically a sugary lazy interface to map and filter, and when writing my_merge_names with generators above, the result was compiled away and ran fast. Yet here in Base.keys(::MyStruct) the generator does exactly the same thing, yet allocates at runtime and runs slow. It seems like a challenge to figure out when or whether a generator will be unrolled by the compiler.

You’re confusing unrolling with constant propagation. If all the values are known at compile time, the compiler can just run the function and give you the answer instead of emitting other instructions. That seems to be what’s happening in your example that was fast.
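A sketch of the difference (the Ref and the function names are just for illustration; the Ref is one way to hide a value from the compiler):

```julia
# Sketch: the same generator-based merge, with constant vs. runtime inputs.
merge_names_gen(an, bn) = (an..., (k for k in bn if k ∉ an)...)

const HIDDEN = Ref((:a, :b, :c))   # hides the tuple behind a runtime load

fold() = merge_names_gen((:a, :b, :c), (:b, :d))   # constants: candidate for
                                                   # constant propagation
work() = merge_names_gen(HIDDEN[], (:b, :d))       # runtime value: must run

fold() == work()   # true -- same answer, very different compile-time story
```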