Creating a performant function using metaprogramming

elenev · September 3, 2021, 10:13pm

I have a function that constructs a NamedTuple out of a vector of numbers subject to some transformation, something like

function vec2tuple(x)
( a = x[1], b = 1 + exp(x[2]), c = 2 + exp(x[3]), d = exp(x[4]) / (1 + exp(x[4]))
end

This function does 2 things. It transforms individual elements of x an appropriate transformation and then assembles the results into a named tuple with keys a, b, c, and d.

vec2tuple runs very fast and doesn’t allocate. But the downside is code duplication, which gets tedious as the size of the tuple increases. Also, vec2tuple is kind of hard to read b/c it doesn’t make explicit the logic that informs the choice of the transformation. Ultimately, every xi in x will get transformed based on a known range of values each element of the NamedTuple can take. In the example above, a can take any value on the real line, b must be in (1,Inf), c must be in (2,Inf), and d must be in (0,1).

What I really want is to specify a set of rules in a legible way and then have these transformations be constructed accordingly. Here is an example for how to express these rules, but I’m not wedded to this.

bounds = (a = (-Inf,Inf), b = (1,Inf), c = (2,Inf), d = (0,1) )

I first decided to tackle this using functional programming. I wrote a function that returns anonymous functions to perform the transformation for any kind of open interval:

function get_transform_function(arg)
    lb, ub = arg
    if lb>=ub
        throw(ArgumentError("Upper bound must be strictly greater than lower bound"))
    end
    if lb==-Inf && ub==Inf
        return x -> x
    elseif lb!=Inf && ub==Inf
        return x -> lb + exp(x)
    elseif lb==-Inf && ub!=Inf
        return x -> ub-exp(-x)
    else
        return x -> lb + (ub-lb)*exp(x)/(1+exp(x))
    end
end

I then constructed rules for each of a,b, c, and d:

rules = map(x -> get_transform_function(x), bounds)

And wrote transform to call instead vec2tuple:

function transform(rules::NamedTuple{Names,<:Tuple}, vec::AbstractVector) where Names
    N = length(rules)
    return NamedTuple{Names}(ntuple(i -> rules[i](vec[i]), N))
end

Problem is, transform is about 20x slower than vec2tuple because it allocates a lot.

It seems like the solution should be in meta-programming. Something as performant as vec2tuple should be generated once based on the information in bounds before being called many times.

I tried rewriting get_transform_function so that it takes an additional argument for the value to be transformed and returns expressions instead of anonymous functions, and then construct a transformation as follows. Note that I’m passing an expression as an argument to get_transform_function so it gets interpolated as x[i], not any specific value.

names=keys(bounds)
args=[get_transform_function(:(x[$i]), v) for (i,v) in enumerate(bounds)]
tup=:($(Expr(:tuple, arguments...)))
transform=:(NamedTuple{$names}(  :($(Expr(:tuple, arguments...))) ))

But transform is an expression that needs to be evaluated for a given value of x when x is in scope. This sounds like it should be a macro that returns a vec2tuple(x)-style function, but I can’t quite figure out how to write this.

And furthermore, what if I don’t know bounds at compile-time? I can still know it before I need to call vec2tuple a whole bunch of time, so I don’t mind paying a price to construction this function once during runtime, as long as it’s fast when I call it.

Sorry for the long post and a less-than-specific question, but any ideas are really welcome! Thanks.

lmiq · September 3, 2021, 10:53pm

If the vector can be a static vector, or of you pass the size of the vector, N, as a type parameter, that version should be fast. I would totally opt for that (the key is: N must be known at compile time, which means at the moment you call the function).

MWE:

julia> f(x,y,::Val{N}) where N = ntuple(i -> x[i] => y[i], N)
f (generic function with 1 method)

julia> x = [:a,:b]; y = [1,2]; N=2;

julia> @btime f($x,$y,$(Val(N)))
  3.181 ns (0 allocations: 0 bytes)
(:a => 1, :b => 2)

julia> f(x,y,Val(N))
(:a => 1, :b => 2)

or

julia> using StaticArrays

julia> f(x,y::SVector{N}) where N = ntuple(i -> x[i] => y[i], N)
f (generic function with 1 method)

julia> x = [:a,:b,:c]; y = SVector{3,Float64}(1,2,3);

julia> @btime f($x,$y)
  3.493 ns (0 allocations: 0 bytes)
(:a => 1.0, :b => 2.0, :c => 3.0)

Per · September 5, 2021, 6:06am

I get about the same performance for both functions:

julia> using BenchmarkTools

julia> @btime vec2tuple([1.0,2.0,3.0,4.0]);
  61.417 ns (1 allocation: 112 bytes)

julia> @btime transform($rules, [1.0,2.0,3.0,4.0]);
  64.351 ns (1 allocation: 112 bytes)

(This is on Julia 1.6.2 on an M1 using rosetta.)

Edit: I also tried Julia 1.7.0-beta4 M1-native, which cut the times in half, but still no difference between the two functions.

lmiq · September 5, 2021, 10:19am

This is what you need IMO:

julia> function transform(rules::NamedTuple, vec::AbstractVector, ::Val{N}) where N
               return ntuple(i -> rules[i](vec[i]), N)
       end
transform (generic function with 1 method)

julia> @btime transform($rules, $[1.0,2.0,3.0,4.0], $(Val(4)))
  26.567 ns (0 allocations: 0 bytes)
(1.0, 8.38905609893065, 22.085536923187668, 0.9820137900379085)

elenev · September 10, 2021, 7:48pm

Sorry for the late response, guys. Crazy week and only now getting a chance to get back to this.

I can replicate your results for 4-element vectors. But something changes drastically between 10 and 11. Here’s a self-contained MWE:

function transform(rules::NamedTuple, vec::AbstractVector, ::Val{N}) where N
    return ntuple(i -> rules[i](vec[i]), N)
end

keynames = [Symbol(Char(n+'0')) for n=49:(49+25)] # creates :a, :b, :c, ... :z

N = 10
x = rand(N)
rules = NamedTuple{tuple(keynames[1:N]...)}(tuple([exp for i in 1:N]...))
@btime transform($rules,$x,$(Val(N)))

N = 11
x = rand(N)
rules = NamedTuple{tuple(keynames[1:N]...)}(tuple([exp for i in 1:N]...))
@btime transform($rules,$x,$(Val(N)))

Output is

  64.257 ns (0 allocations: 0 bytes)
  566.486 ns (13 allocations: 448 bytes)

Just to confirm the discontinuity at 11, I ran it for the whole alphabet:

times = Vector{Float64}(undef,26)
allocs = Vector{Float64}(undef,26)
memory = Vector{Float64}(undef,26)
for N=1:26
    x = rand(N)
    rules = NamedTuple{tuple(keynames[1:N]...)}(tuple([exp for i in 1:N]...))
    bres = @benchmark transform($rules,$x,$(Val(N)))
    times[N] = mean(bres.times)
    allocs[N] = mean(bres.allocs)
    memory[N] = mean(bres.memory)
end

Allocations:

Time:

Note that the discontinuity isn’t just in levels but also in slope (flat when N<=10, increasing linearly when N>10). Same thing happens if I make x an SVector and then dispatch on vec::SVector{N} where N. Of course, this degradation does not happen with vec2tuple.

In my application, I don’t expect N to be arbitrarily large, but values in the range of 11-30 are possible, so it would be nice to avoid this degradation. Sorry about using N=4 in my original example. It was just an example, so I didn’t realize your suggestions would be optimized for N<=10.

lmiq · September 10, 2021, 8:24pm

Take a look at this thread: Efficient recursive tuple construction

it is the same problem. At some point you just have to use standard arrays.

elenev · September 10, 2021, 8:35pm

Indeed. Just looked at the code for ntuple() and it’s just manually optimized for N=1,…10:
julia/ntuple.jl at master · JuliaLang/julia (github.com)

lmiq · September 10, 2021, 8:36pm

At least it is easy to optimize it to other sizes

elenev · September 10, 2021, 8:55pm

Actually, it looks like the manual optimization is just for the standard call ntuple(f,N). But you can avoid the N=11 cliff by calling ntuple(f,Val(N)) (just learned that you can do this). So basically just do propagate your suggested improvement for my function down to ntuple.

markmbaum · September 10, 2021, 10:06pm

That’s so funny. The cause of the performance dropoff is totally obvious if you take a second to look for it. I would never have known.