Allocation-free type hiding?

pablosanjose · August 18, 2021, 3:47pm

I’m trying to reduce latency in a package, and was intrigued by the mention of “type hiding” in the Dataframes.jl Juliacon talk by @bkamins (State of DataFrames.jl - YouTube).

As I understand it, type hiding consists of wrapping a variable that can have a wide range of types inside a struct without any free parameters, so that its type is only dealt with deep inside a function call tree, when it is unwrapped. This avoids unnecessary specialization and inference along the way.

Problem: I cannot make this kind of approach allocation-free. Here is an absolutely minimal example:

julia> struct S
           f
       end

julia> test(s::S, i) = (s.f() < i) === true;

julia> x = S(randn);

julia> using BenchmarkTools; @btime test($x, 0);
  32.438 ns (1 allocation: 16 bytes)

Note that, despite f being of type Any, test always returns a Bool, so any non-inferable types are confined within test, which unwraps the problematic f::Any field. The goal is to make test allocation-free, in spite of this. That single allocation might seem irrelevant, but it isn’t if you put this test in a hot loop.

If I use the standard approach of fully parametrized types (i.e. no “type-hiding”), I can get rid of the allocation, of course

julia> struct S{F}
           f::F
       end

julia> test(s::S, i) = (s.f() < i) === true;

julia> x = S(randn);

julia> using BenchmarkTools; @btime test($x, 0);
  5.611 ns (0 allocations: 0 bytes)

Question: is there any way to make this kind of type-hiding trick zero-cost in terms of allocations and runtime?

Constraints: I’d like to avoid Base.@pure if possible, as I can make no assumption on the field f. Also, the field f might be of any type, not just Function

Thanks for any help!

pablosanjose · August 18, 2021, 4:17pm

I need to look for a better MWE. This one is fixed easily by putting an extra function barrier (but this doesn’t work in my real-world problem)… The compiler is difficult to predict!

julia> struct S
           f
       end

julia> test(s::S) = _test(s.f);

julia> _test(f::Function) = f() < 0 === true;

julia> x = S(randn);

julia> using BenchmarkTools; @btime test($x)
  12.873 ns (0 allocations: 0 bytes)

pablosanjose · August 18, 2021, 7:06pm

Ok, second try. This MWE is more similar to my real code, and indeed it seems to allocate no matter what I try. It is also much slower than the parametrized S{F} version

julia> struct S
           f
       end

julia> test(s::S, r) = _test(s.f, r);

julia> _test(f::Function, r) = f(r) === true;

julia> x = S(r -> r < 0.5);

julia> function run(x::S)
           z = 0
           for _ in 1:1e5
               r = rand()
               z += test(x, r)
           end
           return z
       end
run (generic function with 1 method)

julia> using BenchmarkTools; @btime run($x);
  1.712 ms (100000 allocations: 1.53 MiB)

jling · August 18, 2021, 7:13pm

I feel like the point of hiding is to trade latency with some boxing cost, so this seems, reasonable? to my eyes at least

pablosanjose · August 18, 2021, 7:28pm

Uhm, I was hoping to have my cake and eat it too…

I guess that I don’t fully understand why we need an allocation. I guess it has to do with the need to store f in the heap when you don’t know its type, or something. So perhaps, as you say @jling, this is unavoidable with the above approach?

My ultimate goal is to be able to precompile test to reduce latency (well, a rather larger test function of my real code that involves complicated operations that don’t rely on the type of f until they hit _test). But I cannot do it if S is parametric, only if it is parameter-free. Maybe there is an altogether different approach to do this?

jling · August 18, 2021, 7:35pm

hm, again, I *think the approach outlined here is the opposite, hiding type means to reduce latency by *not compiling for any types (because by construction you cannot / not worth it)

pablosanjose · August 18, 2021, 7:41pm

But indeed, that’s how I understand it. Imagine you have a complicated function

function test(s::S)
    # lots of code that does not rely on the specific s.f::Any
    # and generates `data`  <-- worthwhile to precompile
    result = 0
    for i in data
        result += _test(s.f, i)
        # _test cannot be precompiled because it takes any s.f, 
        # but it is tiny and always returns a `Bool`
    end
    return result
end

In the above case, test(::S) is fully inferable, and precompile(test, (S,)) goes a long way to reduce your latency. The problem is that the _test allocation is making your loop slow…

jling · August 18, 2021, 7:45pm

this is why you shouldn’t loop over rows of dataframe without wrapper types (like using DataFrameMeta). In DataFrame’s case it’s the opposite, the underlying data is not typed because it’s not suitable for general use case (many many columns). To overcome this if you want tight loop over rows, you put a wrapper type for the “row” as a way to add information given to compiler.

Your underlying data can be typed (by adding parameter to S), but to reduce latency, you chose to hide it. If loop performance is >>> latency, just don’t do it, latency will be overshadowed by one time compilation anyway.

If you want fast time-to-first-iteration AND fast loop, maybe https://github.com/tisztamo/Catwalk.jl but I honestly don’t know if it works in this case

pablosanjose · August 18, 2021, 7:48pm

Ok, so you say that when looping over untyped rows in DataFrames, you indeed wrap the data as above, and pay the price of boxing? In other words, you’re suggesting it’s a matter of choosing between latency and runtime performance, yes?

Looking at the amazing performance of DataFrames I thought there was some clever trick to this…

jling · August 18, 2021, 7:51pm

ok I lied, the way dataframemeta works is to transform (using macro) row loop into indexing column, which gives you type stability because each column is typed:

for r in eachrow(df)
   r.x
end
# becomes
for i in 1:size(df, 1)
   df.x[1]
end

but yes, indeed, this could even be the main theme of Julia maybe, in terms of optimizing “user experience”

pdeffebach · August 18, 2021, 8:41pm

This is close, but not quite. The above for-loop will still be slow because Julia won’t know what type df.x[i] is. Since it won’t know what type df.x is.

DataFramesMeta.jl gets around this problem by using an anonymous function that accepts wohle columns. This function barrier let’s Julia make optimal code.

@eachrow! df begin 
    :a = :b + :c
end

becomes

function _t(a, b, c)
    for i in 1:length(a)
        a[i] = b[i] + c[i]
    end
end
_t(df.a, df.b, df.c)

pablosanjose · August 18, 2021, 10:49pm

Thanks @pdeffebach . However this kind of strategy is analogous to the _test barrier, which doesn’t solve my problem.

The following is promising, though:

julia> import FunctionWrappers

julia> import FunctionWrappers: FunctionWrapper

julia> struct S
           f::FunctionWrapper{Bool,Tuple{Float64}}
       end

julia> test(s::S, r::Float64) = s.f(r)
test (generic function with 1 method)

julia> x = S(FunctionWrapper{Bool,Tuple{Float64}}(r -> r < 0.5));

julia> function run(s::S)
           z = 0
           for _ in 1:1e5
               r = rand()
               z += test(s, r)
           end
           return z
       end
run (generic function with 1 method)

julia> using BenchmarkTools; @btime run($x)
  601.449 μs (0 allocations: 0 bytes)

Maybe the trick is to wrap everything inside a test::FunctionWrapper…

pdeffebach · August 18, 2021, 11:12pm

I don’t quite understand what problem you are trying to solve, actually.

The function wrapper strategy used in DataFrames.jl is good for when you have lots of different calls with different types and you don’t want each new call to take too long.

In a hot loop like what you are doing, the function will only ever get compiled once in the entire execution of the loop. It’s not like you are giving the function many different heterogeneous inputs.

Also, the DataFrames.jl function-wrapping strategy only solves a very specific problem: compile times for new calls. It doesn’t improve performance at all. Are you having a problem with large compile times for every new time you run a function?

Also, the strategy in @eachrow! is about inference. Because data frames do not have typed columns, Julia can’t know the type of the columns. By making an anonymous function, I actually do the opposite of the function wrapping strategy: I force more compilation on types and do it earlier. It can actually get pretty annoying! That’s why a ton of development effort has been put into not constructing anonymous functions whenever possible.

pablosanjose · August 18, 2021, 11:42pm

Ok, let me try to explain my problem more clearly

The user provides an input user_input, whose type is not known. In particular user_input can be a Function. Since each function is of a different type in Julia there is an infinite number of possible input types
The package performs a computation using the input, let’s call it build_results!(results, user_input). Think if it as something like

function build_results!(results::Vector{Float64}, user_input)
	# some complex code with non-trivial inference
	# somewhere inside this code there is this loop
	for r in vector
		test(user_input, r) && push!(results, r)
	end
	return results
end

The function test(user_input, r) is known to return a Bool, so inference of build_results! is independent of user_input
Now, on the first call to build_results! we have substantial latency due to inference. We wish to reduce it. But we cannot do precompile(build_results!, (Vector{Float64}, Any)), because we don’t know the type of user_input::Any, even if it should not be required to infer types in build_results!
The type-hiding trick consists of wrapping user_input in a wrapper type

struct Wrapper
	user_input
end

so that we now have

function build_results!(results::Vector{Float64}, wrapped_user_input::Wrapper)
	# some complex code with non-trivial inference
	# somewhere inside this code there is this loop
	for r in vector
		test(wrapped_user_input, r) && push!(results, r)
	end
	return results
end

test(x::Wrapper) = _test(x.user_input)
_test(user_input) = #actual test, returns Bool

With the above we can do precompile(build_results!, (Vector{Float64}, Wrapper)), and eliminate most of the latency. However, in doing so we are forced to have a slower runtime, because the unwrapping test(x::Wrapper) = _test(x.user_input) allocates no matter what
The FunctionWrappers.jl approach in the above post could (perhaps?) be made to work by defining test as a FunctionWrapper{Bool, Tuple{Float64}}(user_input) that wraps the user input (assuming it is a Function). In that case we could maybe do

function build_results!(results::Vector{Float64}, test::FunctionWrapper)
	# some complex code with non-trivial inference
	# somewhere inside this code there is this loop
	for r in vector
		test(r) && push!(results, r)
	end
	return results
end

and then happily eliminate latency with precompile(build_results!, (Vector{Float64}, FunctionWrapper{Bool, Tuple{Float64}})) without paying the runtime price.

I still have to experiment with this possibility, but it’s quite late now. Cheers!

bkamins · August 19, 2021, 5:30am

Hi,

Coming in late, but a brief comment is that “type hiding” is a way to trade off

“having to recompile whole function chain with every new type”
vs
“having to compile almost everything once and a very small function for every new type” at the cost of boxing and potentially dynamic dispatch (I write potentially as it depends on with how many types you actually invoke the function - Julia compiler has an adaptive rule here)

So essentially when you have code e.g.

function f(var)
    # a lot of preprocessing
    # core operation that are expensive
end

you change it into code

f1(@nospecialize var) = f2(wrap(var))

function f2(wrapped_var)
    # a lot of preprocessing
    f3(unwrap(wrapped_var))
end

function f3(var)
    # core operation that are expensive
end

And in this way whenever you get a new type of var:

the original code had to recompile whole f each time
the changed code compiles f1 and f2 only once, but f3 each time it hits a new type, but since f3 is small it will compile fast; the cost is that you have to call wrap and unwrap and calling f3 will be with dynamic dispatch

Now regarding eachrow(df) in DataFrames.jl. This is a type unstable iterator and will be slow. If you need a type stable iterator use Tables.namedtupleiterator(df[!, columns_you_are_going_to_use]) and use a function barrier (this call is kind of unwrap operation in my example above and df is a wrapped container var), The thing is that eachrow will be compiled once and fast, and with Tables.namedtupleiterator and function barrier you get a compilation each time you call it with new columns and the compilation cost gets up with the number of columns you pass.

Topic		Replies	Views
Performance problems when dealing with deliberately type-unstable code. How to use type knowledge better? Performance	33	2069	December 30, 2020
Mysterious memory allocation Performance	16	851	January 24, 2023
Elimination of (unnecessary) runtime dispatch allocations Internals & Design box , allocations , runtime-dispatch	13	1135	May 5, 2024
Function inside struct allocates when referenced New to Julia function	25	1044	November 7, 2023
Extra allocation with `T::DataType`? Performance	17	583	August 19, 2022

Allocation-free type hiding?

Related topics