[RFC/ANN] Restacker.jl: A workaround for the heap-allocated-immutable problem?

I generalized and packaged up a workaround I’ve been using in Transducers.jl (see below for an explanation) to avoid using heap-allocated immutables when they contain some heap-allocated objects. It’s not registered yet, but I wonder if it is a reasonable workaround:

  • Is there already a (packaged) solution like this?
  • Is it a “safe” solution? (Other than the usual caveat of using @generated and lowered IR.)

I’ll register this package if there is no package out there that already does this and if this approach is not completely broken.

Quoting the README:


In Julia (as of 1.4) immutable objects containing heap-allocated objects may not be stack-allocated sometimes⁽¹⁾ and that’s why using something like view can degrade performance substantially. Restacker.jl provides an API

immutable_object = restack(immutable_object)

to put immutable_object in the stack and avoid this performance pitfall.

⁽¹⁾ It seems that this tends to happen when such an object crosses non-inlined function call boundaries. See also this and this discussions in Discourse and also this old PR JuliaLang/julia#18632.

Example

Consider simple computation kernel

@noinline function f!(ys, xs)
    @inbounds for i in eachindex(ys, xs)
        x = xs[i]
        if -0.5 < x < 0.5
            ys[i] = 2x
        end
    end
end

This works great with raw-Array but the performance with viewed array is not great:

julia> using BenchmarkTools

julia> xs = randn(10_000);

julia> @benchmark f!($(zero(xs)), $xs)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.989 μs (0.00% GC)
  median time:      2.033 μs (0.00% GC)
  mean time:        2.189 μs (0.00% GC)
  maximum time:     6.785 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark f!($(view(zero(xs), :)), $(view(xs, :)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.223 μs (0.00% GC)
  median time:      49.227 μs (0.00% GC)
  mean time:        51.072 μs (0.00% GC)
  maximum time:     133.803 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It turned out that restacking the destination array ys is enough to fix the problem in f! above:

using Restacker

@noinline function g!(ys, xs)
    ys = restack(ys)
    @inbounds for i in eachindex(ys, xs)
        x = xs[i]
        if -0.5 < x < 0.5
            ys[i] = 2x
        end
    end
end

Calling this function on view is now as fast as the raw-Vector version:

julia> @benchmark g!($(view(zero(xs), :)), $(view(xs, :)))
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.021 μs (0.00% GC)
  median time:      2.097 μs (0.00% GC)
  mean time:        2.265 μs (0.00% GC)
  maximum time:     6.663 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Notice the slight increase in the memory consumption. This is because restack re-creates the object in the stack.

See more examples in benchmark/ directory.

How it works

Consider an immutable type:

struct ABC{A,B,C}
    a::A
    b::B
    c::C
end

Then

abc = restack(abc)

is equivalent to

abc = ABC(
    restack(abc.a),
    restack(abc.b),
    restack(abc.c),
)

For mutable object like x :: Array, restack return the input as-is.

In general, restack is an identity function such that

restack(x) === x

Notice the triple-equality ===. It means that restack does not change the behavior of the program while it may benefit run-time performance by sacrificing the memory consumption (slightly) and compile-time.

(Side notes: There is an even more experimental function Restacker.unsafe_restack to re-construct mutable struct as well. This is unsafe because it breaks the identity (===) and breaks the assumption of the code relying on finalize.)

Under the hood, restack on struct types work by directly invoking the new expression. This skips evaluating user-defined constructors and minimizes the run-time overhead.

16 Likes

x-ref more analysis by Valentin on the example above:

https://github.com/JuliaLang/julia/issues/34847

3 Likes

may not be stack-allocated sometimes

To clarify, it’s not that wrappers sometimes can’t be put on the stack; it’s that when when the wrapper doesn’t “escape” (all uses of the wrapper are local to the fully-inlined function), the compiler typically elides creation of the wrapper. That is, it performs all the operations that are written in terms of wrapper operations instead directly on the parent object being wrapped. There’s more information about this here: performance - Unexpected memory allocation when using array views (julia) - Stack Overflow.

The performance hit you noticed goes way beyond the heap-allocated-immutable problem, because it’s interfering with the ability of the compiler to do proper alias analysis. The slowdown is mostly due to the fact that it’s generating worse code, not that it’s allocating a little blob of memory.

I’ll register this package if there is no package out there that already does this and if this approach is not completely broken.

Speaking personally, I don’t think we want Restack.jl as part of our coding lifestyle, because a “do-nothing” package that magically helps performance seems like code-smell. But as a quite surprising discovery prompting an analysis that may improve the compiler (Missing TBAA information on view · Issue #34847 · JuliaLang/julia · GitHub) this is great! Nice detective work by you & Valentin. How on earth did you discover that this helped?

Of course, if this is something that will take a couple of years to fix, then it might be worth registering Restack.jl. But I think it would be much better for the long term to just wait a bit and see whether the fix comes relatively quickly.

17 Likes

I strongly agree with this statement. I was personally even more surprised by https://github.com/tkf/Restacker.jl/blob/master/benchmark/bench_unique.jl where the restack on the return variable allows us to ellide an allocation in the loop (probably because it is no longer escaping the function).
Personally I want to see every case where restack works analyzed and understood so that the Julia compiler gets better and we don’t need “magic” compiler annotations. I have the same feeling about @avx and @simd (the latter is useful due to the semantics of reductions). I am happy to attempt to triage these things and at least give an educated guess on examples where this helps.

13 Likes

I’d personally be curious to how this relates to UnsafeArrays.jl when working with views. From a very high level point of view they both seem to be eliding the allocations when you use views, but would they lead to different speed ups?

I should benchmark…

1 Like

UnsafeArrays tackles the same problem - heap-allocated views (and similar structures containing with array references), but in a different way: UnsafeArrays replaces standard Arrays by pointer-based UnsafeArrays (which are bitstypes and stack-allocated), while protecting the original array from GC. All views, etc., on UnsafeArrays are automatically stack allocated as well. This is very efficient, esp. with highly muli-threaded code, but also somewhat dangerous, because it is the user’s responsibility to ensure that no UnsafeArrays escapes it’s @uviews environment (which protects the original Array from GC).

I don’t have a mechanism in UnsafeArrays that rewrites existing structs (like _restack) does (but I’ve been thinking about it). Is is possible to add @uviews support to a custom struct by implementing UnsafeArrays.unsafe_uview, though (ArraysOfArrays implements this interface, for example). Maybe Restacker could offer a similar interface for types that are too complex for the automatic _restack, and need custom code?

Long-term I hope both UnsafeArrays and Restacker (nice approach) will just be stop-gaps, of course, until Julia supports stack-allocation of immutables that contain heap-references. As core-counts continue to increase, that will become more and more important. I believe there’s been some progress with this, recently, is that correct?

2 Likes

I totally agree with this. It is exactly why I initially called this function darkritual; I am sacrificing my sanity to get some performance improvements. I think this is an unfortunate reality that you sometimes need this kind of do-nothing “nudge” to the compiler (e.g., FastClosures.jl). I do want to throw away this kind of hacks as soon as possible . But if I can write “zero-cost” abstractions with a little hack like this, I think the situation is not so horrible.

By the way, fortunately for us, there is an open PR to fix (some of?) the problem:

So let’s hope that we can forget about it pretty soon.

As @vchuravy mentioned above, there is a (possibly) different class of problem “solvable” by restack. Here, just re-creating Tuple{Set{Int}, Vector{Int}} at the end of the function somehow helps the compiler generating a better code. It seems this case is more related to your post on StackOverflow?

Thanks, that’s a good point. I think UnsafeArrays.jl is a more direct solution if you know you are working on arrays. Restacker.jl is more general as it works with any objects (e.g., there is an example with Set). But, due to this generality, I imagine it is likely less powerful than UnsafeArrays.jl in some situations.

5 Likes

Does https://github.com/JuliaLang/julia/pull/42465 solve the problems listed in this thread?


Edit: @foobar_lv2 are your issues from Immutables with reference-fields: Why boxed? - #19 by foobar_lv2 solved?

As far as I understand, this was already resolved in Julia v1.5. The NEWS entry cites #34126 #33886 as relevant to the improvement.

You can always check it yourself to see if it still matters. The benchmark unique is fixed by Julia 1.5. But benchmark filter_map still shows the difference as of current Julia master:

julia> VERSION
v"1.8.0-DEV.1429"

julia> include("benchmark/benchmarks.jl")

julia> result = run(SUITE; verbose = true)
...
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "norestack" => Trial(4.380 μs)
          "restack" => Trial(4.440 μs)
  "filter_map" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "norestack" => Trial(42.820 μs)
          "restack" => Trial(5.839 μs)

Edit: As a slightly different and more “real world” example, I revisited this recently and the benchmark in https://github.com/JuliaFolds/Transducers.jl/pull/504 shows that the previous observation https://github.com/JuliaFolds/Transducers.jl/pull/459 still holds.

2 Likes

I don’t know anything about this or the vocabulary, so I want to ask how restack is improving the performance of filter_map.

Are some immutables being heap-allocated in the norestack version? I know immutables are typically stack/inline-allocated since v1.5, but I never completely ruled out the compiler deciding to heap-allocate in extreme cases, is this what’s happening?

What is a pointer-free type, is that like an isbitstype?

If possible, try to dumb it down a lot for me, I really don’t know anything about the compiler or how it can be tweaked.

For further analysis please see:

https://github.com/JuliaLang/julia/issues/34847

2 Likes