How to support passing temporary buffers around

Right now, if a user wants to re-use a pre-allocated buffer for sorting multiple datasets of similar size and type, they must dig into sorting internals (e.g. calling Sort.sort!(v, alg, ord), which bypasses the automatic ord construction from keyword arguments and the count-sort special case). Neither sort.(v::Vector{Vector{Int}}), sort!.(v::Vector{Vector{Int}}), nor even sort!(v::Matrix{Int}, dims=1) re-uses temporary buffers, likely because of how messy it is to pass them around right now.
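For concreteness, here is a sketch of what reaching into the internals looks like today. These are internal, unexported APIs, and the exact signatures vary across Julia versions:

```julia
# Internal API (not exported; subject to change between Julia versions):
import Base: Sort, Order

v = [3, 1, 2]
Sort.sort!(v, Sort.QuickSort, Order.Forward)  # bypasses keyword handling
```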

Can we do better in sorting and other cases where temporary storage is sometimes or always needed?

Can the compiler/GC/other best-effort internals typically detect this re-use and make it just as fast as manually passing the buffer?

Can we create a unified elegant system for passing temporary buffers in Base that package developers can use as a model?


If we do change the convention, I think it should be to indicate which mutating functions are allocating, not which are non-allocating. But I don’t think we should change the convention: the extra naming convention is probably unnecessary, and could be misleading. Folks should benchmark if they want to know how fast a function is; allocations may or may not be an issue depending on the use case.

That said, I still want the performance improvements from passing buffers around in Base.

Seems like good functionality to have. I can remember 2 distinct times when I would have wanted this. That being said, I think that GitHub - JuliaCollections/SortingAlgorithms.jl: extra sorting algorithms extending Julia's sorting API is the right place for it.

Were both times sorting related? Does anyone have experience wanting this in a domain other than sorting?

Yes, of course. Isn’t that specifically what your post is about? Re-using temporary buffers when doing frequent or repeated sorting.

Actually, it looks like GitHub - JuliaCollections/SortingAlgorithms.jl: extra sorting algorithms extending Julia's sorting API might already have this functionality? I’ve only just browsed the code, but it looks like the ts keyword argument to sort! might be doing just that?

A similar thing happens in linear algebra, where to have non-allocating functions one needs to call BLAS routines more directly, passing the auxiliary arrays.
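For comparison, the stdlib does expose a non-allocating path for matrix multiplication via mul!, which writes its result into a pre-allocated output array:

```julia
using LinearAlgebra

A = rand(4, 4)
B = rand(4, 4)
C = similar(A)   # pre-allocated output buffer

mul!(C, A, B)    # computes A*B in place into C, no result allocation
```

For operations like QR or eigen-decompositions, however, one still has to drop down to the LAPACK wrappers to control the work arrays.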


I’m also trying to address other similar cases (I edited the OP for clarity on this).

Yes, that is what I referred to in

Unfortunately, the ts doesn’t actually get carried all the way to the public facing interface.

I wish it were a widely used convention:
Break a function into an initialization procedure, which allocates the required data, and the function itself, which doesn’t allocate.
Then any repeated calls will be non-allocating.
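A minimal sketch of that convention (the function names here are hypothetical):

```julia
# Hypothetical init/compute split: allocate once up front, then the
# worker function itself performs no allocations.
init_sumsq(x) = similar(x)        # allocation happens here, once

function sumsq!(buf, x)           # repeated calls do not allocate
    @. buf = x^2
    return sum(buf)
end

x = [1.0, 2.0, 3.0]
buf = init_sumsq(x)
sumsq!(buf, x)   # 14.0
```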


For example, I often do

function prepare_f(x)
    y = similar(x)

    function g(x)
        # fancy operation uses y as pre-allocated storage
        return y
    end
    return g
end

Then I can repeatedly apply g very efficiently in some optimization etc.


Fun pattern! I wonder how good of a general solution it is (e.g. for use in Base)

Does this suffer from the issue that y's type cannot be inferred because the compiler can’t figure out that g doesn’t change y's type, and if so, what is the performance penalty? I suspect this could be a big deal if y is referenced a lot in g because y will have to be boxed.

(btw,

Should this be return x or return z? )


You’re right, such patterns can lead to problems!

For example:

function gen_f(x)
    if true
        y = x .+ 1
    end 

    function f(a)
        a + y 
    end 
    return f
end

function gen_f_safe(x)
    if true
        y = x .+ 1
    end 

    f = let y = y 
        function f(a, y=y)
            a + y 
        end 
    end 
    return f
end

REPL:

julia> f = gen_f(10)
f (generic function with 1 method)

julia> @code_warntype f(1)
Variables
  #self#::typeof(f)
  a::Int64
  y::Union{}

Body::Any
1 ─ %1 = Core.getfield(#self#, :y)::Core.Box
│   %2 = Core.isdefined(%1, :contents)::Bool
└──      goto #3 if not %2
2 ─      goto #4
3 ─      Core.NewvarNode(:(y))
└──      y
4 ┄ %7 = Core.getfield(%1, :contents)::Any
│   %8 = (a + %7)::Any
└──      return %8

julia> f_safe = gen_f_safe(10)
(::var"#f#2"{Int64}) (generic function with 2 methods)

julia> @code_warntype f_safe(1)
Variables
  #self#::var"#f#2"{Int64}
  a::Int64

Body::Int64
1 ─ %1 = Core.getfield(#self#, Symbol("#38#y"))::Int64
│   %2 = (#self#)(a, %1)::Int64
└──      return %2

The cool thing is that you can also change the buffer (if you want to) in f_safe.

For me the big advantage is that the user does not need to know how to create the buffer.
Sometimes the buffer is non-obvious: think about operations using broadcasting or rfft, where the buffer has a different shape or eltype. On the other hand, f_safe might allocate a lot of memory.

By this I’m assuming you mean “own”, i.e. as long as we keep f_safe around, the memory will remain in use.

If there are two different functions f and g, both of which mutate an array but require temporary storage similar to their input, and I want to compute f(g(x)) for two different x of the same type and size, then naively I’ll allocate 4 times; with gen_f_safe or gen_f I’ll still allocate twice; manual temporary-storage passing can drop allocations to 1.

Perhaps some sort of compiler pass or macro @reuse_temporary could convert

temp1 = similar(x)
f(x, temp=temp1)
# temp1 is never used again
temp2 = similar(x)
g(x, temp=temp2)

to

temp1 = similar(x)
f(x, temp=temp1)
temp2 = temp1 # or, as necessary, reinterpret(temp1) and maybe even resize!ing?
g(x, temp=temp2)

Which would let me use Base’s existing sorting methods without this problem:

@time sort!(rand(1000,1000), dims=1; alg=QuickSort); #2k allocs
@time sort!(rand(1000,1000), dims=1; alg=MergeSort); #5k allocs b.c. MergeSort takes temporary array

Couldn’t we just add a keyword argument, with some conventional name, to whichever functions need it: a Vector{UInt8} which the function is free to resize!, reshape, and reinterpret however it needs?

The caller can then preallocate an empty such vector and pass it in to repeated calls without needing to know how it will be used.
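A minimal sketch of what such a convention could look like (the function and keyword names here are hypothetical):

```julia
# Hypothetical `workspace` convention: a Vector{UInt8} that the callee may
# resize! and reinterpret as needed, so one buffer serves many eltypes.
function double_via_scratch!(v::Vector{T}; workspace::Vector{UInt8} = UInt8[]) where T
    nbytes = length(v) * sizeof(T)
    length(workspace) < nbytes && resize!(workspace, nbytes)
    tmp = reinterpret(T, view(workspace, 1:nbytes))
    copyto!(tmp, v)        # use tmp as scratch space
    @. v = tmp + tmp
    return v
end

ws = UInt8[]
double_via_scratch!([1, 2, 3]; workspace=ws)     # grows ws as needed
double_via_scratch!([1.5, 2.5]; workspace=ws)    # re-uses the same bytes
```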


Hi @Lilith

In the Gridap project we have ended with a systematic way of allocating buffers for function evaluation and also for array indexing. Take a look at this preprint for all specific details: [2109.12818] The software design of Gridap: a Finite Element package based on the Julia JIT compiler

For calling functions (or any callable object) we introduced this syntax:

cache = return_cache(f,x...)
fx = evaluate!(cache,f,x...)

for a callable object f on the arguments in x.

By default we define

return_cache(f,args...) = nothing
evaluate!(::Nothing,f,args...) = f(args...)

So, any callable object can be called via the extended syntax, and by default we do not use the cache and simply call the function on the given arguments as usual. However, we have opened the door to reusing data at each function call by specialising the functions return_cache and evaluate! for particular cases.

For instance, consider this function f(x,y) = [x,y], just as a very simple example. If you don’t want to generate a new vector at each function call, you can define:

function return_cache(::typeof(f), x, y)
    [x, y]
end
function evaluate!(cache, ::typeof(f), x, y)
    cache[1] = x
    cache[2] = y
    cache
end

and then you will get:

cache = return_cache(f,1,2) # New vector created here
fx = evaluate!(cache,f,10,20) # No vector created here
@assert fx == [10,20]
fx = evaluate!(cache,f,100,200) # No vector created here
@assert fx == [100,200]

This is essentially the idea. Note that this is thread safe since you can generate an independent buffer for each thread by calling return_cache. Note also that one needs to be aware that the result fx potentially takes ownership of some part of the state of the cache. So, one might need to generate several caches e.g.

c1 = return_cache(f,1,2)
c2 = return_cache(f,1,2)
for ((x1,y1),(x2,y2)) in zip([(1,2),(10,20)],[(10,20),(100,200)] )
  fx1 = evaluate!(c1,f,x1,y1)
  fx2 = evaluate!(c2,f,x2,y2)
  @assert fx2 == 10 * fx1
end

If we consider just a single cache, then fx1 and fx2 would be the same object.

This mechanism is pretty useful for our use cases, and we are happy with it after having used it for a while.


@cjdoris I really like your idea! I think this could be a good convention. The name used in sorting internals right now is ts, but I think it might need to change if it becomes public-facing. workspace, temp, scratch_space, and scratch all seem viable.

Perhaps resizing down should be discouraged in order to better support this:

temp = UInt8[]
sort!(big, temp=temp)
for small in smalls
    sort!(small, temp=temp)
end
sort!(big2, temp=temp)

Hi @fverdugo,

What advantages does your method present over @cjdoris’s? As I see it, cjdoris’s Vector{UInt8}s can also be shared between any two functions that follow the convention, while yours only allows sharing between functions that have the same cache type. From a user perspective, I think this

function sort_two_vectors(vec1::T, vec2::T) where T <: AbstractVector
    cache = return_cache(sort!, vec1, rev=true, alg=MergeSort)
    evaluate!(cache, sort!, vec1, rev=true, alg=MergeSort)
    evaluate!(cache, sort!, vec2, rev=true, alg=MergeSort)
end

involves more boilerplate and more new functions, and is harder to understand and even less powerful than

function sort_two_vectors(vec1, vec2)
    workspace = UInt8[]
    sort!(vec1, rev=true, alg=MergeSort, workspace=workspace)
    sort!(vec2, rev=true, alg=MergeSort, workspace=workspace)
end

I prefer to guarantee not to do this. We have mutation and the ! convention to support returning something that owns part of a function’s input, and I think temporary workspaces can exist separately from that. In your toy example, we can instead define:

function f!(cache, a, b)
  cache[1] = a
  cache[2] = b
  cache
end

and then you will get

cache = f(1,2) # New vector created here
fx = f!(cache,10,20) # No vector created here
@assert fx == [10,20]
fx = f!(cache,100,200) # No vector created here
@assert fx == [100,200]

I think the key part is return_cache. It automatically generates the buffer for a given function f, and this works for any function f without knowing what that function is. In general the buffer can be far more complex than a vector of ints, and each function can work with a different type of buffer.

You’re mixing the terms “buffer” and “cache”. What exactly do you mean? I’m trying to talk about scratch space in memory that a function needs to read and write in order to compute its answer.

This sort of scratch space is used in Julia’s sort and sort! and in BLAS routines. It is also referred to as “auxiliary storage”.

Do you have an example of where this type of storage couldn’t be reinterpreted from a Vector{UInt8} and where functions would have to use mutually incompatible types of auxiliary storage?