Using generated functions to preallocate memory?

This feels like a terrible hack:

@noinline DimensionMisMatchError(M,N,P) = throw("Dimensions $M, $N, and $P are not equal.")
@generated function preallocated_quadform(A::AbstractMatrix{T}, x::AbstractVector{T}) where {T}
    y = Vector{T}(undef,0)
    quote
        M, N = size(A)
        P = length(x)
        M == N == P || DimensionMisMatchError(M,N,P)
        y = $y
        resize!(y, N)
        mul!(y, A, x)
        y' * x
    end
end

function quadform(A, x)
    M, N = size(A)
    P = length(x)
    M == N == P || DimensionMisMatchError(M,N,P)
    y = A * x
    y' * x
end

…but, is it safe?

Is the memory allocated in the generated function protected from the garbage collector, until the method gets overwritten?

It still works after calling GC.gc(), so the memory appears to be properly protected. But as far as I recall, I haven’t seen anyone doing this before, and it seems like such a convenient technique to allocate memory only when needed.
Although, you wouldn’t want to have a lot of methods with large arrays just eating up your memory.

An example application would be to allocate the blocks in JuliaBLAS, although it currently uses the approach of defining global constants and unsafe_wrapping them.

One problem I imagine would be that this is not threadsafe.

This is an interesting question — the manual does not talk about locally generated values, so they are not explicitly forbidden. None of the conditions are explicitly violated.

That said, I don’t think it is an improvement over over using const globals. It is undefined whether a new Vector{T} is allocated for each function call, or just once (I think the compiler is free to do either).

It would be interesting to check what happens when this function is called from different threads. Maybe such a function is not thread safe.

I am wondering about the actual advantages over allocating a temporary array. That is, how much do you save? The temp is short-lived, so one would naively hope that generational gc manages to reclaim the alloc quickly. Have you tested?

Otherwise, I’d guess an init function (because precompilation) that mallocs the buffer would be an alternative. If your blocks are reasonably large, and you want thread-safety, then you could also mmap the buffer during initialization (one buffer for every thread). This way, you limit memory consumption: The buffers only eat a handful of bytes in the kernel until they are actually used and the kernel faults them in; and if your function is never called from other threads, then you only fault in a single buffer.

By using a julia-allocated undef-array you rely on array.c and libc heuristics/thresholds for whether the memory consumption is lazy (good: almost free until faulted in) or eager (bad: julia/libc might decide during initialization/compilation that there is a juicy spot of already faulted-in memory, and then your function never gets called at runtime and you wasted all this sweet memory; terrible to reproduce, because dependent on operating system version and load/init/compile/compute order). Additionally this gives you full control over offsets (annoying false sharing; reproducible offsets make this reproducible) and you can share buffers between Float32 and ComplexF64 operations.

PS. The invocations could be

julia> fd=ccall(:memfd_create, Csize_t, (Cstring, Cuint), "foo", 0)
0x0000000000000011
julia> ccall(:ftruncate, Cint, (Cint, Csize_t), fd, 1000)
0
julia> _handle=Base.OS_HANDLE(fd)
RawFD(0x00000011)
julia> _io=open(_handle)
IOStream(<fd 17>)
julia> a1=Mmap.mmap(_io, Matrix{Int}, (4, 4)); a2 = Mmap.mmap(_io, Matrix{Float64}, (4, 4));
julia> a1[1]=1; a2[2]=3.0; @show pointer(a1), pointer(a2);
(pointer(a1), pointer(a2)) = (Ptr{Int64} @0x00007f4581673000, Ptr{Float64} @0x00007f4581672000)

julia> a1
4×4 Array{Int64,2}:
                   1  0  0  0
 4613937818241073152  0  0  0
                   0  0  0  0
                   0  0  0  0

julia> a2
4×4 Array{Float64,2}:
 4.94066e-324  0.0  0.0  0.0
 3.0           0.0  0.0  0.0
 0.0           0.0  0.0  0.0
 0.0           0.0  0.0  0.0

Alternatively, one can use a single mapping and use unsafe_wrap. The above bypasses aliasing detection (different virtual memory adresses that are mapped to the same page). Memory mapping games are useful for persistent datastructures: You can create new copy-on-write mappings to the underlying fd. Unfortunately it is very hard on linux to create new cow-mappings to some range of virtual memory, cf https://github.com/JuliaLang/julia/pull/31630.

If you use the unsafe_wrap route, then you can also use Mmap.Anonymous(). For some reason, mmap(::Mmap.Anonymous, args...) needs explicit offset=0; I should probably file a bug for that.

1 Like

One condition states: “Generated functions … must be completely pure. Due to an implementation limitation, this also means that they currently cannot define a closure or generator.”

I would think that including $y in the quoted code counts as a closure, so is not allowed.

In any case, this is almost certainly not thread safe. That would require functions to be generated once per thread.

1 Like

There’s no need for a generated function here: you can simply create a function that closes over the buffer:

julia> const f = let buffer = zeros(5)
         function (x)
           buffer .+= x
           sum(buffer)
         end
       end
#7 (generic function with 1 method)

julia> f(1)
5.0

julia> f(2)
15.0
3 Likes

I guess the advantage of the generated function would be that it creates a new buffer for each type.

1 Like

That said, I wouldn’t necessarily recommend this approach. Is it too much of a burden in your use case to simply create a Workspace struct that the user can provide if they want to use pre-allocated memory? Keeping the buffer explicit and separate means that when you eventually want to go back and make this code thread-safe, you’ll be able to do so easily.

2 Likes

I am not sure — that paragraphs is a bit ambiguous (it talks about globals).

That said, I would not recommend this approach, for various reasons. However, in general, having a solution for this problem would be nice; something along the lines of cl:load-time-value.

Thanks everyone for the responses. I’m sold on this being a bad idea compared to alternatives.

My use case was working with sized mutable structs wrapping NTuples, like MArrays.
What I have been doing is @inline-ing every called function to let them be allocated on the stack. But I also think inlining everything is a bad idea. With larger objects, it also seems slower than the interpolating-into-a-generated-function approach.

I think I am going with the const global approach, combined with array types wrapping pointers (ie, carrying size information).

I will have to look into mmap to see how to properly use it.
My thought with using simple Vector{UInt8} or Vector{Vector{UInt8}} for multithreading were that I could initialize them with a length of 0. Assuming resize! will never actually free memory, I’d just have every function using the buffer resize! it first to make sure it’s large enough (if it might free memory, I’d instead check the length first, and only resize! if it will grow the buffer).
This doesn’t sound like a good plan for avoiding false sharing.

Getting to reuse this buffer across different sizes and element types, as well as being more likely to keep it hot in the cache are major advantages…but then, I’d have to be very careful in constraining what functions may actually use this so pointers don’t overlap. If I have to start managing lifetimes, the complexity would probably not be worthwhile.

My intended use case is for evaluating probability distributions and their gradients.
I have a macro that interpreters a statistical model, and does source to source reverse diff for use with DynamicHMC.jl.
I’m trying to figure out a good way to manage memory for these models. I would prefer not to force someone else to pass in a Workspace themselves. Maybe I should take that approach, but it would be nice to automate it.

1 Like

ForwardDiff used this method for a cache a couple of years ago. I think it worked kind of ok but it always felt fishy. The intention is basically to define a separate global variable for each method specialization without having to beforehand do a loop overall input types and eval the method and the constant.

Afaik resize! currently does not shrink the allocation, but sizehint! does.

However: A simple Vector{UInt8} does not have the right type, and if you unsafe_wrap then the resize becomes problematic.

But you could mmap a handful giant buffers during init and unsafe_wrap into them at runtime (never resize). Then, only the handful of bytes for the jl_array get allocated per invocation. The size of the mmap is mostly irrelevant until the memory gets faulted in, so feel free to overallocate generously (unless your OS decides to use giant pagesize).

Even if you don’t care about this now, globals have a clear route towards thread-safety.

I would not worry about that overly much. But if this happens, then it is nice to not rely on external allocators that don’t give you exact control over placemnt. Example: John McCalpin's blog » Blog Archive » SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

I also use generated functions in Measurements.jl to preallocate an empty structure, here is a related discussion: Question about use of generated functions. The utility of generated function compared to a constant is that I can preallocate the structure for any arbitrary type.

1 Like

Seems somewhat excessive to me to put the whole body inside the generated function instead of just the allocation (for example, with y = (@generated) ? Vector{T}() : Vector{T}()), but otherwise seems valid to me.

Cool. It would be handy to define

@generated preallocate(::Type{T}) where T = Vector{T}(undef, 0)

when I need it (so that I can minimize the body in @generated and also re-use it in different places). Extending it a bit, this

@generated preallocate(::Type{T}, key) where T = Vector{T}(undef, 0)

may be handy for preallocating distinct buffers of the same type. Can I use it to get different buffer for different key type? That is to say, does preallocate(Int, Val(:a)) !== preallocate(Int, Val(:b)) hold always? It seems it does ATM. I wonder if Julia compiler can become smart enough to know that key is not used so that it generates the same function for these two cases. Although maybe the compiler should be allowed to do that as the body of @generated has to be pure?

It can ignore key, since that’s unused.

Okay. Thanks for the clarification.