What is the best way to re-use a temporary vector

Hi,

I have the following situation: I have a function f(x::Float64) which creates a temporary array A::Vector{Float64} and finally returns a Float64. I typically call this function millions of times, so I am wondering whether there is a way to allocate a single vector A once and pass it to each call efficiently.
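To illustrate the pattern I have in mind with a toy example (the names f! and buf are hypothetical, not my real code):

function f!(buf::Vector{Float64}, x::Float64)
    # fill the pre-allocated buffer instead of creating a new vector on each call
    for k in eachindex(buf)
        buf[k] = x^k
    end
    return sum(buf)
end

buf = zeros(10)                                            # allocated once
results = [f!(buf, x) for x in range(-1, 1, length=100)]   # buf is reused for every call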


MWE: the dnPl function from LegendrePolynomials.jl, which computes the n-th derivative of a Legendre polynomial, can optionally take a pre-allocated cache. Here’s an example:

using BenchmarkTools, LegendrePolynomials

@btime dnPl(.5,10,4)
#  61.842 ns (2 allocations: 112 bytes)

cache=zeros(10) #temporary cache
@btime dnPl(.5,10,4,$cache)
#  49.053 ns (0 allocations: 0 bytes)

In my code, I need to evaluate dnPl for many inputs. For example

function without_cache(n) 
    A=zeros(n,10) 
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=dnPl(x,j,4)
    end 
    A 
end

@btime without_cache(30)
#  22.913 μs (693 allocations: 25.42 KiB)

The number of allocations seems consistent with the previous example: (2 allocations per dnPl call) times 300, plus some overhead. So I tried to create a local cache. However, I still see too many allocations:

function with_cache(n) 
    A=zeros(n,10) 
    cache=zeros(10)
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=dnPl(x,j,4,cache)
    end 
    A 
end

@btime with_cache(30)
#   19.767 μs (275 allocations: 6.81 KiB)

Do you have any ideas on how to re-use the β€œcache” without having these allocations?

Use a callable object? (For example: FinEtoolsFlexStructures.jl/src/FEMMShellT3FFModule.jl at 9b70a29592010aa9edeabc8f47aa348ebadc223b · PetrKryslUCSD/FinEtoolsFlexStructures.jl · GitHub)
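Roughly, the idea is something like this (a sketch only, with a hypothetical name; the struct holds just the reusable buffer and the varying arguments go to the call):

struct DnPlEvaluator
    cache::Vector{Float64}
end

(p::DnPlEvaluator)(x, l, n) = dnPl(x, l, n, p.cache)

ev = DnPlEvaluator(zeros(10))   # construct once
ev(0.5, 10, 4)                  # reuses the same cache on every call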

Actually, I think dnPl allocates.

Unrelated to your question about allocations, but you probably want to flip the order of your loops (i, j) so that A is accessed in column-major order (to avoid cache misses).

You could use Bumper.jl

I might be missing something here, but even Bumper.jl seems to produce a similar number of allocations (though I could be misusing it):

using Bumper

function with_bumper(n)
    A=zeros(n,10)
    @no_escape begin 
        cache=@alloc(Float64,10)
        for j=1:10, i=1:n
            x=-1+2i/n
            A[i,j]=dnPl(x,j,4,cache)
        end 
    end #no_escape
    A 
end

@btime with_bumper(30)
# 18.692 μs (273 allocations: 6.67 KiB)

It could be the case that dnPl is allocating as @PetrKryslUCSD mentioned. However, I did not see any allocation (see my original post).


It might be that it allocates for some inputs.
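One quick way to check (a sketch, reusing the four-argument dnPl method from above) is to record @allocated for a range of inputs after a warm-up call:

function check_allocs(cache)
    dnPl(0.5, 10, 4, cache)                  # warm up so compilation is not counted
    for l in 4:10, x in (-0.9, 0.0, 0.9)
        bytes = @allocated dnPl(x, l, 4, cache)
        bytes > 0 && println("l = $l, x = $x allocates $bytes bytes")
    end
end

check_allocs(zeros(10))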

Thanks, I went ahead and attempted that: there are no allocations when I simply call the object, but the problem appears when I use it in a loop with the cache created outside:

mutable struct my_dnPl
    x::Float64
    n::Int64
    k::Int64 
    cache::Vector{Float64}
end
function (p::my_dnPl)() 
    return dnPl(p.x,p.n,p.k,p.cache) 
end

cache=zeros(10)
@btime my_dnPl(.4,10,3,$cache)()
#  63.603 ns (0 allocations: 0 bytes)

function with_callable_obj(n)
    A=zeros(n,10) 
    cache=zeros(10)
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=my_dnPl(x,j,4,cache)()
    end 
    A 
end


@btime with_callable_obj(30)
# 21.728 μs (575 allocations: 20.88 KiB)

When a plain struct is used instead of a mutable struct, I still see 275 allocations. So maybe dnPl is allocating somewhere I can’t see in the code.
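Part of the gap is presumably the mutable struct itself: constructing a new my_dnPl in every iteration costs one allocation per call. A sketch that constructs the object once and mutates its fields (the remaining allocations from dnPl are unaffected):

function with_reused_obj(n)
    A = zeros(n, 10)
    p = my_dnPl(0.0, 1, 4, zeros(10))   # construct the callable object once
    for j = 1:10, i = 1:n
        p.x = -1 + 2i/n                 # mutate the fields instead of rebuilding the object
        p.n = j
        A[i, j] = p()
    end
    A
end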

Having said that, I have found a workaround using another function from the same package that solves this problem, which I will include below in case someone runs into the same issue:

function with_collect(n) 
    A=zeros(n,10) 
    cache=zeros(10+1) #collectdnPl! also returns P_0
    for i=1:n 
        x=-1+2i/n
        collectdnPl!(cache,x,lmax=10,n=4)
        A[i,:]=@views cache[2:end]
    end
    A
end

@btime with_collect(30) 
#  2.204 μs (5 allocations: 2.59 KiB)

It does seem like dnPl allocates, though not very much or often (the timings below use the @timeit macro from TimerOutputs.jl):

julia> function with_cache(n)
           A=zeros(n,10)
           cache=zeros(10)
           for i=1:n,j=1:10
               @timeit "x" x=-1+2i/n
               @timeit "A" A[i,j]=dnPl(x,j,4,cache)
           end
           A
       end
with_cache (generic function with 1 method)

julia> @btime with_cache(30);
  83.917 μs (272 allocations: 6.98 KiB)

julia> print_timer()
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations
                   ───────────────────────   ────────────────────────
 Tot / % measured:      14.3s /  23.9%            475MiB /  57.8%

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 A          20.0M    2.97s   86.8%   149ns    274MiB  100.0%    14.4B
 x          20.0M    454ms   13.2%  22.7ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────

How to speed up loops in Matrix

Slow for big matrices

# Slow way: the inner loop runs over columns, so a is traversed across a row (large memory strides)
function incmatrixSLOW(a)
    (m,n) = size(a)
    for row=1:m , col=1:n
        a[row,col] += 1
    end
end


Fast for big matrices

# Fast way: the inner loop runs over rows, so a is traversed down a column (contiguous memory)
function incmatrixFAST(a)
    (m,n) = size(a)
    for col=1:n , row=1:m 
        a[row,col] += 1
    end
end

Fast by getting julia to do it for you

function incmatrixEACHINDEX(a)
    for k in eachindex(a)
        a[k] += 1
    end
end

You can fix that via

julia> @eval LegendrePolynomials function doublefactorial(::Type{T}, n) where T
           p = convert(T, one(n))
           for i in n:-2:1
               p *= convert(T, i)
           end
           convert(T, p)
       end

in order to replace the original

function doublefactorial(T, n)
    p = convert(T, one(n))
    for i in n:-2:1
        p *= convert(T, i)
    end
    convert(T, p)
end

Somebody might need to do a PR on that for LegendrePolynomials.jl.

However, I would be much more interested in why this goes wrong here and in getting it fixed in the compiler!

This is clearly a failure of constant propagation / inlining / specialization / type inference. You don’t see it when benchmarking dnPl in isolation because that puts the inlining / const-prop heuristics in a different state.

The bad pattern, using T as an argument when you really mean ::Type{T}, is very common. It would be better if we could make it reliably do the right thing.
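For reference, a minimal sketch of the two spellings (hypothetical function names):

# T is an ordinary argument that happens to be a Type; under the default
# heuristics the method is not necessarily specialized on the concrete type.
as_value(T, x) = convert(T, x)

# ::Type{T} puts T into the method signature, forcing a specialization per type.
as_type_param(::Type{T}, x) where {T} = convert(T, x)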

PS. I looked at

julia> import Profile
julia> with_cache(30); Profile.Allocs.clear();Profile.Allocs.@profile with_cache(30); p = Profile.Allocs.fetch(); Profile.print(p)
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
   ╎432 @Base/client.jl:541; _start()
   ╎ 432 @Base/client.jl:567; repl_main
   ╎  432 @Base/client.jl:430; run_main_repl(interactive::Bool, quiet::Bool, banner::Symbol, history_file::Bool, color_set::Bool)
   ╎   432 @Base/essentials.jl:1052; invokelatest
   ╎    432 @Base/essentials.jl:1055; #invokelatest#2
   ╎     432 @Base/client.jl:446; (::Base.var"#1139#1141"{Bool, Symbol, Bool})(REPL::Module)
   ╎    ╎ 432 @REPL/src/REPL.jl:469; run_repl(repl::REPL.AbstractREPL, consumer::Any)
   ╎    ╎  432 @REPL/src/REPL.jl:483; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
   ╎    ╎   432 @REPL/src/REPL.jl:324; kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, consumer::Any)
   ╎    ╎    432 @REPL/src/REPL.jl:327; start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
   ╎    ╎     432 @REPL/src/REPL.jl:342; repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
   ╎    ╎    ╎ 432 @REPL/src/REPL.jl:245; eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
   ╎    ╎    ╎  432 @Base/boot.jl:430; eval
   ╎    ╎    ╎   432 REPL[3]:6; with_cache(n::Int64)
   ╎    ╎    ╎    432 @LegendrePolynomials/src/LegendrePolynomials.jl:352; dnPl
   ╎    ╎    ╎     432 @LegendrePolynomials/src/LegendrePolynomials.jl:284; _unsafednPl!
 64╎    ╎    ╎    ╎ 64  @LegendrePolynomials/src/LegendrePolynomials.jl:236; doublefactorial(T::Type, n::Int64)
368╎    ╎    ╎    ╎ 368 @LegendrePolynomials/src/LegendrePolynomials.jl:238; doublefactorial(T::Type, n::Int64)
Total snapshots: 27
Total bytes: 432

From there you can take a look at the relevant source code, and you immediately see the code smell (the proper spelling has ::Type{T} as the argument; also, that function should be @inline).
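In other words, something along these lines (a sketch of the suggested change, not the library’s current code):

@inline function doublefactorial(::Type{T}, n) where {T}
    p = convert(T, one(n))
    for i in n:-2:1
        p *= convert(T, i)
    end
    convert(T, p)
end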

On the other hand, seeing the code in isolation, I would not have expected it to actually cause trouble, especially given that it works fine when you benchmark dnPl in isolation.


There is no single “right thing”; one has to be wary of both over- and under-specialization while programming. In fact, what I’d like to see is more support for explicit control over what gets specialized. Relevant issue:


Yes. And in this specific case, it is very obvious that the LegendrePolynomials.jl author meant ::Type{T}, i.e. wanted to specialize.

But inlining + const-prop does the right specialization despite their mistake in many contexts, e.g. @btime dnPl($x, $n, $ell, $cache). So we end up with a situation where reasonable-ish testing does not uncover a serious performance bug in a library :frowning:

I’m not sure how to approach this. Maybe we need a command-line switch / build-time switch for debug/testing that only gives you forced specializations? Such that context can never hide such a bug?

The specialization rules are full of weirdness already, like the magic rule that β€œif the function is called, then it is specialized”. Sure.

Why not β€œif a function body contains a loop, then it is specialized”?

Or why not get rid of the magic β€œdon’t specialize on <:Function or <:Type” rule completely?

I’m not saying that always specializing is the right thing. It’s more a question of which bugs happen for which people and how hard they are to spot.

My guess is that the people who would need many more @nospecialize annotations are the ones who are more aware of Performance Tips · The Julia Language than the people who today need to force specializations in their code.


This fixes the issue, thank you. I will run a few tests and then submit a PR. I have a question, though: what does @eval do here?

It evaluates the code block inside the module LegendrePolynomials, basically exactly as if you had typed it inside the package’s code. Thus it overwrites the original definition.
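A minimal illustration with a throwaway module (hypothetical names):

module M
    f() = 1
end

@eval M f() = 2    # evaluated inside M, exactly as if written in M's source
M.f()              # now returns 2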

I am not sure, though, what the difference is compared to just overwriting the method:

julia> function LegendrePolynomials.doublefactorial(::Type{T}, n) where T
           p = convert(T, one(n))
           for i in n:-2:1
               p *= convert(T, i)
           end
           convert(T, p)
       end

I’m not sure if these are related statements, since this function returns an array of size (n, 10) instead of a scalar. However, whenever I see an array being allocated to store highly structured contents, my mind immediately goes to using Iterators to eliminate allocations. For the function above, this might look something like

function with_iterators(n)
    i = 1:n
    j = 1:10
    ij = Iterators.product(i, j)
    # each element of ij is a tuple (i, j), hence the destructuring in the anonymous function
    Iterators.map(((i, j),) -> dnPl(-1 + 2i/n, j, 4), ij)
end

which should produce an iterator that lazily evaluates the dnPl call as it is iterated, instead of allocating memory to store all of the results.

If you’re going to immediately reduce the results anyway, like the function you mentioned at the top where ultimately f = x::Float64 -> y::Float64, you can perform the reduction without ever allocating the array in the first place, e.g. sum(f, with_iterators(n)), reduce(f, with_iterators(n)), etc.
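For example, reducing the output of with_iterators directly never materializes the (n, 10) array:

it = with_iterators(30)
total = sum(it)           # the (n, 10) result array is never allocated
peak  = maximum(abs, it)  # the iterator can be re-used for further reductions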


Thanks, using Iterators instead of allocating A is definitely better. However, the issue here apparently has to do with dnPl and the way one of its internal functions is declared.