What is the best way to re-use a temporary vector

Hi,

I have the following situation: I have a function f(x::Float64) which creates a temporary array A::Vector{Float64} and finally returns a Float64. I typically call this function millions of times, so I am wondering whether there is a way to allocate a single vector A once and pass it to each call efficiently.
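To illustrate the pattern I have in mind with a toy example (the names f! and buf are hypothetical, not my real code):

function f!(buf::Vector{Float64}, x::Float64)
    # fill the pre-allocated buffer instead of creating a new vector on each call
    for k in eachindex(buf)
        buf[k] = x^k
    end
    return sum(buf)
end

buf = zeros(10)                                            # allocated once
results = [f!(buf, x) for x in range(-1, 1, length=100)]   # buf is reused for every call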


MWE: the dnPl function from LegendrePolynomials.jl, which computes the n-th derivative of a Legendre polynomial, can optionally take a pre-allocated cache. Here’s an example:

using BenchmarkTools, LegendrePolynomials

@btime dnPl(.5,10,4)
#  61.842 ns (2 allocations: 112 bytes)

cache=zeros(10) #temporary cache
@btime dnPl(.5,10,4,$cache)
#  49.053 ns (0 allocations: 0 bytes)

In my code, I need to evaluate dnPl for many inputs. For example

function without_cache(n) 
    A=zeros(n,10) 
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=dnPl(x,j,4)
    end 
    A 
end

@btime without_cache(30)
#  22.913 μs (693 allocations: 25.42 KiB)

The number of allocations seems consistent with the previous example: (2 allocations per dnPl call) times 300, plus some overhead. So I tried to create a local cache. However, I still see too many allocations:

function with_cache(n) 
    A=zeros(n,10) 
    cache=zeros(10)
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=dnPl(x,j,4,cache)
    end 
    A 
end

@btime with_cache(30)
#   19.767 μs (275 allocations: 6.81 KiB)

Do you have any ideas on how to re-use the β€œcache” without having these allocations?

Use a callable object? (For example: FinEtoolsFlexStructures.jl/src/FEMMShellT3FFModule.jl at 9b70a29592010aa9edeabc8f47aa348ebadc223b · PetrKryslUCSD/FinEtoolsFlexStructures.jl · GitHub)
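Roughly, the idea is something like this (a sketch only, with a hypothetical name; the struct holds just the reusable buffer and the varying arguments go to the call):

struct DnPlEvaluator
    cache::Vector{Float64}
end

(p::DnPlEvaluator)(x, l, n) = dnPl(x, l, n, p.cache)

ev = DnPlEvaluator(zeros(10))   # construct once
ev(0.5, 10, 4)                  # reuses the same cache on every call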

Actually, I think dnPl allocates.

Unrelated to your question about allocations, but you probably want to flip the order of your loops (i, j) so that A is accessed in column-major order (to avoid cache misses).

You could use Bumper.jl

I might be missing something here, but even Bumper.jl seems to produce a similar number of allocations (though I could be misusing it):

using Bumper

function with_bumper(n)
    A=zeros(n,10)
    @no_escape begin 
        cache=@alloc(Float64,10)
        for j=1:10, i=1:n
            x=-1+2i/n
            A[i,j]=dnPl(x,j,4,cache)
        end 
    end #no_escape
    A 
end

@btime with_bumper(30)
# 18.692 μs (273 allocations: 6.67 KiB)

It could be the case that dnPl is allocating as @PetrKryslUCSD mentioned. However, I did not see any allocation (see my original post).


It might be that it allocates for some inputs.
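One quick way to check (a sketch, reusing the four-argument dnPl method from above) is to record @allocated for a range of inputs after a warm-up call:

function check_allocs(cache)
    dnPl(0.5, 10, 4, cache)                  # warm up so compilation is not counted
    for l in 4:10, x in (-0.9, 0.0, 0.9)
        bytes = @allocated dnPl(x, l, 4, cache)
        bytes > 0 && println("l = $l, x = $x allocates $bytes bytes")
    end
end

check_allocs(zeros(10))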

Thanks, I went ahead and attempted that: there are no allocations when I simply call the object, but the problem appears when I use it in a loop with the cache created outside:

mutable struct my_dnPl
    x::Float64
    n::Int64
    k::Int64 
    cache::Vector{Float64}
end
function (p::my_dnPl)() 
    return dnPl(p.x,p.n,p.k,p.cache) 
end

cache=zeros(10)
@btime my_dnPl(.4,10,3,$cache)()
#  63.603 ns (0 allocations: 0 bytes)

function with_callable_obj(n)
    A=zeros(n,10) 
    cache=zeros(10)
    for i=1:n,j=1:10
        x=-1+2i/n
        A[i,j]=my_dnPl(x,j,4,cache)()
    end 
    A 
end


@btime with_callable_obj(30)
# 21.728 μs (575 allocations: 20.88 KiB)

When a plain struct is used instead of a mutable struct, I still see 275 allocations. So maybe dnPl is allocating somewhere I can’t see in the code.
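Part of the gap is presumably the mutable struct itself: constructing a new my_dnPl in every iteration costs one allocation per call. A sketch that constructs the object once and mutates its fields (the remaining allocations from dnPl are unaffected):

function with_reused_obj(n)
    A = zeros(n, 10)
    p = my_dnPl(0.0, 1, 4, zeros(10))   # construct the callable object once
    for j = 1:10, i = 1:n
        p.x = -1 + 2i/n                 # mutate the fields instead of rebuilding the object
        p.n = j
        A[i, j] = p()
    end
    A
end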

Having said that, I have found a workaround using another function from the same package that solves this problem, which I will include below in case someone runs into the same issue:

function with_collect(n) 
    A=zeros(n,10) 
    cache=zeros(10+1) #collectdnPl! also returns P_0
    for i=1:n 
        x=-1+2i/n
        collectdnPl!(cache,x,lmax=10,n=4)
        A[i,:]=@views cache[2:end]
    end
    A
end

@btime with_collect(30) 
#  2.204 μs (5 allocations: 2.59 KiB)

It does seem like dnPl allocates, though not very much or often (the timings below use the @timeit macro from TimerOutputs.jl):

julia> function with_cache(n)
           A=zeros(n,10)
           cache=zeros(10)
           for i=1:n,j=1:10
               @timeit "x" x=-1+2i/n
               @timeit "A" A[i,j]=dnPl(x,j,4,cache)
           end
           A
       end
with_cache (generic function with 1 method)

julia> @btime with_cache(30);
  83.917 μs (272 allocations: 6.98 KiB)

julia> print_timer()
 ────────────────────────────────────────────────────────────────────
                            Time                    Allocations
                   ───────────────────────   ────────────────────────
 Tot / % measured:      14.3s /  23.9%            475MiB /  57.8%

 Section   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────
 A          20.0M    2.97s   86.8%   149ns    274MiB  100.0%    14.4B
 x          20.0M    454ms   13.2%  22.7ns     0.00B    0.0%    0.00B
 ────────────────────────────────────────────────────────────────────

How to speed up loops in Matrix

Slow for big matrices

# Slow way: the inner loop runs over columns, so a is traversed across a row (large memory strides)
function incmatrixSLOW(a)
    (m,n) = size(a)
    for row=1:m , col=1:n
        a[row,col] += 1
    end
end


Fast for big matrices

# Fast way: the inner loop runs over rows, so a is traversed down a column (contiguous memory)
function incmatrixFAST(a)
    (m,n) = size(a)
    for col=1:n , row=1:m 
        a[row,col] += 1
    end
end

Fast by getting julia to do it for you

function incmatrixEACHINDEX(a)
    for k in eachindex(a)
        a[k] += 1
    end
end

You can fix that via

julia> @eval LegendrePolynomials function doublefactorial(::Type{T}, n) where T
           p = convert(T, one(n))
           for i in n:-2:1
               p *= convert(T, i)
           end
           convert(T, p)
       end

in order to replace the original

function doublefactorial(T, n)
    p = convert(T, one(n))
    for i in n:-2:1
        p *= convert(T, i)
    end
    convert(T, p)
end

Somebody might need to do a PR on that for LegendrePolynomials.jl.

However, I would be much more interested in why this goes wrong here and in getting it fixed in the compiler!

This is clearly a failure of constant propagation / inlining / specialization / type inference. You don’t see it when benchmarking dnPl in isolation because that puts the inlining / const-prop heuristics in a different state.

The bad pattern, using T as an argument when you really mean ::Type{T}, is very common. It would be better if we could make it reliably do the right thing.
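For reference, a minimal sketch of the two spellings (hypothetical function names):

# T is an ordinary argument that happens to be a Type; under the default
# heuristics the method is not necessarily specialized on the concrete type.
as_value(T, x) = convert(T, x)

# ::Type{T} puts T into the method signature, forcing a specialization per type.
as_type_param(::Type{T}, x) where {T} = convert(T, x)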

PS. I looked at

julia> import Profile
julia> with_cache(30); Profile.Allocs.clear();Profile.Allocs.@profile with_cache(30); p = Profile.Allocs.fetch(); Profile.print(p)
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
   ╎432 @Base/client.jl:541; _start()
   ╎ 432 @Base/client.jl:567; repl_main
   ╎  432 @Base/client.jl:430; run_main_repl(interactive::Bool, quiet::Bool, banner::Symbol, history_file::Bool, color_set::Bool)
   ╎   432 @Base/essentials.jl:1052; invokelatest
   ╎    432 @Base/essentials.jl:1055; #invokelatest#2
   ╎     432 @Base/client.jl:446; (::Base.var"#1139#1141"{Bool, Symbol, Bool})(REPL::Module)
   ╎    ╎ 432 @REPL/src/REPL.jl:469; run_repl(repl::REPL.AbstractREPL, consumer::Any)
   ╎    ╎  432 @REPL/src/REPL.jl:483; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
   ╎    ╎   432 @REPL/src/REPL.jl:324; kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, consumer::Any)
   ╎    ╎    432 @REPL/src/REPL.jl:327; start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
   ╎    ╎     432 @REPL/src/REPL.jl:342; repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
   ╎    ╎    ╎ 432 @REPL/src/REPL.jl:245; eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
   ╎    ╎    ╎  432 @Base/boot.jl:430; eval
   ╎    ╎    ╎   432 REPL[3]:6; with_cache(n::Int64)
   ╎    ╎    ╎    432 @LegendrePolynomials/src/LegendrePolynomials.jl:352; dnPl
   ╎    ╎    ╎     432 @LegendrePolynomials/src/LegendrePolynomials.jl:284; _unsafednPl!
 64╎    ╎    ╎    ╎ 64  @LegendrePolynomials/src/LegendrePolynomials.jl:236; doublefactorial(T::Type, n::Int64)
368╎    ╎    ╎    ╎ 368 @LegendrePolynomials/src/LegendrePolynomials.jl:238; doublefactorial(T::Type, n::Int64)
Total snapshots: 27
Total bytes: 432

From there you can take a look at the relevant source code, and you immediately see the code smell (the proper spelling has ::Type{T} as the argument; also, that function should be @inline).
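In other words, something along these lines (a sketch of the suggested change, not the library’s current code):

@inline function doublefactorial(::Type{T}, n) where {T}
    p = convert(T, one(n))
    for i in n:-2:1
        p *= convert(T, i)
    end
    convert(T, p)
end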

On the other hand, seeing the code in isolation, I would not have expected it to actually cause trouble, especially given that it works fine when you benchmark dnPl in isolation.


There is no single “right thing”; one has to be wary of both over- and under-specialization while programming. In fact, what I’d like to see is more support for explicit control over what gets specialized. Relevant issue:


Yes. And in this specific case, it is very obvious that the LegendrePolynomials.jl author meant ::Type{T}, i.e. wanted to specialize.

But inlining + const-prop does the right specialization despite their mistake in many contexts, e.g. @btime dnPl($x, $n, $ell, $cache). So we end up with a situation where reasonable-ish testing does not uncover a serious performance bug in a library :frowning:

I’m not sure how to approach this. Maybe we need a command-line switch / build-time switch for debug/testing that only gives you forced specializations? Such that context can never hide such a bug?

The specialization rules are full of weirdness already, like the magic rule that β€œif the function is called, then it is specialized”. Sure.

Why not β€œif a function body contains a loop, then it is specialized”?

Or why not get rid of the magic β€œdon’t specialize on <:Function or <:Type” rule completely?

I’m not saying that always specializing is the right thing. It’s more a question of which bugs happen for which people and how hard they are to spot.

My guess is that the people who would need many more @nospecialize annotations are the ones who are more aware of Performance Tips · The Julia Language than the people who today need to force specializations in their code.


This fixes the issue, thank you. I will run a few tests and then submit a PR. I have a question, though: what does @eval do here?

It evaluates the code block inside the module LegendrePolynomials, basically exactly as if you had typed it inside the package’s code. Thus it overwrites the original definition.
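A minimal illustration with a throwaway module (hypothetical names):

module M
    f() = 1
end

@eval M f() = 2    # evaluated inside M, exactly as if written in M's source
M.f()              # now returns 2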

I am not sure, though, what the difference is compared to just overwriting the method:

julia> function LegendrePolynomials.doublefactorial(::Type{T}, n) where T
           p = convert(T, one(n))
           for i in n:-2:1
               p *= convert(T, i)
           end
           convert(T, p)
       end

I’m not sure if these are related statements, since this function returns an array of size (n, 10) instead of a scalar. However, whenever I see an array being allocated to store highly structured contents, my mind immediately goes to using Iterators to eliminate allocations. For the function above, this might look something like

function with_iterators(n)
    i = 1:n
    j = 1:10
    ij = Iterators.product(i, j)
    # each element of ij is a tuple (i, j), hence the destructuring in the anonymous function
    Iterators.map(((i, j),) -> dnPl(-1 + 2i/n, j, 4), ij)
end

which should produce an iterator that lazily evaluates the dnPl call as it is iterated, instead of allocating memory to store all of the results.

If you’re going to immediately reduce the results anyway, like the function you mentioned at the top where ultimately f = x::Float64 -> y::Float64, you can perform the reduction without ever allocating the array in the first place, e.g. sum(f, with_iterators(n)), reduce(f, with_iterators(n)), etc.
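For example, reducing the output of with_iterators directly never materializes the (n, 10) array:

it = with_iterators(30)
total = sum(it)           # the (n, 10) result array is never allocated
peak  = maximum(abs, it)  # the iterator can be re-used for further reductions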


Thanks, using Iterators instead of allocating A is definitely better. However, the issue here apparently has to do with dnPl and the way one of its internal functions is declared.