Unclear allocation behaviour with built-in sum()

Hello,

Introduction

I’ve been trying to process some data and I’ve run into crippling allocations when using the built-in sum(). I have a mid-level understanding of how to improve performance in Julia, but for the issue presented below I am not sure how to search the web or analyze the code for missed potential. The usual @code_warntype, @profview and @profview_allocs don’t reveal anything to me, though maybe they would to somebody more experienced.

Problem Statement

The very short version is that I need to compute a series of arrays A[i, t], which I then use to produce a number of results. One of those results has the form B[t] = sum(A[i, t] * v[i] for i in 1:n).
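Equivalently, in matrix notation, B is just the matrix-vector product A^{\top} v, with B[t] = \sum_{i=1}^{n} A[i, t]\, v[i].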

Since A[i, t] = f(A[i, t-1]), I initially wrote a function that computes all of A, using @threads to speed up the process. This works well until A fills all available RAM. So I decided to write a version that computes A[i, t] in chunks, each of which I use to compute the corresponding part of B[t] and then overwrite with the next chunk.

Here’s the problem: when I write the function along the lines of the minimal example below, wrapping the sum() in a while loop and a for loop (the latter with and without @threads), the memory allocations grow by roughly 11x.

MWE

1. Using sum()

1.1 Full Run

using BenchmarkTools
import Base.Threads: @threads

const n = 128^3
const t = 500
const v = rand(n)

function A_complete(n = n, t = t)
    A = zeros(n, t)
    
    # each row i evolves independently via A[i, τ] = A[i, τ-1] + sin(i)
    @threads for i in 1:n 
        for τ in 2:t
            A[i, τ] = A[i, τ - 1] + sin(i)
        end
    end

    return A
end

function B_complete(n = n, t = t)
    B = zeros(t)
    
    A = A_complete(n, t)
    @threads for τ in 2:t
        B[τ] = sum(A[i, τ] * v[i] for i in 1:n)
    end

    return B
end

1.2 Benchmark: Full Run

Running this version under the @btime macro shows allocations in line with the expectation set by the preallocation at the start of A_complete(), A = zeros(n, t): at 8 bytes per Float64 in an n-by-t array, that comes to 8nt/1024^3 = 7.8125 \mathrm{GiB}, in accordance with the macro output.
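(The arithmetic, for reference:)

julia> 8 * 128^3 * 500 / 1024^3    # bytes per Float64 × n × t, converted to GiB
7.8125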

julia> @btime B_complete();
   19.605 s (170 allocations: 7.81 GiB)

1.3 Run With Chunks

function A_chunk(n = n; chunk_size, A_previous)
    A = zeros(n, chunk_size)

    @threads for i in 1:n 
        A[i, 1] = A_previous[i] + sin(i)
        
        for τ in 2:chunk_size
            A[i, τ] = A[i, τ - 1] + sin(i)
        end
    end 

    return A
end

function B_chunk(n = n, t = t; chunk_size)
    B = zeros(t)
    
    t_lo, t_hi = 2, chunk_size + 1   # columns of B covered by the current chunk
    A_previous = zeros(n)            # carries the last column of the previous chunk
    A = zeros(n, chunk_size)

    while t_hi < t
        A = A_chunk(; chunk_size, A_previous)
        A_previous = A[:, end]

        @threads for τ in t_lo:t_hi
            B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
        end

        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end    
    
    # final, possibly shorter, chunk covering the remaining columns t_lo:t
    A = A_chunk(; chunk_size = t - t_lo + 1, A_previous)
    @threads for τ in t_lo:t
        B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
    end

    return B
end

1.4 Benchmark: Run With Chunks

The above implementation is very wasteful, with roughly 11x the allocations…

julia> @btime B_chunk(; chunk_size=125);
  56.042 s (5232139470 allocations: 87.78 GiB)
julia> @btime B_chunk(; chunk_size=248);
  56.273 s (5232139292 allocations: 89.68 GiB)

2. Run With \cdot

2.1 Full Run

Notably, this doesn’t use @threads, because \cdot is already multithreaded and the macro would just add overhead.
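(As an aside, BLAS’s thread count is separate from Julia’s and can be checked or tuned via the LinearAlgebra stdlib:)

using LinearAlgebra

BLAS.get_num_threads()    # how many threads BLAS itself uses (machine-dependent)
BLAS.set_num_threads(8)   # tunable if BLAS and Julia's @threads compete for cores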

import LinearAlgebra: ⋅

function B_complete(n = n, t = t)
    B = zeros(t)
    
    A = A_complete(n, t)
    for τ in 2:t
        B[τ] = A[:, τ] ⋅ v
    end

    return B
end

2.2 Benchmark: Full Run

This approach runs just as fast while allocating double the memory: each A[:, τ] slice copies a full column of n Float64s, and over t columns those copies add up to roughly another 7.8 GiB on top of A itself.

julia> @btime B_complete();
  19.169 s (1585 allocations: 15.61 GiB)

2.3 Run With Chunks

function A_chunk(n = n; chunk_size, A_previous)
    A = zeros(n, chunk_size)

    @threads for i in 1:n 
        A[i, 1] = A_previous[i] + sin(i)
        
        for τ in 2:chunk_size
            A[i, τ] = A[i, τ - 1] + sin(i)
        end
    end 

    return A
end

function B_chunk(n = n, t = t; chunk_size)
    B = zeros(t)
    
    t_lo, t_hi = 2, chunk_size + 1
    A_previous = zeros(n)
    A = zeros(n, chunk_size)

    while t_hi < t
        A = A_chunk(; chunk_size, A_previous)
        A_previous = A[:, end]

        for τ in t_lo:t_hi
            B[τ] = A[:, τ - t_lo + 1] ⋅ v
        end

        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end    

    A = A_chunk(; chunk_size = t - t_lo + 1, A_previous)
    for τ in t_lo:t
        B[τ] = A[:, τ - t_lo + 1] ⋅ v
    end

    return B
end

2.4 Benchmark: Run With Chunks

It can be the same as, worse than, or better than the full run with \cdot, depending on chunk_size. Interestingly, this version of the function seems to strongly prefer a small chunk_size of 14, possibly related to the number of threads available, 16.

julia> @btime B_chunk(chunk_size = 14);
  6.950 s (4670 allocations: 16.38 GiB)
julia> @btime B_chunk(chunk_size = 25);
  11.118 s (3262 allocations: 16.30 GiB)
julia> @btime B_chunk(; chunk_size=125);
  21.819 s (1854 allocations: 17.61 GiB)

Conclusions?

For now, I’ve refactored my code to use \cdot. But I am still rather puzzled as to why the version of B_chunk() presented in 1.3 using sum() should be so slow, at least relative to the same-sized B_complete(). Reading the code, B_chunk() seems to perform the same steps as B_complete(), except that every chunk_size steps it renews the A matrix.

Initially I thought the issue was my lack of experience with multithreading, but when I ran @btime on a version of B_chunk() without @threads, the allocations remained the same.

mul!(B, A', v) seems to be what you’re looking for. Or just B = A' * v if you don’t have B preallocated. You might have to use @view to trim off the τ == 1 case, since you seem to skip that index, but it looks like your A already has the structure such that trimming isn’t necessary.
mul!/* are already multithreaded for some AbstractArray types like Matrix{Float64}, which you appear to use here, because they call BLAS, and BLAS is multithreaded.
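Something like this (a toy-sized sketch in a fresh session; the t_lo/t_hi bookkeeping is borrowed from your B_chunk, so treat it as an adaptation rather than a drop-in patch):

using LinearAlgebra: mul!

n, chunk_size = 4, 3
A = rand(n, chunk_size)   # stands in for one chunk of the full array
v = rand(n)
B = zeros(8)
t_lo, t_hi = 2, 2 + chunk_size - 1

# One in-place BLAS call replaces the whole `for τ in t_lo:t_hi` loop;
# it fills B[τ] with sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n).
mul!(@view(B[t_lo:t_hi]), A', v)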

Array slicing allocates new arrays, which is probably hurting you here. Consider using A_previous = @view A[:, end] and B[τ] = @view(A[:, τ]) ⋅ v in situations like this.
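A tiny illustration of both patterns (toy sizes, fresh session):

using LinearAlgebra: ⋅

A = rand(4, 3); v = rand(4); B = zeros(3)

A_previous = @view A[:, end]      # aliases the last column instead of copying it
for τ in 1:3
    B[τ] = @view(A[:, τ]) ⋅ v     # no per-iteration column copy
end

One caveat worth flagging: a view aliases its parent, so if A is later overwritten in place, the data seen through A_previous changes with it; copy (via copy() or .=) when the old column has to survive.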

I didn’t get deep into the rest of your code because it’s a little complex and it’s early here, but maybe start with some of these pieces.


In 1.3, you re-use the name A many times. Writing sum(A[i] for i in ...) creates a closure for the generator, and I’m quite sure that Julia isn’t able to prove that A doesn’t get reassigned in the midst of computing the sum over that generator. So it dramatically pessimizes each and every access with a Box in order to handle a wild reassignment.

In short, without looking any deeper, this is almost certainly Julia#15276, one of the last remaining infamous issue numbers I have memorized. The standard workarounds are to either:

  • use a distinct name for each of those A = assignments
  • introduce a let A=A block around each @threads loop to simplify the scope that Julia has to examine (sketched below).
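For concreteness, here is what the second workaround could look like applied to your B_chunk from 1.3 (the name B_chunk_let is just for illustration). Note that by the same boxing argument, t_lo is also reassigned across while-loop iterations and captured by the generator, so it would need rebinding alongside A:

function B_chunk_let(n = n, t = t; chunk_size)
    B = zeros(t)

    t_lo, t_hi = 2, chunk_size + 1
    A_previous = zeros(n)

    while t_hi < t
        A = A_chunk(; chunk_size, A_previous)
        A_previous = A[:, end]

        # fresh, never-reassigned bindings for the closures to capture
        let A = A, t_lo = t_lo
            @threads for τ in t_lo:t_hi
                B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
            end
        end

        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end

    let A = A_chunk(; chunk_size = t - t_lo + 1, A_previous), t_lo = t_lo
        @threads for τ in t_lo:t
            B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
        end
    end

    return B
end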

Thank you for your suggestions!

It seems I had completely forgotten how mutability works :sob:. I have redesigned my functions so that B_chunk defines a buffer A_buffer = Array{Float64, 2}(undef, n, chunk_size) and an initial state A_previous = zeros(n), while A_chunk! simply updates them in place. With this change, along with using @view(s) where necessary, I was able to greatly reduce the allocations, down to \propto 8\mathtt{n}\times\mathtt{chunk\_size} bytes.
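Roughly, the mutating version looks like this (a simplified sketch of the design just described, not my exact code):

import Base.Threads: @threads

# In-place version of A_chunk: fills the preallocated buffer column by
# column, then copies (not aliases!) the last column into A_previous.
function A_chunk!(A_buffer, A_previous; chunk_size)
    @threads for i in eachindex(A_previous)
        A_buffer[i, 1] = A_previous[i] + sin(i)
        for τ in 2:chunk_size
            A_buffer[i, τ] = A_buffer[i, τ - 1] + sin(i)
        end
    end
    A_previous .= @view A_buffer[:, chunk_size]
    return nothing
end

# Inside B_chunk: allocate once, reuse across all chunks.
# A_buffer = Array{Float64, 2}(undef, n, chunk_size)
# A_previous = zeros(n)
# A_chunk!(A_buffer, A_previous; chunk_size)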

Hey, thank you for your reply!

While I was able to get my function to work as intended, as I’ve detailed in my other comment to @mikmoore, I could not figure out why or how to stop sum() from acting up. As far as I can tell, the issue is the mere presence of the larger loop: even when setting chunk_size = t - 1, which should skip the while loop altogether, a lot of extra memory gets allocated.

Regarding your first suggestion, I am not sure how to proceed with that, as it seems to require meta-programming? :thinking:

Regarding the second variant, I tried a number of permutations of the let A = ... block, none of which helped at all.

function B_chunk(n = n, t = t; chunk_size)
    B = zeros(t)
    
    t_lo, t_hi = 2, chunk_size + 1
    A_previous = zeros(n)

    while t_hi < t

        let A = A_chunk(n; chunk_size, A_previous)
            for τ in t_lo:t_hi
                B[τ] = sum(@views(A[i, τ - t_lo + 1] * v[i]) for i in 1:n)
            end
            A_previous = @view A[:, end]
        end

        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end    

    let A = A_chunk(n; chunk_size = t - t_lo + 1, A_previous)
        for τ in t_lo:t
            B[τ] = sum(@views(A[i, τ - t_lo + 1] * v[i]) for i in 1:n)
        end
    end

    return B
end

Even with the let block, and without @threads, no significant change occurred.