Performance hit of allocation regardless of size

Hi there performance fanatics!
@amontoison and I are working on our matrix coloring code, and in one of our algorithms we need to split a vector into groups, one for each color. Semantically, this is what we want to achieve:

function split_into_color_groups(colors::Vector{Int})
    cmax = maximum(colors)
    groups = [Int[] for c in 1:cmax]
    for (i, c) in enumerate(colors)
        push!(groups[c], i)
    end
    return groups
end

And here’s an example:

julia> colors = rand(1:3, 10);

julia> colors'
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  3  1  2  2  2  2  2  1

julia> split_into_color_groups(colors)
3-element Vector{Vector{Int64}}:
 [1, 4, 10]
 [5, 6, 7, 8, 9]
 [2, 3]

Our goal is to save on the costs of memory allocations, both in terms of storage space and execution time. Alexis already made a first improvement which precomputes the correct size for each group to avoid push!, and now we’re left wondering whether to use

  • Option A: a vector of vectors, as above
  • Option B: a vector of views into the same underlying flat vector

Do you have any insights? Specifically, even though both options allocate the same number of bytes, is there a point in trying to reduce the number of allocations itself?

If you want to play around with benchmarks, here are the two optimized implementations.

function compute_group_sizes(colors::Vector{Int})
    cmax = maximum(colors)
    group_sizes = zeros(Int, cmax)
    for c in colors
        group_sizes[c] += 1
    end
    return group_sizes
end

function split_vecvec(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)
    # one inner vector per color, plus the outer vector
    groups = [Vector{Int}(undef, group_sizes[c]) for c in eachindex(group_sizes)]
    fill!(group_sizes, 0)
    # reuse group_sizes as a cursor for the next free slot in each group
    for (k, c) in enumerate(colors)
        group_sizes[c] += 1
        pos = group_sizes[c]
        groups[c][pos] = k
    end
    return groups
end

function split_vecview(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)  # allocation 1, size cmax
    group_offsets = cumsum(group_sizes)        # allocation 2, size cmax
    groups_flat = similar(colors)              # allocation 3, size n
    # scatter each index into its color's contiguous block of groups_flat,
    # using group_sizes as a countdown of remaining slots per color
    for (k, c) in enumerate(colors)
        i = group_offsets[c] - group_sizes[c] + 1
        groups_flat[i] = k
        group_sizes[c] -= 1
    end
    TV = typeof(view(groups_flat, 1:1))
    groups = Vector{TV}(undef, length(group_sizes))  # allocation 4, size cmax
    for c in eachindex(group_sizes)
        i = 1 + (c == 1 ? 0 : group_offsets[c - 1])
        j = group_offsets[c]
        groups[c] = view(groups_flat, i:j)
    end
    return groups
end

And here is a benchmark loop:

using BenchmarkTools

for n in 10 .^ (2, 3, 4, 5), cmax in (3, 10, 30, 100)
    yield()
    bench_vecvec = @benchmark split_vecvec(_colors) setup = (_colors = rand(1:($cmax), $n))
    bench_vecview = @benchmark split_vecview(_colors) setup = (_colors = rand(1:($cmax), $n))
    ratios = (
        time=minimum(bench_vecvec).time / minimum(bench_vecview).time,
        memory=minimum(bench_vecvec).memory / minimum(bench_vecview).memory,
        allocs=minimum(bench_vecvec).allocs / minimum(bench_vecview).allocs,
    )
    @info "Vecvec / vecview ratios - n=$n, cmax=$cmax" ratios.time ratios.memory ratios.allocs
end

Apparently, with a very low number of colors, option A is better, but then option B quickly takes over.

How many colors do you need? If there are only a few, maybe something like a Bool mask or bitset per color would be faster. I have to ask what you’re using this for, since that determines, among other things, whether getting Bool masks back is even acceptable.
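
For concreteness, here is a rough sketch of what I mean by a mask per color (split_bitmasks is just a made-up name, and this assumes your downstream code can consume masks instead of index vectors):

# Purely illustrative sketch (hypothetical name): one BitVector per color,
# where masks[c][i] says whether index i belongs to color c's group.
# Memory is cmax masks of n bits each, instead of cmax index vectors.
function split_bitmasks(colors::Vector{Int})
    cmax = maximum(colors)
    masks = [falses(length(colors)) for _ in 1:cmax]
    for (i, c) in enumerate(colors)
        masks[c][i] = true
    end
    return masks
end

The indices of group c can still be recovered with findall(masks[c]) when needed, so it is mostly a question of what the consumer of the groups prefers.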

Like… you could actually encode the grouping in contiguous vectors, turning

[1 3 3 1 2 2 2 2 2 1]
into

begin: [1, 5, 2]
next:  [4, 3, 11, 10, 6, 7, 8, 9, 11, 11]
where 11 means "end".

Of course, this would create a data dependency when iterating, but if you’re okay with that, this could be an option.
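
If it helps, here is a rough sketch of how that begin/next encoding could be built in one pass (split_linked, first_idx, next_idx, and last_idx are made-up names; n + 1 plays the role of the "end" sentinel):

# Illustrative sketch only: builds the begin/next linked-list encoding above.
function split_linked(colors::Vector{Int})
    n = length(colors)
    cmax = maximum(colors)
    sentinel = n + 1                    # stands for "end of chain"
    first_idx = fill(sentinel, cmax)    # the "begin" vector: first index of each color
    next_idx = fill(sentinel, n)        # the "next" vector: next index with the same color
    last_idx = zeros(Int, cmax)         # tail of each color's chain while building
    for (i, c) in enumerate(colors)
        if first_idx[c] == sentinel
            first_idx[c] = i            # first occurrence of this color
        else
            next_idx[last_idx[c]] = i   # append i to the color's chain
        end
        last_idx[c] = i
    end
    return first_idx, next_idx
end

Iterating group c then means following the chain (i = first_idx[c]; while i != sentinel; …; i = next_idx[i]; end), which is exactly the data dependency mentioned above.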

I can’t predict the number of colors ahead of time; it can be anything from 3 to a constant fraction (usually small, like 1/100) of the vector length.

So, you mean if the vector length is a billion, there could be over a million colors, right? I know it can be annoying when people ask all sorts of questions about the nature of the input data, but such is the nature of hardcore performance optimization. Like… for example, if you know that colors tend to repeat, maybe you could exploit that with some insane SIMD tricks.


Okay, maybe a constant fraction is not a good model. This is for sparse autodiff, so the number of colors is most likely constant, say between 2 and 1000, while the number of elements can go to a million or more.
In a way, option B looks like a bucket sort, but there’s no bucket sort implementation in Base Julia.
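
For the record, here is roughly what that hand-rolled bucket/counting sort would look like (counting_sortperm is just an illustrative name); the permutation it returns is exactly option B’s groups_flat, with group c occupying a contiguous block:

# Sketch of a counting-sort permutation for small key ranges (illustrative only).
function counting_sortperm(keys::Vector{Int}, kmax::Int)
    counts = zeros(Int, kmax)
    for k in keys
        counts[k] += 1
    end
    starts = ones(Int, kmax)            # starts[k]: where key k's block begins
    for k in 2:kmax
        starts[k] = starts[k - 1] + counts[k - 1]
    end
    perm = Vector{Int}(undef, length(keys))
    for (i, k) in enumerate(keys)
        perm[starts[k]] = i             # scatter index i into its key's block
        starts[k] += 1
    end
    return perm
end

Whether writing it out by hand like this actually beats a library sort probably depends on n and cmax.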

My admittedly limited understanding is that yes, a larger number of allocations can matter:

  1. Given everything else is somehow equal (it’s not: the algorithms and ratios.memory differ), each heap allocation takes time, and more so under severe memory fragmentation.
  2. Many small allocations let you free memory you no longer need, at the cost of fragmentation; at the other extreme, one big allocation is not freed until all references/views into it disappear.

Other factors can easily outweigh these: for large n (10^4, 10^5), the runtime is actually on par for me across the cmax values given. I imagine something else is scaling differently, but I can’t tell what from just looking.
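
If you want to see point 1 in isolation, a crude micro-benchmark along these lines can give a feel for the per-allocation overhead (many_small and one_big are throwaway names; both hold the same total payload in bytes, with very different allocation counts):

using BenchmarkTools

# k small buffers (k + 1 allocations counting the outer vector) vs. one big buffer,
# both holding k * m Ints in total.
many_small(k, m) = [Vector{Int}(undef, m) for _ in 1:k]
one_big(k, m) = Vector{Int}(undef, k * m)

@btime many_small(100, 100);
@btime one_big(100, 100);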

Is the minimum elapsed time the right benchmark statistic here? That’s a subset biased toward the runs with the fewest garbage collections.

Using the minimum makes sense in a zero-allocation setting because we have no control over that “noise”. Here, though, we can add or remove allocations and thus affect GC time.
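
If I remember the BenchmarkTools API correctly, one way to cross-check is to look at the median/mean and the reported GC time next to the minimum, e.g. (reusing split_vecvec from above; the 1:100 and 10^5 are just arbitrary example sizes):

using BenchmarkTools, Statistics

b = @benchmark split_vecvec(_colors) setup = (_colors = rand(1:100, 10^5))
minimum(b).time, median(b).time, mean(b).time   # estimated times in nanoseconds
mean(b).gctime                                  # average time spent in GC, in nanoseconds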

This is rather short and competitive on my machine (especially if n is large and cmax rather small):

function splitit(v)
    ord = sortperm(v)                 # indices of v, grouped by color (ties keep index order)
    tmp = view(v, ord)                # the colors themselves, in sorted order
    [view(ord, searchsorted(tmp, c))  # contiguous block of indices with color c
     for c in 1:tmp[end]]             # tmp[end] is the maximum color
end
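
For reference, a quick sanity check against the very first implementation in the thread (this relies on sortperm putting indices of equal elements in ascending order, which I believe it guarantees), using the example colors from the top:

julia> colors = [1, 3, 3, 1, 2, 2, 2, 2, 2, 1];

julia> splitit(colors) == split_into_color_groups(colors)
true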