Can I decrease allocations here?

I’m trying to further optimize performance-critical function that locates values on a grid. Here I have tried using that the values will be sorted. And the bulk of the code is fast and has only 3 allocations, but then the final statement hits me with 2131 allocations :exploding_head: Could you help me minimize them?

function locate_vector(
    values::AbstractVector{<:Real},
    gridpoints;
    locate_below::Bool=false,
    locate_above::Bool=false,
)
    function incrementalsearch(low, value, gridpoints)
        while value >= gridpoints[low + 1]
            low += 1
        end

        return low
    end

    indices = ones(Integer, size(values))
    weights = Vector{Tuple}(undef, size(values))

    low = 1
    high = length(gridpoints) - 1

    for ix in eachindex(values)
        if values[ix] <= gridpoints[1]
            # indices[ix] = low = 1  # No need as indices and low are initialized to 1

            if !locate_below
                weights[ix] = (1.0, 0.0)
            end
        elseif values[ix] >= gridpoints[end]
            indices[ix:end] .= length(gridpoints) - 1

            if !locate_above
                weights[ix:end] .= repeat((0.0, 1.0), length(weights[ix:end]))
            end

            break
        elseif low == high
            indices[ix] = high
        else
            indices[ix] = low = incrementalsearch(low, values[ix], gridpoints)
        end
    end

    no_weights = filter(ix -> !isassigned(weights, ix), 1:length(weights))
    weights[no_weights] .= map(  # Crazy number of allocations
        ix -> (
            gridpoints[indices[ix] .+ 1] .- values[ix],
            values[ix] .- gridpoints[indices[ix]]
        ) ./ (
            gridpoints[indices[ix] .+ 1]
            .- gridpoints[indices[ix]]
        ),
        no_weights,
    )

    return indices, weights
end

using BenchmarkTools
gridpoints = collect(range(0.1, 0.9, 100));
values = sort(rand(101));
@btime locate_vector(values, gridpoints; locate_above=true);
# 256.400 μs (2134 allocations: 63.00 KiB)

You could use a comprehension (lazy) instead of map (eager constructing a vector) like so

    weights[no_weights] .= ((
            gridpoints[indices[ix] .+ 1] .- values[ix],
            values[ix] .- gridpoints[indices[ix]]
        ) ./ (
            gridpoints[indices[ix] .+ 1]
            .- gridpoints[indices[ix]]
        ) for ix in no_weights)

That helped, but only by very, very little—allocations dropped by 6 :thinking:

Confirmed (checked that with profiler). Go back to ‘map’ and try

    weights = Vector{Tuple{Float64, Float64}}(undef, size(values))

This looks slightly better here. (Vector{Tuple} means implicitly Vector{Tuple{Any...}} and this is no good).

2 Likes

Afterwards this

    for ix in no_weights
        weights[ix] =
            (
                gridpoints[indices[ix] .+ 1] .- values[ix],
                values[ix] .- gridpoints[indices[ix]]
            ) ./ (
                gridpoints[indices[ix] .+ 1]
                .- gridpoints[indices[ix]]
            )
    end

eliminates the last dynamic dispatch yielding

  770.635 ns (5 allocations: 3.61 KiB)
2 Likes

Great! Thank you!

Noob question but how did you check that with the profiler?

1 Like

I’m still using Atom/Juno for this with a script like

using Profile
Profile.clear()
gridpoints = collect(range(0.1, 0.9, 100));
values = sort(rand(101));
locate_vector(values, gridpoints; locate_above=true);
@profile for i in 1:1000; locate_vector(values, gridpoints; locate_above=true); end
# Profile.print()
Juno.profiler() 

Here is before

and here is after

Color coding: yellow is dynamic dispatch, red is allocation…

2 Likes

I understand that you fixed Vector{Tuple} to get a concrete element type, right? But did you also fix indices? Integer is an abstract type, you should use

indices = ones(Int, size(values))

instead.

2 Likes

Wait a minute! Did you check the correctness of the results?

Changing from this:

weights = Vector{Tuple}(undef, size(values))

to

weights = Vector{Tuple{Float64, Float64}}(undef, size(values))

may have some unintended consequences. Later in the code @fredrikpaues uses !isassigned(weights, ix), but that makes no sense for the latter, because of this:

1.7.2> v1 = Vector{Tuple}(undef, 3)
3-element Vector{Tuple}:
 #undef
 #undef
 #undef

1.7.2> isassigned(v1, 1)
false

1.7.2> v2 = Vector{NTuple{2,Float64}}(undef, 3)
3-element Vector{Tuple{Float64, Float64}}:
 (2.5e-323, 3.5e-323)
 (4.0e-323, 4.4e-323)
 (5.0e-323, 5.4e-323)

1.7.2> isassigned(v2, 1)
true

There is no #undef value for bitsvalues like these, and isassigned is always true.

I’m not 100% sure if the last code version still has isassigned because the image was very small and blurry, but if so, the algorithm should be redesigned, without relying on isassigned, which is not a great idea.

One way is to initialize like this:

weights = fill((NaN, NaN), size(values))

and check for isnan instead of isassigned. Also, I wouldn’t create no_weights first, and then iterate over them. Just loop over weights and check for NaN in each iteration.

After doing this, there are only 2 allocations, one for indices and one for weights, as expected.

(Also, what’s up with all the broadcasting dots in that loop, are they needed?)

2 Likes

Oh, and this one:

This needlessly creates, not just one, but two arrays on the right side. Firstly, length(weights[ix:end]) allocates a whole array, just to measure its length. And then repeat creates another one.

But there’s no reason to do this. Just assign the same value to all elements on the left. You’ll have to wrap the right hand side in a Ref though, to avoid broadcasting:

weights[ix:end] .= Ref((0.0, 1.0))
4 Likes

@DNF: thanks for the code review. You are right on all counts, especially the unintended change in results! @tbeason : thanks!

For future reference and @fredrikpaues : here are the missing pieces

function locate_vector1(
    values::AbstractVector{<:Real},
    gridpoints;
    locate_below::Bool=false,
    locate_above::Bool=false,
)
    function incrementalsearch(low, value, gridpoints)
        while value >= gridpoints[low + 1]
            low += 1
        end

        return low
    end

    indices = ones(Integer, size(values))
    weights = Vector{Tuple}(undef, size(values))

    low = 1
    high = length(gridpoints) - 1

    for ix in eachindex(values)
        if values[ix] <= gridpoints[1]
            # indices[ix] = low = 1  # No need as indices and low are initialized to 1

            if !locate_below
                weights[ix] = (1.0, 0.0)
            end
        elseif values[ix] >= gridpoints[end]
            indices[ix:end] .= length(gridpoints) - 1

            if !locate_above
                weights[ix:end] .= Ref((0.0, 1.0))
            end

            break
        elseif low == high
            indices[ix] = high
        else
            indices[ix] = low = incrementalsearch(low, values[ix], gridpoints)
        end
    end

    no_weights = filter(ix -> !isassigned(weights, ix), 1:length(weights))
    weights[no_weights] .= map(  # Crazy number of allocations
        ix -> (
            gridpoints[indices[ix] .+ 1] .- values[ix],
            values[ix] .- gridpoints[indices[ix]]
        ) ./ (
            gridpoints[indices[ix] .+ 1]
            .- gridpoints[indices[ix]]
        ),
        no_weights,
    )

    return indices, weights
end

function locate_vector2(
    values::AbstractVector{<:Real},
    gridpoints;
    locate_below::Bool=false,
    locate_above::Bool=false,
)
    function incrementalsearch(low, value, gridpoints)
        while value >= gridpoints[low + 1]
            low += 1
        end

        return low
    end

    indices = ones(Int, size(values))
    weights = Vector{Union{Nothing, Tuple{Float64, Float64}}}(nothing, size(values))

    low = 1
    high = length(gridpoints) - 1

    for ix in eachindex(values)
        if values[ix] <= gridpoints[1]
            # indices[ix] = low = 1  # No need as indices and low are initialized to 1

            if !locate_below
                weights[ix] = (1.0, 0.0)
            end
        elseif values[ix] >= gridpoints[end]
            indices[ix:end] .= length(gridpoints) - 1

            if !locate_above
                weights[ix:end] .= Ref((0.0, 1.0))
            end

            break
        elseif low == high
            indices[ix] = high
        else
            indices[ix] = low = incrementalsearch(low, values[ix], gridpoints)
        end
    end

    no_weights = filter(ix -> weights[ix] === nothing, 1:length(weights))
    for ix in no_weights
        weights[ix] =
            (
                gridpoints[indices[ix] .+ 1] .- values[ix],
                values[ix] .- gridpoints[indices[ix]]
            ) ./ (
                gridpoints[indices[ix] .+ 1]
                .- gridpoints[indices[ix]]
            )
    end

    return indices, weights
end

gridpoints = collect(range(0.1, 0.9, 100));
values = sort(rand(101));
result1 = locate_vector1(values, gridpoints; locate_above=false);
result2 = locate_vector2(values, gridpoints; locate_above=false);
@assert(result1 == result2)
result1 = locate_vector1(values, gridpoints; locate_above=true);
result2 = locate_vector2(values, gridpoints; locate_above=true);
@assert(result1 == result2)

using BenchmarkTools
gridpoints = collect(range(0.1, 0.9, 100));
values = sort(rand(101));
@btime locate_vector1($values, $gridpoints; locate_above=false); 
@btime locate_vector1($values, $gridpoints; locate_above=true); 
@btime locate_vector2($values, $gridpoints; locate_above=false); 
@btime locate_vector2($values, $gridpoints; locate_above=true); 

yielding

  110.000 μs (1429 allocations: 44.39 KiB)
  144.400 μs (1627 allocations: 50.19 KiB)
  873.077 ns (4 allocations: 4.17 KiB)
  892.000 ns (4 allocations: 4.23 KiB)
1 Like

Don’t forget to interpolate values in @btime

julia> @btime locate_vector2(values, gridpoints; locate_above=true);
  677.914 ns (5 allocations: 4.34 KiB)

julia> @btime locate_vector2($values, $gridpoints; locate_above=$true);
  633.136 ns (4 allocations: 4.31 KiB)
2 Likes

Actually, here’s a version with just 2 allocations:

function locate_vector3(
    values::AbstractVector{<:Real},
    gridpoints;
    locate_below::Bool=false,
    locate_above::Bool=false,
)
    function incrementalsearch(low, value, gridpoints)
        while value >= gridpoints[low + 1]
            low += 1
        end
        return low
    end

    indices = fill(1, size(values))
    weights = fill((NaN, NaN), size(values))

    low = 1
    high = length(gridpoints) - 1

    for ix in eachindex(values)
        if values[ix] <= gridpoints[1]
            if !locate_below
                weights[ix] = (1.0, 0.0)
            end
        elseif values[ix] >= gridpoints[end]
            indices[ix:end] .= length(gridpoints) - 1

            if !locate_above
                weights[ix:end] .= Ref((0.0, 1.0))
            end

            break
        elseif low == high
            indices[ix] = high
        else
            indices[ix] = low = incrementalsearch(low, values[ix], gridpoints)
        end
    end

    for i in eachindex(weights)
        w = weights[i]
        if isnan(w[1])
            grid1 = gridpoints[indices[i]]
            grid2 = gridpoints[indices[i]+1]
            val = values[i]
            weights[i] = ((grid2 - val) / (grid2 - grid1),
                          (val - grid1) / (grid2 - grid1))
        end
    end
    return indices, weights
end
1.7.2> @btime locate_vector3($values, $gridpoints; locate_above=true);
  575.275 ns (2 allocations: 2.64 KiB)
2 Likes

By the way:

Are both gridpoints and values expected to be sorted, and even possibly regularly spaced? You don’t seem to exploit that. Your search function is woefully suboptimal.

1 Like

Thank you! Both are sorted. None regularly spaced. How is the search function suboptimal? (I have another search function which uses binary search, but I just included one of them here.)

I would have thought that binary search was better, but looking closer, it seems the current version might not be too bad.

1 Like

I imagine some kind of merge variation could help?

Along the lines of

    iv = 1
    ig = 1
    while iv <= length(values) && ig <= length(gridpoints) 
        if values[iv] < gridpoints[ig]
            # do something
            iv += 1
        else if values[iv] == gridpoints[ig]
            # do something
            iv += 1
            ig += 1
        else
            # do something
            ig += 1
        end
    end
    while iv <= length(values)
        # do something
        iv += 1
    end

By the way, I just learned that eachindex can take multiple inputs. Is there a good reason (not) to write èachindex(values, indices, weights) here? I’m thinking that it might be a good idea, just in case values isn’t 1-based.