Why is `collect()` faster than a for loop in this numerical scheme? (corrected!)

maxkapur · March 17, 2022, 2:33am

Consider a toy algorithm that finds the maximum of a vector, then removes it and alters the other entries of the vector in a way that depends on whether they are located at a lower or higher index.

^{(N.B. I posted a simpler version of this example a few days ago but made a very silly mistake in the benchmark.)}

In principle, I expect that using two for loops (one for the indices before the maximum, and one for those after) to updated the entries of the vector individually should be faster than creating a new vector from scratch (using collect and a ternary operator to check indices). However, this is not the case in the following example:

import Base.isless


# User-facing struct that holds the instance data
struct MyDataFrame
    m::Integer
    xs::Vector{<:Real}
    ys::Vector{<:Real}
    xys::Vector{<:Real}      # = xs .* ys
    omxs::Vector{<:Real}     # = 1 .- xs

    function MyDataFrame(xs::Vector{<:Real}, ys::Vector{<:Real})
        m = length(xs)

        return new(m, xs, ys, xs .* ys, 1 .- xs)
    end
end


# Struct used internally in manipulatedata()
struct DataPoint
    j::Int
    x::Real
    y::Real
    xy::Real     # = x * y
    omx::Real    # = 1 - x
end


# Overload isless() so that DataPoints sort on xy
isless(c1::DataPoint, c2::DataPoint) = isless(c1.xy, c2.xy)



function manipulatedata(mdf::MyDataFrame, coll = true::Bool)
    # Convert the user-supplied MyDataFrame into a vector of DataPoints
    datapoints = DataPoint[DataPoint(j, mdf.xs[j], mdf.ys[j], mdf.xys[j], mdf.omxs[j]) for j in 1:mdf.m]
    
    for _ in 1:mdf.m
        dp_best, idx_best = findmax(datapoints)
    
        if coll     # This way is faster
            deleteat!(datapoints, idx_best)

            datapoints[:] = collect(
                DataPoint(
                    c.j,
                    c.x,
                    c.j < dp_best.j ? c.y * dp_best.omx : c.y - dp_best.xy,
                    c.j < dp_best.j ? c.xy * dp_best.omx : c.xy - c.x * dp_best.xy,
                    c.omx
                )
                for c in datapoints
            )
        else         # This way is slower
            for i in 1:idx_best-1
                datapoints[i] =
                    DataPoint(
                        datapoints[i].j,
                        datapoints[i].x,
                        datapoints[i].y * dp_best.omx,
                        datapoints[i].xy * dp_best.omx,
                        datapoints[i].omx
                    )
            end
            for i in idx_best+1:length(datapoints)
                datapoints[i] =
                    DataPoint(
                        datapoints[i].j,
                        datapoints[i].x,
                        datapoints[i].y - dp_best.xy,
                        datapoints[i].xy - datapoints[i].x * dp_best.xy,
                        datapoints[i].omx
                    )
            end
        
            deleteat!(datapoints, idx_best)
        end
    end

    return nothing
end


function timer(m)
    rectlist = MyDataFrame(rand(m), rand(m))

    @time r1 = manipulatedata(rectlist, true)
    @time r2 = manipulatedata(rectlist, false)
    @assert r1 == r2
end

timer(2000)

Results:

1.387089 seconds (17.99 M allocations: 416.908 MiB, 6.05% gc time, 9.47% compilation time)
4.455283 seconds (49.99 M allocations: 1.667 GiB, 9.34% gc time)

Why might this be?

Note that all of the structs defined in the example above appear to be necessary to obtain this unusual performance result. If, for example, instead of taking a MyDataFrame as the argument to manipulatedata() you take the vectors x and y (and modify the line datapoints = ... accordingly), then the for loop becomes faster as expected.

Also, the for loop becomes faster if the eltype of the vector is a primitive like Float64 rather than the custom struct DataPoint (with its custom overload of isless()).

(The j field of DataPoint may appear superfluous, but it is essential in my real application because it serves as a consistent label that each data point retains even as the indices of datapoints vary.)

Henrique_Becker · March 17, 2022, 2:52am

I would suggest to:

Move the creation of datapoints to outside the function.
Break the function into two instead of using a Bool parameter and and if statement.
Use @code_* (where * may be llvm, lowered, native, or typed) to see if it compiles/lowers to something completely distinct.

Also, your loop variant allocates much more times and more memory in total, what is considerably strange to me. Do the difference go away if instead Real you use Float64 in the DataPoint fields? I would guess something related to boxing.

Henrique_Becker · March 17, 2022, 2:57am

hmmm, making the Real → Float64 replacement makes the measurements in my notebook go from:

julia> timer(2000)
  0.394244 seconds (17.84 M allocations: 410.193 MiB, 2.05% gc time)
  1.405240 seconds (49.94 M allocations: 1.665 GiB, 5.27% gc time)

to

julia> timer(2000)
  0.021393 seconds (33.05 k allocations: 77.432 MiB, 8.59% gc time)
  1.086315 seconds (55.96 M allocations: 1.756 GiB, 2.62% gc time)

There is something very strange with those allocations.

stevengj · March 17, 2022, 3:06am

Did you read the performance tips? See avoid fields with abstract type and avoid containers with abstract type parameters.

Until you fix this, everything will be so slow that there’s not much point in benchmarking different variations.

maxkapur · March 17, 2022, 3:06am

Indeed, it reduces the overall computation time, but the for loop is still slower. Weird, huh?

Re: your other suggestions, I’m not sure what you mean by moving datapoints to outside of the function. The larger package uses MyDataFrame as the main I/O interface, and there are many different functions that the user can call once she has her data in a MyDataFrame. Do you mean create a separate function that converts a MyDataFrame to a Vector{DataPoint}?

The bool switch is just for the purpose of this demonstration. In the real package I will simply delete the slow code! But you can confirm that the bool switch is not the cause of the problem by removing it and commenting/uncommenting the relevant code instead.

Henrique_Becker · March 17, 2022, 3:17am

The suggestions (1) and (2) are not an end in themselves, but just to allow for you to do step (3) and check the generated code. Sorry, I think I could have spelt that clearer.

mikkoku · March 17, 2022, 8:00am

Note that there are two abtract types used: Real and Integer. Making both of them concrete resulted in

julia> timer(2000)
  0.070830 seconds (3.59 k allocations: 76.506 MiB, 23.38% gc time)
  0.012317 seconds (2 allocations: 78.172 KiB)

GunnarFarneback · March 17, 2022, 8:48am

Some reflections:

Unreasonable (for the task at hand) amounts of allocations and for loops being considerably slower than vectorized alternatives are strong indications of type stability problems. As has been pointed out already, abstractly typed fields are a major issue in this area.
Learning to use BenchmarkTools (including the setup functionality to handle mutated input) is well invested time if you need to benchmark stuff.
The findmax/deleteat! approach doesn’t scale very well once you have fixed all other issues. It’s very application dependent how to improve it but generally you don’t want to do a full search (findmax) through your data in every iteration if you don’t absolutely have to and you don’t want to move lots of data around (deleteat!) in every iteration if you don’t absolutely have to. Sometimes the latter can be avoided by just swapping the selected element with the last element.

Henrique_Becker · March 17, 2022, 1:42pm

I also saw that in the code but, as someone with some love for algorithm complexity, it does not seem trivial to change the complexity. The code changes every element of the vector but the one to be removed at each iteration. And the removal just shifts half the array one position what is less effort than this. The search consider a field that changes differently for each element based on the other fields of the element (so I am not sure if you can easily predict ahead which are the next ones to be removed). So the code seems very strongly O(n²). Not deleting at the middle, but instead using the update loop to also do the shift, and instead deleting the last position, is a good improvement but will not change the overall complexity.

GunnarFarneback · March 17, 2022, 2:10pm

I didn’t analyze this example in detail as it wasn’t clear how close to the real problem it is, but at the very least the lower numbered ones are scaled uniformly with respect to the xy field and will keep their relative ordering. Hopefully there’s some geometric insight or similar in the original problem that allows a more efficient determination of the next point to eliminate.

maxkapur · March 18, 2022, 5:33am

Thank you everyone for all your feedback. The fact that experienced Julia users are willing to give such a detailed code review is one of my favorite things about this community.

I had read the performance tips a while ago, but at the time I was still new to structs in general, so that advice didn’t stick. I redid this example with Int and Float64 instead of abstract types and saw a huge improvement in the speed.

I have a few follow-up questions. First, does the advice to avoid abstract types in struct definitions apply to structs defined parametrically such as the following?

struct MyStruct{T<:Real}
    x::Vector{T}
    y::Vector{T}
end

In my mental model (which is often wrong), this is equivalent to defining

struct MyStruct{Float64}
    x::Vector{Float64}
    y::Vector{Float64}
end

struct MyStruct{Float32}
    x::Vector{Float32}
    y::Vector{Float32}
end

# ...

and so on for all the real numeric types. Can I expect it to achieve similar performance?

Second, regarding the use of deleteat!(), I assume that on a low-level this is equivalent to moving all the subsequent indices up by one. That is, if we call v = collect(1:5); deleteat!(v, 3), then julia will do
[1, 2, 3, 4, 5] -> [1, 2, 4, 5, undef] -> [1, 2, 4, 5]. Is that right?

If so, then deleteat!() is O(n), and as @Henrique_Becker mentioned this will not change the overall complexity of the algorithm. This doesn’t mean that swapping indices as @GunnarFarneback suggested won’t be faster, but it would require iterating through the whole array (as in the if coll ... loop above) and comparing c.js instead, so I think this is a “have to test it to know” situation.

The findmax()s are also an unavoidable bottleneck. I have another implementation of this that uses a BinaryMaxHeap from the DataStructures.jl package to cut down the time on this step, and if the keys are updated in the correct manner it is also O(n^2).

Finally, regarding BenchmarkTools, it’s a great tool. However, in this post I just wanted to create a minimal, verifiable example, and I think this is exactly what the @time macro is meant to be used for.

baggepinnen · March 18, 2022, 5:43am

Yes, the parametric definition is usually preferred since it’s more generic, and an isntance of MyStruct{Float64} is indeed equivalent to the definition where Float64 is hard coded.

The time macro is ment to time expressions, but with the limitation that it’s very unreliable and noisy for expressions that take a small amount of time to evaluate. From the documentation of @time

│ Note
│
│ For more serious benchmarking, consider the @btime macro from the BenchmarkTools.jl package which among other things evaluates the
│ function multiple times in order to reduce noise.

maxkapur · March 18, 2022, 5:58am

Thank you. Then another follow-up: What if I wrote this?

struct MyStruct{T<:Real, U<:Real}
    x::Vector{T}
    y::Vector{U}
end

By extrapolation, is this equivalent to hard-coding struct{T, U} for all pairs of T and U in the reals?

In my theoretician’s understanding, it seems like if Julia can optimize for the struct defined above, then it should be able to optimize

struct MyStruct
    x::Vector{<:Real}
    y::Vector{<:Real}
end

too by applying the same reasoning, no?

(Obviously this would only work for structs having a small number of fields, since the number of combinations of subtypes of Real grows exponentially, but this limit could be specified in the compiler. But I am not knowledgeable about compiler design, and it seems like one could reasonably argue that this is an edge case that isn’t worth considering.)

nilshg · March 18, 2022, 6:41am

This thread has been going for over a day without @DNF providing any guidance as to one should or should not be using collect!?

DNF · March 18, 2022, 6:49am

Well, sorry about that. But I thought at first this had something to do with Dataframes, and then I tend to zone out.

Maybe later.

DNF · March 18, 2022, 6:53am

No. The compiler only knows the type, not the values.

If it sees MyStruct{Float64}, it knows it will be working with vectors of Float64.

If it just sees MyStruct, it only knows it will be working with vectors of some subtype of Real. The actual values and actual element types are not known until runtime.

Henrique_Becker · March 18, 2022, 1:44pm

If you get the loop problem (probably a type-instability) solved, then an improvement that will not change the complexity, but will probably have some positive effect is to use the update loop to also shift the array and save the next element to be deleted. This way you only visit each element one time, instead of the current average two visits. If the vector do not fit in cache, this may be faster, if it fits and was using vectorized assignment, then it may not yield a good improvement.

Henrique_Becker · March 18, 2022, 2:50pm

I should have, but I could not contain myself. Can you try this code in your machine? Note I increased the size by 10, so the difference is less affected by noise.

import Base.isless
import Random


# User-facing struct that holds the instance data
struct MyDataFrame
    m::Int
    xs::Vector{Float64}
    ys::Vector{Float64}
    xys::Vector{Float64}      # = xs .* ys
    omxs::Vector{Float64}     # = 1 .- xs

    function MyDataFrame(xs::Vector{Float64}, ys::Vector{Float64})
        m = length(xs)

        return new(m, xs, ys, xs .* ys, 1 .- xs)
    end
end


# Struct used internally in manipulatedata()
struct DataPoint
    j::Int
    x::Float64
    y::Float64
    xy::Float64     # = x * y
    omx::Float64    # = 1 - x
end


# Overload isless() so that DataPoints sort on xy
isless(c1::DataPoint, c2::DataPoint) = isless(c1.xy, c2.xy)

function manipulatedata1(mdf::MyDataFrame)
    # Convert the user-supplied MyDataFrame into a vector of DataPoints
    datapoints = DataPoint[DataPoint(j, mdf.xs[j], mdf.ys[j], mdf.xys[j], mdf.omxs[j]) for j in 1:mdf.m]

    dp_best :: DataPoint, idx_best :: Int = findmax(datapoints)
    for _ in 1:mdf.m
        dp_best, idx_best = findmax(datapoints)
        #tmp = findmax(datapoints)
        #dp_best = first(tmp) :: DataPoint
        #idx_best = last(tmp) :: Int
        deleteat!(datapoints, idx_best)

        datapoints[:] = collect(
            DataPoint(
                c.j,
                c.x,
                c.j < dp_best.j ? c.y * dp_best.omx : c.y - dp_best.xy,
                c.j < dp_best.j ? c.xy * dp_best.omx : c.xy - c.x * dp_best.xy,
                c.omx
            )
            for c in datapoints
        )
    end

    return nothing
end

function manipulatedata2(mdf::MyDataFrame)
    # Convert the user-supplied MyDataFrame into a vector of DataPoints
    datapoints = DataPoint[DataPoint(j, mdf.xs[j], mdf.ys[j], mdf.xys[j], mdf.omxs[j]) for j in 1:mdf.m]

    dp_best :: DataPoint, idx_best :: Int = findmax(datapoints)
    for _ in 1:mdf.m
        datapoints[1] =
            DataPoint(
                datapoints[1].j,
                datapoints[1].x,
                datapoints[1].y * dp_best.omx,
                datapoints[1].xy * dp_best.omx,
                datapoints[1].omx
            )
        next_dp_best :: DataPoint = datapoints[1]
        next_idx_best :: Int = 1
        for i in 2:(idx_best-1)
            datapoints[i] =
                DataPoint(
                    datapoints[i].j,
                    datapoints[i].x,
                    datapoints[i].y * dp_best.omx,
                    datapoints[i].xy * dp_best.omx,
                    datapoints[i].omx
                )
            if isless(next_dp_best, datapoints[i])
                next_dp_best = datapoints[i]
                next_idx_best = i
            end
        end
        for i in idx_best+1:length(datapoints)
            datapoints[i-1] =
                DataPoint(
                    datapoints[i].j,
                    datapoints[i].x,
                    datapoints[i].y - dp_best.xy,
                    datapoints[i].xy - datapoints[i].x * dp_best.xy,
                    datapoints[i].omx
                )
            if isless(next_dp_best, datapoints[i - 1])
                next_dp_best = datapoints[i - 1]
                next_idx_best = i - 1
            end
        end

        dp_best = next_dp_best
        idx_best = next_idx_best
        pop!(datapoints)
    end

    return nothing
end

function timer(m)
    Random.seed!(0)
    rectlist = MyDataFrame(rand(m), rand(m))

    @time r1 = manipulatedata1(rectlist)
    @time r2 = manipulatedata2(rectlist)
    @assert r1 == r2
end

timer(20000)

rdeits · March 18, 2022, 3:22pm

Yup.

maxkapur:

In my theoretician’s understanding, it seems like if Julia can optimize for the struct defined above, then it should be able to optimize
struct MyStruct
    x::Vector{<:Real}
    y::Vector{<:Real}
end
too by applying the same reasoning, no?

Nope. The thing to think about is: if you have a particular struct, what can you tell me about it from just its type?

Using the first version, where we have:

struct MyStruct{T<:Real, U<:Real}
    x::Vector{T}
    y::Vector{U}
end

and we create a particular instance for some T and U, say T = Int, U = Float64, then the type of our struct is MyStruct{Int, Float64}. Given just that type information, we can ask:

What is the type of mystruct.x[1] ? It’s Int
What is the type of mystruct.y[1] ? It’s Float64

and so on for any operations you might do on mystruct.

Now let’s talk about the other definition,

struct MyStruct
    x::Vector{<:Real}
    y::Vector{<:Real}
end

and let’s again say we create a particular instance named mystruct. The entire type of that struct is just MyStruct, regardless of what you’ve put in x and y in that particular case. That means that, just from the type MyStruct, we can no longer answer questions like "What is the type of mystruct.x[1] because you could have lots of different MyStructs all with the same type but different types of values stored in x and y.

maxkapur · March 18, 2022, 11:43pm

@Henrique_Becker What a striking difference, especially in the memory allocations! I see what you have done in manipulatedata2().

  8.453391 seconds (59.59 k allocations: 7.453 GiB, 14.48% gc time)
  1.011364 seconds (2 allocations: 781.297 KiB)

This is a really slick optimization that would never have occurred to me. Soft question, but did you ever explicitly learn about these kinds of “tricks” for integrating a findmax with a for loop, or is this just an instinct you have acquired through experience?

@rdeits @DNF Thank you for your explanations. This makes sense.

Topic		Replies	Views
Why is this numerical scheme so much faster when using `collect()` than a for loop? Performance	4	542	March 17, 2022
Julia-ism for two-dimensional map? New to Julia	22	3180	May 1, 2018
Understanding benchmark results General Usage	12	2051	January 12, 2017
Julia style? Arrays/matrix vs structs General Usage	19	3973	December 18, 2019
Significant decrease in performance after seemingly irrelevant changes Performance	16	1241	June 29, 2019

Why is `collect()` faster than a for loop in this numerical scheme? (corrected!)

Related topics