Removing undesirable allocations in some functions

Greetings everyone,
Here I would like to use a specific example from a code of mine, to help me (and hopefully others) better understand how allocations work.

I’ve got a code which imports data (xyz coordinates of hydrogen particles) from a very large file, calculates the relative positions of each possible pair of particles, and then does some calculations.

After reading the basic performance tips a few times, I gather that to make code efficient, I should preallocate everything, avoid global variables, use functions, and then life should be good.
Thus, my idea is, if I preallocate some big arrays, and my functions just replace their values at each step, new allocations wouldn’t be created since new values just replace old ones.
Turns out some of my functions are actually doing that, while the others are allocating extra memory, and I can’t understand why.

Below is an MRE which captures the main points I want to make.

using LinearAlgebra
using CSV
using DataFrames

data = CSV.write("data.csv",
    DataFrame([collect(1:1000) ones(1000, 1) rand(1000, 3)], :auto))

# Preallocate everyhting 
# (shouldn't be worth paying attention to the details here)

positions::Matrix{Float64} = Matrix{Float64}(undef, 666, 3)
box::Matrix{Float64} = [-1/2 1/2; -1/2 1/2; -1/2 1/2]
Hpairs::Vector{Vector{Float64}} = [ones(Float64, 3) for _ in 1:floor(Int, 666 * (666 - 1) / 2)]
rtp = copy(Hpairs)
F::Vector{Vector{ComplexF64}} = [complex(ones(Float64, 3)) for _ in 1:floor(Int, 666 * (666 - 1) / 2)]


# Define functions

function read_H_positions!(positions::Matrix{Float64})

    # normally I'd only read a part of the file using CSV.read(....,skipto=_,limit=_)
    positions[:, :] = Matrix(CSV.read("data.csv", DataFrame)[Not(1:3:end), 3:end])

end

function getpairs!(Hpairs::Vector{Vector{Float64}}, positions::Matrix{Float64})
    x = 1
    for k in axes(positions, 1)
        for j in axes(positions, 1)
            j > k || continue
            Hpairs[x][:] = @views (positions[k, :] .- positions[j, :])
            x += 1
        end
    end
end

function periodicboundary!(Hpairs::Vector{Vector{Float64}}, box::Matrix{Float64})
    for i in 1:length(Hpairs)
        for l in 1:3
            if Hpairs[i][l] > box[l, 2]
                Hpairs[i][l] -= 2 .* box[l, 2]
            elseif Hpairs[i][l] < box[l, 1]
                Hpairs[i][l] -= 2 .* box[l, 1]
            end
        end
    end
end

function cart2sph!(rtp::Vector{Vector{Float64}}, xyz::Vector{Vector{Float64}})
    for i in eachindex(rtp)
        rtp[i][1] = norm(xyz[i], 2)
        rtp[i][2] = acos.(xyz[i][3] ./ norm(xyz[i], 2))
        rtp[i][3] = atan.(xyz[i][2], xyz[i][1])
    end
end

function F012!(F::Vector{Vector{ComplexF64}}, rtp::Vector{Vector{Float64}})
    for i in 1:length(rtp)
        F[i][:] = [(1 - 3 .* (rtp[i][2]) .^ 2) ./ rtp[i][1] .^ 3,
            (sin.(rtp[i][2]) .* cos(rtp[i][2]) .* exp(-1im .* rtp[i][3])) ./ rtp[i][1] .^ 3,
            (sin.(rtp[i][2]) .^ 2 .* exp.(-2im .* rtp[i][3])) ./ rtp[i][1] .^ 3]
    end
end

#Run functions

@time read_H_positions!(positions); 
@time getpairs!(Hpairs, positions);
@time periodicboundary!(Hpairs, box);
@time cart2sph!(rtp, Hpairs);
@time F012!(F, rtp);

It seems that periodicboundary!() and cart2sph!() functions work as intended, and they don’t generate unwanted allocations, whereas the other functions are a bit naughty. Why is that happening, and how would you deal with this?

The “Run functions” section woulld normally be in a big loop, so after a few repetitions allocations become too much, and the program pretty much never finishes.

I’m sure there are many other mistakes and pitfalls in this code, and I’d be very grateful if you could point them out, but the essence is, why the exra allocations??

Do these allocations also show up if you use @btime from BenchmarkTools.jl instead of @time?

I.e. do:

using BenchmarkTools

@btime read_H_positions!($positions); 
@btime getpairs!($Hpairs, $positions);
@btime periodicboundary!($Hpairs, $box);
@btime cart2sph!($rtp, $Hpairs);
@btime F012!($F, $rtp);

Also I notice a lot of broadcasting operations (with the dot .) but all your elements seem to be scalar. So maybe try and remove the unnecessary dots :slight_smile:

Hello @abraemer,

Yes, @btime shows the same extra allocations.

As for the dots, I read somewhere that they sometimes help with peformance, so I added them, but didn’t see any difference.

These dots are broadcasting operations across arrays, i.e. performing some action on every element. I’d recommend you to read the manual page

As for more performance there is this blog post going into more detail about the performance gain broadcasting can offer:

As for the allocations, I’m on mobile right now and can’t check myself but I’m sure someone else will comment on that soon :slight_smile:

2 Likes

I’m generally aware of the basics of broadcasting, but thank you for the reading material!

A Matrix is created here.

This array has to be allocated, of course.

Instead of doing this:

Rather try something like Hpairs[x][:] .= positions[k, :] .- positions[j, :] or @. Hpairs[x][:] = positions[k, :] - positions[j, :]. See the relevant section in the performance tips: Performance Tips · The Julia Language

Doesn’t allocate, for me at least.

Same.

An array is constructed here, unnecessarily. There’s also lots of code duplication.

1 Like

Thank you very much for the answer @nsajko.
Yes, periodicboundary and cart2sph work fine, the main point of the post was to find out why these behave nicely while the others do not.

For the getpairs!() function, adding the dot before the equal sign does the job indeed, thanks!

As for the read_H_positions!(), using the DataFrame directly without conversion still uses allocations, would it be possible to remove these by any chance? Also, using a dataframe for the “positions” gives me trouble in the getpairs!() function.

Finally, for F012()!, removing the array works as well!

function F012!(F::Vector{Vector{ComplexF64}}, rtp::Vector{Vector{Float64}})
    for i in 1:length(rtp)
        F[i][1] = complex(1 - 3 .* (rtp[i][2]) .^ 2 ./ rtp[i][1] .^ 3)
        F[i][2] = (sin.(rtp[i][2]) .* cos(rtp[i][2]) .* exp(-1im .* rtp[i][3])) ./ rtp[i][1] .^ 3
        F[i][3] = (sin.(rtp[i][2]) .^ 2 .* exp.(-2im .* rtp[i][3])) ./ rtp[i][1] .^ 3
    end
end

Another thing that baffles me is, why does the “=” have to be broadcasted in the getpairs!() function, but not in periodicboundary!(), cart2sph!() and F012!()??

1 Like

Just a note that putting a lot of data into the global namespace like this is not recommended practice. My recommendation would be to put these into a driver function.

Let’s study this function. The positions matrix requires 16 KiB. read_H_positions! allocates 300 KiB.

julia> @allocated positions::Matrix{Float64} = Matrix{Float64}(undef, 666, 3)
16128

julia> @allocated read_H_positions!(positions)
300064

We can examine how your code is lowered into simpler calls using the macro @code_lowered.

julia> @code_lowered read_H_positions!(positions)
CodeInfo(
1 ─ %1 = CSV.read
│   %2 = (%1)("data.csv", Main.DataFrame)
│   %3 = Base.lastindex(%2, 1)
│   %4 = 1:3:%3
│   %5 = Main.Not(%4)
│   %6 = Base.lastindex(%2, 2)
│   %7 = 3:%6
│   %8 = Base.getindex(%2, %5, %7)
│   %9 = Main.Matrix(%8)
│        Base.setindex!(positions, %9, Main.:(:), Main.:(:))
└──      return %9
)

The initial allocation occurs when creating %2

│   %2 = (%1)("data.csv", Main.DataFrame)

You then allocate a temporary DataFrame to create %8 when indexing.

│   %8 = Base.getindex(%2, %5, %7)

You then reallocate the data in %9

│   %9 = Main.Matrix(%8)

We can eliminate these intermediates to reduce the number of allocations. Before proceeding let’s check how many allocations does CSV.read when it does not store anything.

julia> donothing(x) = nothing
donothing (generic function with 1 method)

julia> @time CSV.read("data.csv", donothing)
  0.002728 seconds (1.10 k allocations: 76.531 KiB)

Doing nothing, CSV.read itself allocates 77 Kib. That’s our floor while still using CSV.read.

We can reduce allocations slightly here by using Tables.matrix as a sink and then using a view.

julia> function read_H_positions_less_allocations!(positions::Matrix{Float64})
           raw_data = CSV.read("data.csv", Tables.matrix; select=3:5)
           positions[1:2:end,:] .= @view raw_data[2:3:end,:]
           positions[2:2:end,:] .= @view raw_data[3:3:end,:]
       end
read_H_positions_less_allocations! (generic function with 1 method)

julia> @time read_H_positions_less_allocations!(positions);
  0.001099 seconds (1.10 k allocations: 82.000 KiB)

Note the use of the .= operator to do in-place assignment.

We can do slightly better here by writing a custom sink for CSV.read.

julia> using Tables

julia> function custom_positions_sink!(table)
           for (i,col) in enumerate(Tables.Columns(table))
               if i > 2
                   positions[1:2:end,i-2] .= @view col[2:3:end]
                   positions[2:2:end,i-2] .= @view col[3:3:end]
               end
           end
       end
custom_positions_sink! (generic function with 1 method)

julia> @time CSV.read("data.csv", custom_positions_sink!);
  0.001095 seconds (1.14 k allocations: 78.312 KiB)
4 Likes

Thanks a lot for the detailed comments, @mkitti.

The preallocated arrays would be in the “loop function” in the full code indeed, I just wanted to make it simple here.

I was desperately looking for alternative “sinks” to use in the CSV.read function, but I couldn’t find anything!
Would there be any place to read about how these sinks work?
Everywhere I look, DataFrame seems to be the only sink option!
Would it be possible for example to read the data directly in a vector of vectors??

1 Like

From Home · CSV.jl

  • CSV.read: a convenience function identical to CSV.File, but used when a CSV.File will be passed directly to a sink function, like a DataFrame. In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File’s columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made. Keyword arguments are identical to CSV.File. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like CSV.File, a vector of data inputs can be passed as the 1st argument, which will result in a single “long” table of all the inputs vertically concatenated. Each input must have identical schemas (column names and types).

Essentially what that means if that you need to understand the Tables.jl interface.

https://tables.juliadata.org/stable/

Essentially all that is happening is Tables.CopiedColumns(CSV.File(source; kwargs...)) |> sink. I found that by doing @less CSV.read("data.csv", x->nothing).

2 Likes

Brilliant, I’ll have a closer look at Tables, thanks again!

This still allocates a vector for the right hand side, before assigning it to the left. You could remove the allocation by using .=, which fuses the loop on the right-hand side with the assignment loop on the left. (Of course, your positions[j, :] access is in the wrong order for locality too.)

What are the dots for here? Dots do nothing if everything is a scalar. (They aren’t a magic “make it faster” annotation.) Read the more dots blog post for a deeper dive into dots in Julia.

Note also that these argument-type declarations do nothing for performance. They just limit your functions applicability (e.g. you can’t run your code in single precision). See the manual on argument-type declarations. If you wanted declarations for documentation purposes, a more general type to accept would be AbstractVector{<:AbstractVector{<:Real}}.

You are also missing an important performance tip here: consider StaticArrays for small fixed-sized vectors. 3-component position vectors are the ideal case for StaticArrays.jl. You know in advance that all your vectors have three components, and it is typically both faster and more convenient to work with a Vector{SVector{3,Float64}} than a Vector{Vector{Float64}}. Faster, because (e.g.) the former are stored contiguously in memory as 3 numbers after another, whereas the latter are an array of pointer to Vector{Float64} objects. More convenient, because then you can eliminate most of your for i = 1:3 loops and just do vector operations, without them being allocating.

2 Likes

Thank you for the insights, @stevengj.

Would you please be able to elaborate on the following?

As for the static arrays, I tried implementing them at the beginning and I ran into some issues. If I remember correctly, they were effectively immutable due to how I defined them. I thought I would try to get rid of the more annoying allocation issues before looking into using SVectors again. Thank you for the suggestion anyway!

See Access arrays in memory order, along columns

1 Like

Yes, they are immutable — but this means you just treat them like numbers. You don’t try to change the value of 2.0, but you can change the number a variable is assigned to.

For example, if you have an array v of SVectors and you want to change v[i] to a new vector 2*v[i], you can just do:

v[i] = 2*v[i]

Or, if you want change v[i] to a new vector with the 3rd component set to zero, you do (e.g.):

v[i] = @SVector[v[i][1], v[i][2], 0]

This will be efficient — nothing gets allocated on the heap under the hood.

2 Likes

For convenience, there is also StaticArrays.setindex (without a trailing !):

v[i] = setindex(v[i], 0, 3) 
4 Likes