Optimizing code with array assignments

TL;DR

I have some performance critical code that I am trying to optimize as it will otherwise take unacceptably long to run. Testing seems to suggest that the main culprit is assignments to arrays rather than the computation itself, so I am trying to figure out how best to optimize this.

Maybe useful to know

This is ultimately being run in parallel using MPI.jl. That said, the problem I’m trying to solve here does not require any MPI communication, so it can be treated as a single-threaded problem.

I do not have the ability to change the data I’m processing as it isn’t produced by me, but I can change how it’s represented in Julia if there are gains to be made there.

What I’m trying to do

In essence, I have two batches of data that I’ll call A and B. Each batch consists of four vectors of equal length: three coordinate vectors (X, Y, Z) and the values (F) at those coordinates. While all the vectors within a batch are the same length, the two batches are not necessarily the same length as each other.

For each element of F in A, I want to calculate a value with every element of F in B. At the same time, I also calculate the distance between the two points along each axis (dx, dy, dz). I then accumulate the result into a vector, with the index determined by the distance between the two points along that axis; there is one result vector per axis.

In Julia code, this would look something like:

# Note: This is not the actual code so there might be some syntax errors here
struct MyData{T <: AbstractFloat}
    X::Vector{T}
    Y::Vector{T}
    Z::Vector{T}
    F::Vector{T}
end

function do_calculation!( outX::Vector{Float64}, outY::Vector{Float64}, outZ::Vector{Float64}, inA::MyData, inB::MyData, stepX::Float64, stepY::Float64, stepZ::Float64 )
    for (ax, ay, az, af) in zip( inA.:X, inA.:Y, inA.:Z, inA.:F )
        for (bx, by, bz, bf) in zip( inB.:X, inB.:Y, inB.:Z, inB.:F )
            idx_x = round( Int, ( ax - bx ) / stepX ) + 1
            idx_y = round( Int, ( ay - by ) / stepY ) + 1 # same as x, with the y coordinate
            idx_z = round( Int, ( az - bz ) / stepZ ) + 1 # same as x, with the z coordinate
            value = calc_value( af, bf ) # This is essentially just a multiplication
            outX[idx_x] += value
            outY[idx_y] += value
            outZ[idx_z] += value
        end
    end
end
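
For a sense of scale, a hypothetical call site might look like the following. The sizes, the stand-in calc_value, and the output lengths derived from the coordinate extents are all made up for illustration, not part of my actual setup:

# Hypothetical usage sketch; calc_value stands in for the real function,
# which is essentially a multiplication.
calc_value( af, bf ) = af * bf

n_a, n_b = 1_000, 2_000
# A's coordinates are offset so that ax - bx > 0 and the computed index is >= 1.
A = MyData( 1.0 .+ rand(n_a), 1.0 .+ rand(n_a), 1.0 .+ rand(n_a), rand(n_a) )
B = MyData( rand(n_b), rand(n_b), rand(n_b), rand(n_b) )
stepX = stepY = stepZ = 0.01

# Output lengths chosen so the largest possible index fits.
nX = round( Int, ( maximum(A.X) - minimum(B.X) ) / stepX ) + 1
nY = round( Int, ( maximum(A.Y) - minimum(B.Y) ) / stepY ) + 1
nZ = round( Int, ( maximum(A.Z) - minimum(B.Z) ) / stepZ ) + 1

outX, outY, outZ = zeros(nX), zeros(nY), zeros(nZ)
do_calculation!( outX, outY, outZ, A, B, stepX, stepY, stepZ )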

Running @time do_calculation! with the Y and Z assignments commented out takes something like 10 seconds for my sample dataset. Uncommenting Y increases that to 20 seconds, and with all three components it’s something like 30 seconds. I have tested this both by assigning to each vector using its corresponding idx_*, and by commenting out idx_y and idx_z and indexing all three vectors with idx_x. This leads me to conclude that the primary driver of the run time is the array assignments, rather than the calculation of idx_* or value.

There are no unexpected allocations or GC time indicated by @time.

It isn’t surprising that extra stores take extra time, but I was a little surprised to see the runtime dominated by the assignments. I’m curious whether there are ways I can speed this up.

I am not entirely sure how to test this in Julia, but your bottleneck may be the cache. If you very frequently access “random” positions in a sufficiently large array, then the time of the computation will be dominated by the time necessary to get this information from RAM and copy it to the processor cache. In C/C++, I think I used the valgrind tool to monitor the number of cache/page misses.
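
A crude way to at least see the effect from within Julia might be something like this (a sketch, not a proper cache profiler):

# Crude cache probe (sketch): the same number of scatter-adds into a small
# array (fits in cache) vs. a large one (doesn't).
function scatter_add!( out, idx, vals )
    @inbounds for i in eachindex(idx, vals)
        out[idx[i]] += vals[i]
    end
end

n_ops = 10^7
vals  = rand(n_ops)
small = zeros(10^3)              # ~8 KB, sits comfortably in L1/L2 cache
large = zeros(10^7)              # ~80 MB, well beyond typical L3 cache
idx_s = rand(1:10^3, n_ops)
idx_l = rand(1:10^7, n_ops)

@time scatter_add!( small, idx_s, vals )   # run twice to exclude compilation
@time scatter_add!( large, idx_l, vals )

On Linux, running julia under perf stat -e cache-misses,cache-references should also give actual miss counts.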


That’s a good point. The more-or-less random assignments to the arrays were something I’ve wondered about too. Maybe a potential optimization would be to make the assignments more regular (by sorting the inputs or something like that), or to reduce the size of the array I’m assigning to?
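
For the sorting idea, I imagine something like this (just a sketch; sorting B by one coordinate so that at least the outX writes become more sequential):

# Sketch: reorder B by X so the outX indices (driven by ax - bx)
# move monotonically within the inner loop. Only helps the X axis.
function sort_by_x( d::MyData )
    p = sortperm( d.X )
    return MyData( d.X[p], d.Y[p], d.Z[p], d.F[p] )
end

B_sorted = sort_by_x( B )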


I guess you could test that hypothesis by pushing value together with the (idx_x, idx_y, idx_z) triple into two preallocated vectors (use sizehint! to avoid reallocations inside the loop), and then shift the construction of outX, outY, outZ to a later stage.
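
In code, the idea would look roughly like this (untested sketch, reusing the names from your function):

# Two-phase sketch: collect (value, index triple) pairs first, then do all
# the scattered writes in a separate assembly pass.
vals = Float64[]
idxs = NTuple{3,Int}[]
# Note: these buffers are length(inA.F) * length(inB.F), which may be large.
sizehint!( vals, length(inA.F) * length(inB.F) )
sizehint!( idxs, length(inA.F) * length(inB.F) )

# Phase 1, inside the double loop, instead of the three += assignments:
#     push!( vals, value )
#     push!( idxs, (idx_x, idx_y, idx_z) )

# Phase 2, a separate assembly pass:
for (v, (ix, iy, iz)) in zip( vals, idxs )
    outX[ix] += v
    outY[iy] += v
    outZ[iz] += v
end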

EDIT: I overlooked that you add value to different indices in outX, outY, outZ on every iteration, so I guess doing what I suggested would just shift the memory access costs to the assembly stage…


What happens if you make three separate loops, one for outX, one for outY, and one for outZ? That way Julia only has to deal with one of these vectors at a time.
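
Roughly like this (just a sketch, reusing calc_value and the names from your snippet):

# Sketch: handle one axis per pass, so only one output vector competes for
# cache at a time (calc_value gets recomputed once per axis, though).
function do_axis!( out::Vector{T}, ac::Vector{T}, bc::Vector{T},
                   af::Vector{T}, bf::Vector{T}, step::T ) where { T <: AbstractFloat }
    for i in eachindex(ac, af), j in eachindex(bc, bf)
        out[ round( Int, ( ac[i] - bc[j] ) / step ) + 1 ] += calc_value( af[i], bf[j] )
    end
end

do_axis!( outX, inA.X, inB.X, inA.F, inB.F, stepX )
do_axis!( outY, inA.Y, inB.Y, inA.F, inB.F, stepY )
do_axis!( outZ, inA.Z, inB.Z, inA.F, inB.F, stepZ )

The trade-off is that calc_value is evaluated once per axis instead of once per pair, but if the stores dominate, that might still come out ahead.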


I haven’t tried exactly this, but an earlier iteration of this code effectively replaced the inner loop with its vectorized equivalent. Something like

function assign_values_to_array!( output, in_idx, in_value )
    for (idx, value) in zip( in_idx, in_value )
        output[idx] = value
    end
end

# Inside of the for(ax, ay ...) loop...
idx_x = round.( Int, abs.(ax .- inB.:X) ./ stepX ) .+ 1
value = calc_value.( af, inB.:F )
assign_values_to_array!( outX, idx_x, value )
# And similarly for Y and Z

and as I recall, this approach took about the same amount of time, but it required additional allocations and GC time that my current approach avoids, which was the motivation for the change.

I’ll definitely try restructuring this function to work on one dimension at a time rather than all 3 and see where that lands me. That would also make the function more amenable to sorting inputs which might help with paging issues if that turns out to be the problem.

What’s up with the : in field access? You can just write inA.X, etc.

Such allocations can be fine if they are done just once (and not inside the loop, where they cause repeated GC pauses), which I think is the case in your example, right?

And am I wrong, or couldn’t assign_values_to_array! be just written as (also without allocations)

outX[idx_x] .= value

There have already been some ideas floating around here.
So if you could provide some dummy data and an MWE of your current best approach, we could test some of them or even find other optimizations.

I’m sorry for taking some time to get back to this. Had a few other things come up that took my focus away from this task, but I’m now back to looking at this.

A basic script that gets at the general concept of what I’m trying to do might look like this:

function generate_data( nx::Int, ny::Int, nz::Int )::Vector{Float64}
    # I'm demonstrating this as a 1D vector, but it's also an option to have this be a 3D array
    # e.g. rand(nx, ny, nz)
    # if there's something to be gained by taking that approach
    return rand(Float64, nx * ny * nz )
end

function process_data!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_a = length(input_a)
    len_b = length(input_b)
    len_o = length(output)
    for i_idx in 1:len_a
        for out_idx in 1:len_o
            output_value = 0.0
            for j_idx in out_idx:n:len_b
                output_value += input_a[i_idx] * input_b[j_idx]
            end
            output[out_idx] += output_value
        end
    end
end

# Here I am making the data square, but it does not have to be
nx = 55
ny = 55
nz = 55

input_a = generate_data( nx, ny, nz )
input_b = generate_data( nx, ny, nz )
output = zeros( nx )

@time process_data!( output, input_a, input_b, nx )
println( output )

In testing, I find this takes ~40 seconds to complete. For comparison, I wrote a C++ program to do this same test problem (with the same loop structure) and it completes in something like 5 seconds, so I know there are definitely improvements to be made here. @time does report some allocations the first time the function is run, but the second run reports no allocations.

At this point I’m mainly curious to understand what optimizations I’m missing or what mistakes I’m making, as Julia being 8x slower is surprising to me. I would have expected the two to be similar in performance for a simple test problem like this, so I’m tempted to think I’m doing something wrong.

I have tried the @fastmath and @simd macros on the function and loops respectively, but they don’t appear to have much, if any, effect. I have also tried @inbounds, also without much influence on the result.
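
A sketch of how those annotations attach to the loop structure above (not my exact code, but close to one of the variants I tried):

# Annotated variant (sketch): @inbounds on the outer loop, @simd on the
# innermost (strided) reduction. Neither changed the timings much for me.
function process_data_simd!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    @inbounds for i_idx in eachindex(input_a)
        for out_idx in eachindex(output)
            output_value = zero(T)
            @simd for j_idx in out_idx:n:len_b
                output_value += input_a[i_idx] * input_b[j_idx]
            end
            output[out_idx] += output_value
        end
    end
end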

I don’t see anything wrong with your code except for using output_value = 0.0, which should be output_value = zero(T) to avoid a type instability if you pass in arrays with an element type other than Float64. I was able to shave some time off by simplifying the code somewhat:

function process_data2!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    for ai in input_a
        for out_idx in eachindex(output)
            output[out_idx] += ai * sum(@view input_b[out_idx:n:len_b])
        end
    end
end

The timing code:

output = zeros( nx )
@time process_data!( output, input_a, input_b, nx );

output2 = zeros( nx )
@time process_data2!( output2, input_a, input_b, nx );

@show maximum(abs.(output .- output2))

which produces the following output:

 30.923718 seconds (6.97 k allocations: 474.282 KiB, 0.05% compilation time)
 26.666368 seconds (5.30 k allocations: 359.182 KiB, 0.06% compilation time)
maximum(abs.(output .- output2)) = 1.6391277313232422e-7

Are you sure the C++ code is doing the identical work? If so, then maybe some of the other experts here can help further.

Perhaps the script doesn’t capture the actual problem you are working with, but this function

function process!( out::Vector{T}, a::Vector{T}, b::Vector{T}, n::Integer ) where { T <: AbstractFloat }
    asum = sum(a)
    for j in eachindex(out)
        val = zero(T)
        for k in j:n:lastindex(b)
            val += asum * b[k]
        end
        out[j] += val
    end
    return out
end

produces the same output, and runs in 250μs on my laptop.
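
The reason this is so much faster is that a[i] does not depend on the inner loop indices, so for each j the double sum factorizes:

$$\sum_i \sum_k a_i\, b_k = \Big( \sum_i a_i \Big) \Big( \sum_k b_k \Big)$$

where k runs over j:n:lastindex(b). That replaces the O(length(a) · length(b)) pair interactions with one pass over a plus one strided pass over b per output element.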

This one is almost as fast:

function process_!( out::Vector{T}, a::Vector{T}, b::Vector{T}, n::Integer ) where { T <: AbstractFloat }
    asum = sum(a)
    for j in eachindex(out)
        out[j] = asum * sum(@view b[j:n:end])
    end
    return out
end

At the risk of over-optimizing your simplified problem in a way that doesn’t generalize to your real problem, switching the loop order and using LoopVectorization yields an order of magnitude improvement:

using LoopVectorization
function process_data2!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_a = length(input_a)
    len_b = length(input_b)
    len_o = length(output)
    for out_idx in 1:len_o
        output_value = zero(T)
        @turbo for i_idx in 1:len_a
            s = zero(T)
            for j_idx in out_idx:n:len_b
                @inbounds s += input_a[i_idx] * input_b[j_idx]
            end
            output_value += s
        end
        @inbounds output[out_idx] = output_value
    end
end

A more aggressive algorithmic optimization yields another three orders of magnitude improvement, even without LoopVectorization:

function process_data3!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    len_o = length(output)
    s1 = sum(input_a)
    for out_idx in 1:len_o
        s2 = zero(T)
        for j_idx in out_idx:n:len_b
            @inbounds s2 += input_b[j_idx]
        end
        @inbounds output[out_idx] = s1 * s2
    end
end

I will also make a pitch for shorter variable names:

I think you would have seen those simplifications more easily if you had used a instead of input_a, i instead of i_idx, etc. There’s no more information in the name input_b than in the name b, and long compound names with underscores, especially when there are several similar-looking ones like input_a and input_b, make it harder to get the overall picture of what’s going on. You get the ‘wall of text’ effect.

Furthermore, using eachindex etc. instead of naming the length of the vectors and creating 1:len_o etc. also simplifies and clarifies what is going on. Simpler code, fewer names, and simpler names can help you improve readability, as well as quality, correctness and performance.

Now, your real case may be less easy to simplify, but shorter variable names (especially compared to needlessly long compound names) might help you simplify the real case too.
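
To illustrate with the strided inner loop from the earlier posts, the first version

for j_idx in out_idx:n:len_b
    output_value += input_a[i_idx] * input_b[j_idx]
end

becomes

for k in j:n:lastindex(b)
    val += a[i] * b[k]
end

and the second form is much easier to recognize at a glance as a simple strided accumulation.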
