Optimizing code with array assignments

TL;DR

I have some performance critical code that I am trying to optimize as it will otherwise take unacceptably long to run. Testing seems to suggest that the main culprit is assignments to arrays rather than the computation itself, so I am trying to figure out how best to optimize this.

Maybe useful to know

This is ultimately being run in parallel using MPI.jl. That said, the problem I’m trying to solve here does not require any MPI communication, so it can be treated as a single-threaded problem.

I do not have the ability to change the data I’m processing as it isn’t produced by me, but I can change how it’s represented in Julia if there are gains to be made there.

What I’m trying to do

In essence, I have two batches of data that I’ll call A and B. Each batch consists of four vectors of equal length: three coordinate vectors (X, Y, Z) and the values (F) at those coordinates. While all the vectors within a batch are the same length, the two batches are not necessarily the same length as each other.

For each element of F in A, I want to calculate a value with every element of F in B. At the same time, I also calculate the distance between the two points along each axis (dx, dy, dz). I then accumulate the result into a vector, with the index determined by the distance between the two points along that axis; there is one result vector per axis.

In Julia code, this would look something like:

# Note: This is not the actual code so there might be some syntax errors here
struct MyData{T <: AbstractFloat}
    X::Vector{T}
    Y::Vector{T}
    Z::Vector{T}
    F::Vector{T}
end

function do_calculation!( outX::Vector{Float64}, outY::Vector{Float64}, outZ::Vector{Float64}, inA::MyData, inB::MyData, stepX::Float64, stepY::Float64, stepZ::Float64 )
    for (ax, ay, az, af) in zip( inA.:X, inA.:Y, inA.:Z, inA.:F )
        for (bx, by, bz, bf) in zip( inB.:X, inB.:Y, inB.:Z, inB.:F )
            idx_x = round( Int, ( ax - bx ) / stepX ) + 1
            idx_y = round( Int, ( ay - by ) / stepY ) + 1 # same as x, with the y coordinate
            idx_z = round( Int, ( az - bz ) / stepZ ) + 1 # same as x, with the z coordinate
            value = calc_value( af, bf ) # This is essentially just a multiplication
            outX[idx_x] += value
            outY[idx_y] += value
            outZ[idx_z] += value
        end
    end
end
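
For a sense of scale, a hypothetical call site might look like the following. The sizes, the stand-in calc_value, and the output lengths derived from the coordinate extents are all made up for illustration, not part of my actual setup:

# Hypothetical usage sketch; calc_value stands in for the real function,
# which is essentially a multiplication.
calc_value( af, bf ) = af * bf

n_a, n_b = 1_000, 2_000
# A's coordinates are offset so that ax - bx > 0 and the computed index is >= 1.
A = MyData( 1.0 .+ rand(n_a), 1.0 .+ rand(n_a), 1.0 .+ rand(n_a), rand(n_a) )
B = MyData( rand(n_b), rand(n_b), rand(n_b), rand(n_b) )
stepX = stepY = stepZ = 0.01

# Output lengths chosen so the largest possible index fits.
nX = round( Int, ( maximum(A.X) - minimum(B.X) ) / stepX ) + 1
nY = round( Int, ( maximum(A.Y) - minimum(B.Y) ) / stepY ) + 1
nZ = round( Int, ( maximum(A.Z) - minimum(B.Z) ) / stepZ ) + 1

outX, outY, outZ = zeros(nX), zeros(nY), zeros(nZ)
do_calculation!( outX, outY, outZ, A, B, stepX, stepY, stepZ )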

Running @time do_calculation! with the Y and Z assignments commented out takes something like 10 seconds for my sample dataset. Uncommenting Y increases that to 20 seconds, and with all three components it’s something like 30 seconds. I have tested this both by assigning to each vector using its corresponding idx_*, and by commenting out idx_y and idx_z and indexing all three vectors with idx_x. This leads me to conclude that the primary driver of the run time is the array assignments, rather than the calculation of idx_* or value.

There are no unexpected allocations or GC time indicated by @time.

It isn’t surprising that extra stores take extra time, but I was a little surprised to see the runtime dominated by the assignments. I’m curious whether there are ways I can speed this up.

I am not entirely sure how to test this in Julia, but your bottleneck may be the cache. If you very frequently access “random” positions in a sufficiently large array, then the time of the computation will be dominated by the time necessary to get this information from RAM and copy it to the processor cache. In C/C++, I think I used the valgrind tool to monitor the number of cache/page misses.
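
A crude way to at least see the effect from within Julia might be something like this (a sketch, not a proper cache profiler):

# Crude cache probe (sketch): the same number of scatter-adds into a small
# array (fits in cache) vs. a large one (doesn't).
function scatter_add!( out, idx, vals )
    @inbounds for i in eachindex(idx, vals)
        out[idx[i]] += vals[i]
    end
end

n_ops = 10^7
vals  = rand(n_ops)
small = zeros(10^3)              # ~8 KB, sits comfortably in L1/L2 cache
large = zeros(10^7)              # ~80 MB, well beyond typical L3 cache
idx_s = rand(1:10^3, n_ops)
idx_l = rand(1:10^7, n_ops)

@time scatter_add!( small, idx_s, vals )   # run twice to exclude compilation
@time scatter_add!( large, idx_l, vals )

On Linux, running julia under perf stat -e cache-misses,cache-references should also give actual miss counts.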


That’s a good point. The more-or-less random assignments to the arrays were something I’ve wondered about too. Maybe a potential optimization would be to make the assignments more regular (by sorting the inputs or something like that), or to reduce the size of the array I’m assigning to?
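
For the sorting idea, I imagine something like this (just a sketch; sorting B by one coordinate so that at least the outX writes become more sequential):

# Sketch: reorder B by X so the outX indices (driven by ax - bx)
# move monotonically within the inner loop. Only helps the X axis.
function sort_by_x( d::MyData )
    p = sortperm( d.X )
    return MyData( d.X[p], d.Y[p], d.Z[p], d.F[p] )
end

B_sorted = sort_by_x( B )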


I guess you could test that hypothesis by pushing value together with the (idx_x, idx_y, idx_z) triple into two preallocated vectors (use sizehint! to avoid reallocations inside the loop), and then shift the construction of outX, outY, outZ to a later stage.
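
In code, the idea would look roughly like this (untested sketch, reusing the names from your function):

# Two-phase sketch: collect (value, index triple) pairs first, then do all
# the scattered writes in a separate assembly pass.
vals = Float64[]
idxs = NTuple{3,Int}[]
# Note: these buffers are length(inA.F) * length(inB.F), which may be large.
sizehint!( vals, length(inA.F) * length(inB.F) )
sizehint!( idxs, length(inA.F) * length(inB.F) )

# Phase 1, inside the double loop, instead of the three += assignments:
#     push!( vals, value )
#     push!( idxs, (idx_x, idx_y, idx_z) )

# Phase 2, a separate assembly pass:
for (v, (ix, iy, iz)) in zip( vals, idxs )
    outX[ix] += v
    outY[iy] += v
    outZ[iz] += v
end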

EDIT: I overlooked that you add value to different indices in outX, outY, outZ on every iteration, so I guess doing what I suggested would just shift the memory access costs to the assembly stage…


What happens if you make three separate loops, one for outX, one for outY, and one for outZ? That way Julia only has to deal with one of these vectors at a time.
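
Roughly like this (just a sketch, reusing calc_value and the names from your snippet):

# Sketch: handle one axis per pass, so only one output vector competes for
# cache at a time (calc_value gets recomputed once per axis, though).
function do_axis!( out::Vector{T}, ac::Vector{T}, bc::Vector{T},
                   af::Vector{T}, bf::Vector{T}, step::T ) where { T <: AbstractFloat }
    for i in eachindex(ac, af), j in eachindex(bc, bf)
        out[ round( Int, ( ac[i] - bc[j] ) / step ) + 1 ] += calc_value( af[i], bf[j] )
    end
end

do_axis!( outX, inA.X, inB.X, inA.F, inB.F, stepX )
do_axis!( outY, inA.Y, inB.Y, inA.F, inB.F, stepY )
do_axis!( outZ, inA.Z, inB.Z, inA.F, inB.F, stepZ )

The trade-off is that calc_value is evaluated once per axis instead of once per pair, but if the stores dominate, that might still come out ahead.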


I haven’t tried exactly this, but an earlier iteration of this code effectively replaced the inner loop with its vectorized equivalent. Something like

function assign_values_to_array!( output, in_idx, in_value )
    for (idx, value) in zip( in_idx, in_value )
        output[idx] = value
    end
end

# Inside of the for(ax, ay ...) loop...
idx_x = round.( Int, abs.(ax .- inB.:X) ./ stepX ) .+ 1
value = calc_value.( af, inB.:F )
assign_values_to_array!( outX, idx_x, value )
# And similarly for Y and Z

and as I recall, this approach took about the same amount of time, but it required additional allocations and GC time that my current approach avoids, which was the motivation for the change.

I’ll definitely try restructuring this function to work on one dimension at a time rather than all 3 and see where that lands me. That would also make the function more amenable to sorting inputs which might help with paging issues if that turns out to be the problem.

What’s up with the : in field access? You can just write inA.X, etc.

Such allocations can be fine if they are done just once (and not inside the loop, where they cause repeated GC pauses), which I think is the case in your example, right?

And am I wrong, or couldn’t assign_values_to_array! be just written as (also without allocations)

outX[idx_x] .= value

There have already been some ideas floating around here.
So if you could provide some dummy data and an MWE of your current best approach, we could test some of them or even find other optimizations.

I’m sorry for taking some time to get back to this. Had a few other things come up that took my focus away from this task, but I’m now back to looking at this.

A basic script that gets at the general concept of what I’m trying to do might look like this:

function generate_data( nx::Int, ny::Int, nz::Int )::Vector{Float64}
    # I'm demonstrating this as a 1D vector, but it's also an option to have this be a 3D array
    # e.g. rand(nx, ny, nz)
    # if there's something to be gained by taking that approach
    return rand(Float64, nx * ny * nz )
end

function process_data!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_a = length(input_a)
    len_b = length(input_b)
    len_o = length(output)
    for i_idx in 1:len_a
        for out_idx in 1:len_o
            output_value = 0.0
            for j_idx in out_idx:n:len_b
                output_value += input_a[i_idx] * input_b[j_idx]
            end
            output[out_idx] += output_value
        end
    end
end

# Here I am making the data square, but it does not have to be
nx = 55
ny = 55
nz = 55

input_a = generate_data( nx, ny, nz )
input_b = generate_data( nx, ny, nz )
output = zeros( nx )

@time process_data!( output, input_a, input_b, nx )
println( output )

In testing, I find this takes ~40 seconds to complete. For comparison, I wrote a C++ program to do this same test problem (with the same loop structure) and it completes in something like 5 seconds, so I know there are definitely improvements to be made here. @time does report some allocations the first time the function is run, but the second run reports no allocations.

At this point I’m mainly curious to understand what optimizations I’m missing or what mistakes I’m making, as Julia being 8x slower is surprising to me. I would have expected the two to be similar in performance for a simple test problem like this, so I’m tempted to think I’m doing something wrong.

I have tried the @fastmath and @simd macros on the function and loops respectively, but they don’t appear to have much, if any, effect. I have also tried @inbounds, also without much influence on the result.
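
A sketch of how those annotations attach to the loop structure above (not my exact code, but close to one of the variants I tried):

# Annotated variant (sketch): @inbounds on the outer loop, @simd on the
# innermost (strided) reduction. Neither changed the timings much for me.
function process_data_simd!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    @inbounds for i_idx in eachindex(input_a)
        for out_idx in eachindex(output)
            output_value = zero(T)
            @simd for j_idx in out_idx:n:len_b
                output_value += input_a[i_idx] * input_b[j_idx]
            end
            output[out_idx] += output_value
        end
    end
end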

I don’t see anything wrong with your code except for using output_value = 0.0, which should be output_value = zero(T) to avoid a type instability if you pass in arrays with an element type other than Float64. I was able to shave some time off by simplifying the code somewhat:

function process_data2!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    for ai in input_a
        for out_idx in eachindex(output)
            output[out_idx] += ai * sum(@view input_b[out_idx:n:len_b])
        end
    end
end

The timing code:

output = zeros( nx )
@time process_data!( output, input_a, input_b, nx );

output2 = zeros( nx )
@time process_data2!( output2, input_a, input_b, nx );

@show maximum(abs.(output .- output2))

which produces the following output:

 30.923718 seconds (6.97 k allocations: 474.282 KiB, 0.05% compilation time)
 26.666368 seconds (5.30 k allocations: 359.182 KiB, 0.06% compilation time)
maximum(abs.(output .- output2)) = 1.6391277313232422e-7

Are you sure the C++ code is doing the identical work? If so, then maybe some of the other experts here can help further.

Perhaps the script doesn’t capture the actual problem you are working with, but this function

function process!( out::Vector{T}, a::Vector{T}, b::Vector{T}, n::Integer ) where { T <: AbstractFloat }
    asum = sum(a)
    for j in eachindex(out)
        val = zero(T)
        for k in j:n:lastindex(b)
            val += asum * b[k]
        end
        out[j] += val
    end
    return out
end

produces the same output, and runs in 250μs on my laptop.
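
The reason this is so much faster is that a[i] does not depend on the inner loop indices, so for each j the double sum factorizes:

$$\sum_i \sum_k a_i\, b_k = \Big( \sum_i a_i \Big) \Big( \sum_k b_k \Big)$$

where k runs over j:n:lastindex(b). That replaces the O(length(a) · length(b)) pair interactions with one pass over a plus one strided pass over b per output element.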

This one is almost as fast:

function process_!( out::Vector{T}, a::Vector{T}, b::Vector{T}, n::Integer ) where { T <: AbstractFloat }
    asum = sum(a)
    for j in eachindex(out)
        out[j] = asum * sum(@view b[j:n:end])
    end
    return out
end

At the risk of over-optimizing your simplified problem in a way that doesn’t generalize to your real problem, switching the loop order and using LoopVectorization yields an order of magnitude improvement:

using LoopVectorization
function process_data2!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_a = length(input_a)
    len_b = length(input_b)
    len_o = length(output)
    for out_idx in 1:len_o
        output_value = zero(T)
        @turbo for i_idx in 1:len_a
            s = zero(T)
            for j_idx in out_idx:n:len_b
                @inbounds s += input_a[i_idx] * input_b[j_idx]
            end
            output_value += s
        end
        @inbounds output[out_idx] = output_value
    end
end

A more aggressive algorithmic optimization yields another three orders of magnitude improvement, even without LoopVectorization:

function process_data3!( output::Vector{T}, input_a::Vector{T}, input_b::Vector{T}, n::Int ) where { T <: AbstractFloat }
    len_b = length(input_b)
    len_o = length(output)
    s1 = sum(input_a)
    for out_idx in 1:len_o
        s2 = zero(T)
        for j_idx in out_idx:n:len_b
            @inbounds s2 += input_b[j_idx]
        end
        @inbounds output[out_idx] = s1 * s2
    end
end

I will also make a pitch for shorter variable names:

I think you would have seen those simplifications more easily if you had used a instead of input_a, i instead of i_idx, etc. There’s no more information in the name input_b than in the name b, and long compound names with underscores, especially when there are several similar-looking ones like input_a and input_b, make it harder to get the overall picture of what’s going on. You get the ‘wall of text’ effect.

Furthermore, using eachindex etc. instead of naming the length of the vectors and creating 1:len_o etc. also simplifies and clarifies what is going on. Simpler code, fewer names, and simpler names can help you improve readability, as well as quality, correctness and performance.

Now, your real case may be less easy to simplify, but shorter variable names (especially compared to needlessly long compound names) might help you simplify the real case too.
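
To illustrate with the strided inner loop from the earlier posts, the first version

for j_idx in out_idx:n:len_b
    output_value += input_a[i_idx] * input_b[j_idx]
end

becomes

for k in j:n:lastindex(b)
    val += a[i] * b[k]
end

and the second form is much easier to recognize at a glance as a simple strided accumulation.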
