# TL;DR

I have some performance-critical code that I am trying to optimize, as it will otherwise take unacceptably long to run. Testing suggests that the main culprit is the assignments to arrays rather than the computation itself, so I am trying to figure out how best to optimize this.

## Maybe useful to know

This is ultimately being run in parallel using `MPI.jl`. That said, the problem I’m trying to solve here does not require any MPI communication, so it can be treated as a single-threaded problem.

I do not have the ability to change the data I’m processing as it isn’t produced by me, but I can change how it’s represented in Julia if there are gains to be made there.

# What I’m trying to do

In essence, I have two batches of data that I’ll call A and B. Each batch consists of four vectors of equal length: three coordinate vectors (X, Y, Z) and the value (F) at those coordinates. While all the vectors within each batch are the same length, the two batches are not necessarily the same length as each other.

For each element of F in A, I want to calculate a value with every element of F in B. At the same time, I also calculate the distance between the two points along each axis (dx, dy, dz). I then accumulate the result into a vector, with the index determined by the distance between the two points along that axis; there is one result vector per axis.

In Julia code, this would look something like:

```
# Note: This is not the actual code, so there might be some syntax errors here
struct MyData{T <: AbstractFloat}
    X::Vector{T}
    Y::Vector{T}
    Z::Vector{T}
    F::Vector{T}
end

function do_calculation!(outX::Vector{Float64}, outY::Vector{Float64}, outZ::Vector{Float64},
                         inA::MyData, inB::MyData,
                         stepX::Float64, stepY::Float64, stepZ::Float64)
    for (ax, ay, az, af) in zip(inA.X, inA.Y, inA.Z, inA.F)
        for (bx, by, bz, bf) in zip(inB.X, inB.Y, inB.Z, inB.F)
            idx_x = round(Int, (ax - bx) / stepX) + 1
            idx_y = round(Int, (ay - by) / stepY) + 1  # same as x, but with the y coordinate
            idx_z = round(Int, (az - bz) / stepZ) + 1  # same as x, but with the z coordinate
            value = calc_value(af, bf)  # essentially just a multiplication
            outX[idx_x] += value
            outY[idx_y] += value
            outZ[idx_z] += value
        end
    end
end
```
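For concreteness, here is a self-contained, runnable version of that sketch. Everything specific in it is made up for illustration: `calc_value` is a plain multiplication, the data is random, and the coordinates are arranged so that every computed index lands in bounds.

```julia
struct MyData{T <: AbstractFloat}
    X::Vector{T}
    Y::Vector{T}
    Z::Vector{T}
    F::Vector{T}
end

# Stand-in for the real calculation (essentially just a multiplication)
calc_value(af, bf) = af * bf

function do_calculation!(outX, outY, outZ, inA::MyData, inB::MyData,
                         stepX::Float64, stepY::Float64, stepZ::Float64)
    for (ax, ay, az, af) in zip(inA.X, inA.Y, inA.Z, inA.F)
        for (bx, by, bz, bf) in zip(inB.X, inB.Y, inB.Z, inB.F)
            idx_x = round(Int, (ax - bx) / stepX) + 1
            idx_y = round(Int, (ay - by) / stepY) + 1
            idx_z = round(Int, (az - bz) / stepZ) + 1
            value = calc_value(af, bf)
            outX[idx_x] += value
            outY[idx_y] += value
            outZ[idx_z] += value
        end
    end
    return nothing
end

# Synthetic data: A's coordinates sit in [1, 2) and B's in [0, 1), so every
# coordinate difference is in (0, 2) and every index is ≥ 1.
n, m = 1_000, 800
A = MyData(1 .+ rand(n), 1 .+ rand(n), 1 .+ rand(n), rand(n))
B = MyData(rand(m), rand(m), rand(m), rand(m))
step = 0.01
len = round(Int, 2 / step) + 1          # largest possible index
outX, outY, outZ = zeros(len), zeros(len), zeros(len)

do_calculation!(outX, outY, outZ, A, B, step, step, step)
```

Since every pair contributes `calc_value(af, bf)` exactly once to each output vector, `sum(outX) ≈ sum(A.F) * sum(B.F)` here, which makes a handy sanity check when restructuring the loop.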

Running `@time do_calculation!` with the Y and Z assignments commented out takes something like 10 seconds on my sample dataset. Uncommenting Y increases that to 20 seconds, and with all three components it’s something like 30 seconds. I have tested this both by assigning to each vector according to its corresponding `idx_*`, and by commenting out `idx_y` and `idx_z` and assigning to all three vectors using `idx_x`. This leads me to conclude that the primary driver of the run time is the array assignments, rather than the calculation of `idx_*` or `value`.
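For reference, the `idx_x`-only variant of that experiment can be reproduced with a stripped-down pair of kernels like the following (synthetic data, with a plain multiplication standing in for `calc_value`):

```julia
# Same index arithmetic and value in both kernels; one stores to a single
# vector, the other stores the same value to all three.
function store_one!(out, X_A, F_A, X_B, F_B, step)
    for (ax, af) in zip(X_A, F_A), (bx, bf) in zip(X_B, F_B)
        idx = round(Int, (ax - bx) / step) + 1
        out[idx] += af * bf
    end
end

function store_three!(outX, outY, outZ, X_A, F_A, X_B, F_B, step)
    for (ax, af) in zip(X_A, F_A), (bx, bf) in zip(X_B, F_B)
        idx = round(Int, (ax - bx) / step) + 1
        v = af * bf
        outX[idx] += v
        outY[idx] += v  # same index for all three, as in the idx_x-only test
        outZ[idx] += v
    end
end

# Synthetic data arranged so every index is in bounds
n, m, step = 2_000, 2_000, 0.01
X_A, F_A = 1 .+ rand(n), rand(n)
X_B, F_B = rand(m), rand(m)
len = round(Int, 2 / step) + 1
out1 = zeros(len)
outX, outY, outZ = zeros(len), zeros(len), zeros(len)

store_one!(out1, X_A, F_A, X_B, F_B, step)                # also warms up (compiles)
store_three!(outX, outY, outZ, X_A, F_A, X_B, F_B, step)
@time store_one!(zeros(len), X_A, F_A, X_B, F_B, step)
@time store_three!(zeros(len), zeros(len), zeros(len), X_A, F_A, X_B, F_B, step)
```

Both kernels perform the identical pair loop, so any timing gap between them is attributable to the extra stores alone.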

There are no unexpected allocations or GC time indicated by `@time`.

It isn’t surprising to see that extra storage operations take extra time, but I was a little surprised to see that the runtime seems to be dominated by the assignments. I’m curious if there are ways I can speed this up.