Slowdown when writing to too many arrays within a loop

I came across a significant slowdown in my code when attempting to write to too many matrices within one loop. Below is a silly (and slightly long, my apologies) MWE that just writes zeroes. In my actual code, I perform some operations and then store their results.

Basically, the moment I try to write to 9 matrices within the same loop, performance degrades by a factor of >10.

using UnPack

len_A = 15
len_B = 400
len_C = 1198

struct BigMatrices
    matrix1::Array{Float64,3}
    matrix2::Array{Float64,3}
    matrix3::Array{Float64,3}
    matrix4::Array{Float64,3}
    matrix5::Array{Float64,3}
    matrix6::Array{Float64,3}
    matrix7::Array{Float64,3}
    matrix8::Array{Float64,3}
    matrix9::Array{Float64,3}
    matrix10::Array{Float64,3}
    matrix11::Array{Float64,3}
    matrix12::Array{Float64,3}
    matrix13::Array{Float64,3}
    matrix14::Array{Float64,3}
end

bigMatrices = BigMatrices(
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A),
    zeros(len_C, len_B, len_A)
);

function doOperations!(len_A,len_B,len_C,bigMatrices::BigMatrices)
    @unpack matrix1, matrix2, matrix3, matrix4, matrix5, matrix6, matrix7, 
    matrix8, matrix9, matrix10, matrix11, matrix12, matrix13, matrix14 = bigMatrices;

    for a ∈ 1:len_A
        for b ∈ 1:len_B
            for c ∈ 1:len_C
                matrix1[c, b, a]           = 0.0
                matrix2[c, b, a]           = 0.0
                matrix3[c, b, a]           = 0.0
                matrix4[c, b, a]           = 0.0
                matrix5[c, b, a]           = 0.0
                matrix6[c, b, a]           = 0.0
                matrix7[c, b, a]           = 0.0
                matrix8[c, b, a]           = 0.0
            end
        end
    end

    for a ∈ 1:len_A
        for b ∈ 1:len_B
            for c ∈ 1:len_C
                matrix9[c, b, a]            = 0.0
            end
        end
    end

    return nothing
end

function doOperationsSlow!(len_A,len_B,len_C,bigMatrices::BigMatrices)
    @unpack matrix1, matrix2, matrix3, matrix4, matrix5, matrix6, matrix7, 
    matrix8, matrix9, matrix10, matrix11, matrix12, matrix13, matrix14 = bigMatrices;


    for a ∈ 1:len_A
        for b ∈ 1:len_B
            for c ∈ 1:len_C
                matrix1[c, b, a]           = 0.0
                matrix2[c, b, a]           = 0.0
                matrix3[c, b, a]           = 0.0
                matrix4[c, b, a]           = 0.0
                matrix5[c, b, a]           = 0.0
                matrix6[c, b, a]           = 0.0
                matrix7[c, b, a]           = 0.0
                matrix8[c, b, a]           = 0.0
                matrix9[c, b, a]           = 0.0
            end
        end
    end

    return nothing
end
julia> @benchmark doOperations!(len_A,len_B,len_C,bigMatrices)
BenchmarkTools.Trial: 93 samples with 1 evaluation per sample.
 Range (min … max):  31.240 ms … 66.930 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     55.651 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   54.205 ms Β±  7.953 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                               β–‚   β–‚    β–…β–… β–‚β–… β–… β–…β–‚β–‚ β–ˆ β–‚β–‚  β–‚    
  β–…β–β–β–β–β–β–ˆβ–β–β–β–β–β–…β–…β–…β–ˆβ–ˆβ–β–β–…β–…β–…β–β–…β–ˆβ–…β–β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ˆβ–ˆβ–ˆβ–β–ˆβ–β–ˆ ▁
  31.2 ms         Histogram: frequency by time        66.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark doOperationsSlow!(len_A,len_B,len_C,bigMatrices)
BenchmarkTools.Trial: 6 samples with 1 evaluation per sample.
 Range (min … max):  813.094 ms … 896.622 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     837.690 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   844.685 ms Β±  32.067 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–ˆ             β–ˆ  β–ˆ                   β–ˆ                    β–ˆ  
  β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  813 ms           Histogram: frequency by time          897 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I was hoping someone could help me understand what the issue/general principle is here, and how to avoid it? I assume it has to do with memory access, but am not sure when it is or is not something to be careful about.

Poor spatial locality? Your loop is memory-bound, and memory is optimized for reading/writing data stored consecutively. If you generally want to access elements of all 9 matrices together, it will probably be better to store them as an array of structs (so that their elements are stored together) rather than as a struct of arrays (where their elements are stored separately).


This makes sense, thank you.

Can I ask what β€œtogether” and β€œseparately” mean in this context? I understand the advantage of accessing elements that are stored contiguously (as in the column of a matrix), but am not sure that I have an image of how this works in this case. If this question is too basic, please disregard (or if you have a reference, I’d be very thankful)

I couldn’t reproduce the slowdown on my machine, so it may be hardware- or layout-specific. One thing you can try is storing the nine outputs in a single Array{Float64,4} with an extra “output index” dimension. Since Julia arrays are column-major, putting the output index first keeps the nine values you write in each inner iteration adjacent in memory and reduces the number of concurrent write streams, which often helps cache write-combining.
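A minimal sketch of that idea, with the output index as the first dimension so the nine writes per `(c, b, a)` land next to each other (names like `all_out` and `zero_all!` are just illustrative):

```julia
# Sizes from the original example; n_out replaces the nine separate matrices.
len_A, len_B, len_C = 15, 400, 1198
n_out = 9
all_out = zeros(n_out, len_C, len_B, len_A)

function zero_all!(all_out)
    for a in axes(all_out, 4), b in axes(all_out, 3), c in axes(all_out, 2)
        for k in axes(all_out, 1)        # nine adjacent writes per (c, b, a)
            all_out[k, c, b, a] = 0.0
        end
    end
    return nothing
end

zero_all!(all_out)
```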


I’ll explain the Array of Structs (AoS) and Struct of Arrays (SoA) layouts that stevengj mentioned.


AoS – store full records together.

struct Person
    name::String
    age::Int
    height::Float64
end

people = Person[
    Person("Alice", 30, 1.68),
    Person("Bob",   41, 1.82),
    Person("Cara",  27, 1.60),
]

using Statistics: mean

# Good when each pass touches multiple fields of one person:
avg_ratio = mean(p -> p.height / p.age, people)

SoA – store each field contiguously.

names   = ["Alice", "Bob", "Cara"]
ages    = [30, 41, 27]
heights = [1.68, 1.82, 1.60]

# Good for field-wise passes (streaming/vectorization):
ages .+= 1                      # birthday batch
tall_mask = heights .> 1.70     # fast, contiguous scan of one field

Tips: AoS if kernels use many fields per record; SoA if kernels sweep one field across many records. Consider StructArrays.jl if you want AoS-like syntax with SoA storage.
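To make the StructArrays.jl point concrete, here is a tiny sketch (assuming the package is installed, and reusing the Person struct from above): you get AoS-style element access while each field is stored contiguously.

```julia
using StructArrays   # SoA storage behind AoS-like syntax

struct Person
    name::String
    age::Int
    height::Float64
end

people = StructArray(Person[
    Person("Alice", 30, 1.68),
    Person("Bob",   41, 1.82),
    Person("Cara",  27, 1.60),
])

people.age          # the ages as one contiguous Vector{Int}
people[1]           # materializes Person("Alice", 30, 1.68) on access
people.age .+= 1    # field-wise update, still one contiguous sweep
```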


So, instead of

a1 = [1,2,3]
a2 = [4,5,6]

you have a vector

a = [(1,4), (2,5), (3,6)]

so that a[2] is both values, i.e. a tuple (2, 5).
You can then zero them in one swoop:

a[2] = (0,0)

The following converts individual arrays to this format (EDIT: changed to accommodate different element types):

function aos(arrays::Vararg{Array, N}) where {N}
    allequal(size, arrays) || error("arrays must have same size")
    siz = size(arrays[1])
    # pick up the types
    types = eltype.(arrays)
    tup = Tuple{types...}
    arr = Array{tup, length(siz)}(undef, siz)
    # fill it in
    for idx in eachindex(arr)
        arr[idx] = ntuple(i -> arrays[i][idx]::types[i], Val(N))
    end
    return arr
end

# example
a1 = rand(2,3,4)
a2 = 10*rand(2,3,4)
a3 = 100*rand(2,3,4)
julia> aosarr = aos(a1, a2, a3)
2Γ—3Γ—4 Array{Tuple{Float64, Float64, Float64}, 3}:
[:, :, 1] =
 (0.850998, 8.47985, 76.4457)  (0.314999, 7.3758, 11.2658)  (0.819606, 6.69038, 7.67161)
 (0.62594, 9.82661, 81.9733)   (0.737635, 2.32463, 83.889)  (0.884483, 9.16913, 91.3917)
...
[:, :, 4] =
 (0.813978, 7.27598, 84.246)   (0.102138, 7.33928, 58.4347)  (0.575778, 8.84198, 48.0925)
 (0.382144, 1.15496, 63.4065)  (0.356731, 3.60936, 21.9725)  (0.231331, 5.7963, 91.5765)

There is also a package StructArrays.jl which does the same, but it’s more like a view, so it must be collected to actually reorganize the storage:

julia> aosarr2 = collect(StructArray( (a1, a2, a3)))
2Γ—3Γ—4 Array{Tuple{Float64, Float64, Float64}, 3}:
[:, :, 1] =
 (0.850998, 8.47985, 76.4457)  (0.314999, 7.3758, 11.2658)  (0.819606, 6.69038, 7.67161)
 (0.62594, 9.82661, 81.9733)   (0.737635, 2.32463, 83.889)  (0.884483, 9.16913, 91.3917)
...
[:, :, 4] =
 (0.813978, 7.27598, 84.246)   (0.102138, 7.33928, 58.4347)  (0.575778, 8.84198, 48.0925)
 (0.382144, 1.15496, 63.4065)  (0.356731, 3.60936, 21.9725)  (0.231331, 5.7963, 91.5765)

Then the loop goes like:

for a in 1:lenA, b in 1:lenB, c in 1:lenC
    aosarr[c, b, a] = (e1, e2, e3)
end

Alternatively, you can do as @karei suggests and add an extra dimension, storing all the 3-dimensional arrays in a single 4-dimensional array allarrays, with the first dimension running from 1 to 14, or however many arrays you have:

for a in 1:lenA, b in 1:lenB, c in 1:lenC, anum in 1:14
    allarrays[anum, c, b, a] = 0.0
end

If I were to hazard a guess, you managed to confuse the LLVM auto-vectorizer, and you’re observing the slowdown from simd to scalar code.

The auto-vectorizer checks that your matrices don’t alias (i.e., that no two of them point to the same memory).

Why do you need all of the matrices in the same loop? Try making LLVM’s job simpler by telling it that @assert size(matrix1) == (len_C, len_B, len_A), and the same for the other matrices. Maybe add @inbounds or @simd or @simd ivdep if applicable.

The kind of code you’re doing – interleaving, i.e. do one op on matrix1, one on matrix2, etc., then loop to the next index – can be very nice in some situations, especially to hide latency or to simd over the sequence of matrices (as opposed to simd over indices), but you need to be very careful not to run out of named (architectural) registers and not to confuse the auto-vectorizer if you believe that simd over indices is appropriate.

Unfortunately, there is AFAIU no good Julia way of telling the compiler that arrays don’t alias.


Thank you very much for the detailed explanation. This makes a lot of sense, and I think it would also fit well within my application.

@karei’s suggestion of the 4D output is also spot on and I think would work well. The main thing I have so far liked about the SoA approach over, say, the 4D array, is the legibility of having dedicated struct objects/array names. In my application, these are elements of a dynamic investment model (the matrices are objects like allFlowDividends, allInvestment, allInvestmentRate within a struct solvObj). I guess one could write convenience functions that preserve the ability to handle these while keeping them “close” in memory? (e.g. by naming views of them)
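For instance, something like the following sketch (small illustrative sizes; the names mirror my structs but everything here is hypothetical): one backing 4-D array, with readable views bound to it once.

```julia
# One backing 4-D array; output index first (column-major adjacency).
len_A, len_B, len_C = 3, 4, 5
allResults = zeros(3, len_C, len_B, len_A)

# Named, allocation-free views. Note these are strided (stride 3 along the
# flattened memory), so hot loops should still write through allResults.
allFlowDividends  = @view allResults[1, :, :, :]
allInvestment     = @view allResults[2, :, :, :]
allInvestmentRate = @view allResults[3, :, :, :]

allInvestment[5, 2, 1] = 1.0           # readable name...
allResults[2, 5, 2, 1] == 1.0          # ...same memory underneath
```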


Thanks! Your suggestion gets some performance back, though curiously not all of it (I also edited the function to remove the superfluous matrix10 through matrix14):


function doOperationsSlowImproved!(len_A,len_B,len_C,bigMatrices::BigMatrices)
    @unpack matrix1, matrix2, matrix3, matrix4, matrix5, matrix6, matrix7, 
    matrix8, matrix9 = bigMatrices;

    @assert size(matrix1) == (len_C,len_B,len_A)
    @assert size(matrix2) == (len_C,len_B,len_A)
    @assert size(matrix3) == (len_C,len_B,len_A)
    @assert size(matrix4) == (len_C,len_B,len_A)
    @assert size(matrix5) == (len_C,len_B,len_A)
    @assert size(matrix6) == (len_C,len_B,len_A)
    @assert size(matrix7) == (len_C,len_B,len_A)
    @assert size(matrix8) == (len_C,len_B,len_A)
    @assert size(matrix9) == (len_C,len_B,len_A)

    @inbounds @simd for a ∈ 1:len_A
        @inbounds @simd for b ∈ 1:len_B
            @inbounds @simd for c ∈ 1:len_C
                matrix1[c, b, a]           = 0.0
                matrix2[c, b, a]           = 0.0
                matrix3[c, b, a]           = 0.0
                matrix4[c, b, a]           = 0.0
                matrix5[c, b, a]           = 0.0
                matrix6[c, b, a]           = 0.0
                matrix7[c, b, a]           = 0.0
                matrix8[c, b, a]           = 0.0
                matrix9[c, b, a]           = 0.0
            end
        end
    end

    return nothing
end
julia> @benchmark doOperationsSlowImproved!(len_A,len_B,len_C,bigMatrices)
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min … max):  438.379 ms … 577.476 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     543.618 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   534.933 ms Β±  40.876 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁                     ▁                     β–ˆβ– ▁   ▁▁ ▁     ▁  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–β–ˆβ–β–β–β–ˆβ–ˆβ–β–ˆβ–β–β–β–β–β–ˆ ▁
  438 ms           Histogram: frequency by time          577 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

The reason I use the matrices within the same loop is that I pre-compute some objects at each point in the state space of a model (for every value of some variables a,b,c, I compute f(a,b,c), g(a,b,f(a,b,c)) and so on)
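With the AoS layout, that kind of pass could look something like the sketch below (f and g here are trivial stand-ins for the actual model objects, and the sizes are illustrative):

```julia
# Stand-in model functions; the real ones come from the model.
f(a, b, c) = a + b + c
g(a, b, fv) = a * b - fv

len_A, len_B, len_C = 2, 3, 4
results = Array{NTuple{2,Float64}}(undef, len_C, len_B, len_A)

for a in 1:len_A, b in 1:len_B, c in 1:len_C
    fv = f(a, b, c)
    results[c, b, a] = (fv, g(a, b, fv))   # one adjacent write per point
end
```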

Thank you for the explanation! On the system, for reference

julia> versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 24 Γ— Apple M2 Ultra
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 16 default, 1 interactive, 8 GC (on 16 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1

Maybe DimensionalData.jl would let you keep using convenient names along one axis.


Try benchmarking with your f and g, etc.

The reason you’re seeing such a giant difference is that your benchmark code is doing a glorified memset, which can be very fast if appropriate code is generated, or medium-slow if the compiler doesn’t see it.

If your other computations take significant time anyways, then there is no opportunity to simd and no big difference.

Whether micro-optimizations make sense depends on the relative speedup for the entire computation. If the code will take 10 minutes with the f and g, then there is little value in fighting over the 800 ms you’re observing here (…and often, the 800ms will vanish like a mirage, because CPUs are good at latency-hiding).

Do you need to provide the matrix sizes as inputs for some reason? Why not rely on the size function instead? And then do

@assert size(matrix1) == size(matrix2) == size(matrix3) == ...

or something like that. Or perhaps

for ind in eachindex(matrix1, matrix2, matrix3, matrix4, ...)
    matrix1[ind]  = 0.0
    matrix2[ind]  = 0.0
    matrix3[ind]  = 0.0
    ...
end

In general, providing array sizes is not considered good style, since arrays carry their size information with them. It is both more convenient and safe to use size, eachindex, etc.


Here you’re not taking advantage of the memory locality that @sgaure and others recommended. Locality alone is good for a ~40x speedup:

function doOperationslocal!(len_A, len_B, len_C, bigMatrices::BigMatrices)
    (; matrix1, matrix2, matrix3, matrix4, matrix5, matrix6, matrix7,
       matrix8, matrix9, matrix10, matrix11, matrix12, matrix13, matrix14) = bigMatrices

    matrix1 .= 0.0
    matrix2 .= 0.0
    matrix3 .= 0.0
    matrix4 .= 0.0
    matrix5 .= 0.0
    matrix6 .= 0.0
    matrix7 .= 0.0
    matrix8 .= 0.0
    matrix9 .= 0.0
    matrix10 .= 0.0
    matrix11 .= 0.0
    matrix12 .= 0.0
    matrix13 .= 0.0
    matrix14 .= 0.0
    return nothing
end

julia> using BenchmarkTools

julia> @benchmark doOperations14!($len_A,$len_B,$len_C,$bigMatrices)
BenchmarkTools.Trial: 18 samples with 1 evaluation per sample.
 Range (min … max):  287.223 ms … 290.204 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     288.999 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   288.671 ms Β± 979.501 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒ                                        β–ˆ                    
  β–‡β–ˆβ–β–β–β–‡β–β–β–β–β–β–‡β–β–β–β–‡β–‡β–β–β–β–β–β–β–‡β–β–β–β–β–β–β–‡β–β–β–β–β–β–β–β–β–β–β–‡β–ˆβ–β–‡β–β–β–‡β–‡β–β–β–β–β–β–‡β–β–β–β–β–β–‡ ▁
  287 ms           Histogram: frequency by time          290 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark doOperationslocal!($len_A,$len_B,$len_C,$bigMatrices)
BenchmarkTools.Trial: 575 samples with 1 evaluation per sample.
 Range (min … max):  7.213 ms … 36.534 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.683 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.692 ms Β±  1.626 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁ ▁▁▁ ▂▁▄▂ β–…β–ˆβ–†β–                                             
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–„β–„β–β–…β–„β–„β–„β–β–…β–β–…β–„β–β–β–β–β–β–β–β–β–β–„β–β–„β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–„ β–‡
  7.21 ms      Histogram: log(frequency) by time       14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

My doOperations14! is the same as your original “Slow” function but zeroes all 14 matrices instead of 9. I’m handling locality a bit differently from @sgaure, but it should be about the same. I used broadcasting, so I didn’t need @assert.

Thanks to everyone again for their suggestions!

@foobar_lv2 and @apo383: point taken. I apologize, I think my MWE was a bit too β€œminimal,” in the sense that it is not a true analog to what I am trying to do in my code. I was initially just trying to get to the core of what the slowdown was.

In practice, I iterate through points in a three (later five) dimensional space and solve for a series of objects that depend on those values, then store the results in the different matrices. Per my tests with that, both the AoS and the 4D approach perform similarly and much better than my original SoA code. The difference gets massively amplified once I try to parallelize the loops; the more reason to stick to one of the alternatives. The function is a small piece of something that I expect to run thousands of times, so I am looking for the micro-optimizations 🙂
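For reference, a minimal sketch of parallelizing the outer loop over the single-4-D-array layout (output index first, as discussed above; the zero write is a placeholder for the real computation, and all names are illustrative):

```julia
# Each thread owns a disjoint slice along `a`, so no writes conflict.
len_A, len_B, len_C, n_out = 15, 400, 1198, 9
all_out = zeros(n_out, len_C, len_B, len_A)

Threads.@threads for a in 1:len_A
    for b in 1:len_B, c in 1:len_C
        for k in 1:n_out
            all_out[k, c, b, a] = 0.0   # placeholder for the real computation
        end
    end
end
```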

@DNF, also point fully taken. The very silly function syntax there was my quick test in response to the earlier suggestion by @foobar_lv2. On matrix sizes, fully agreed: I do pass one explicitly within the original function because I preallocate and reuse the storage for an object that can either fully or partially occupy it (as in, a grid that can be of size either 1200 or 2000, depending on the parameters).

@tobydriscoll I was not aware of the DimensionalData package; it looks very promising for my needs. Thank you!