Understanding multi-dimensional array indexing performance

Hi there,

My understanding is that Julia uses column-major ordering to store a multi-dimensional array in linear memory. I was testing the performance of some hand-written functions that sum over the elements of a matrix using this fact, and I do not understand why the performance is worse when I use linear indexing. In particular, consider the following three functions:

function myfunc(matrix)
    s = 0.0
    for col in eachcol(matrix)
        s += sum(col)
    end
    return s
end

function myfunc2(matrix)
    s = 0.0
    for row in eachrow(matrix)
        s += sum(row)
    end
    return s
end

function myfunc3(matrix)
    s = 0.0
    for i in eachindex(matrix)
        s += matrix[i]
    end
    return s
end

With the following benchmarks:

julia> @benchmark myfunc($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.786 μs … 796.010 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.073 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.164 μs ±  12.309 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
julia> @benchmark myfunc2($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.208 μs …  1.323 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.887 μs ± 24.696 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
julia> @benchmark myfunc3($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.625 μs …  13.180 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.959 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.510 μs ± 137.969 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Since the Julia manual states:
"When exactly one index i is provided, that index no longer represents a location in a particular dimension of the array. Instead, it selects the ith element using the column-major iteration order that linearly spans the entire array. This is known as linear indexing. "

I was expecting myfunc() and myfunc3() to perform similarly.

Without @simd, the loop in myfunc3 must be executed serially so that the rounding is consistent with the expression as written. With @simd it would be fast but would have excessive rounding error. The sum function is not a simple loop; it balances speed and good rounding behavior, but it is slowed down by large strides, as you showed.

I don’t think that’s right. SIMD should improve rounding errors in most cases.


Imo, adding random floats does not have well-defined accuracy anyway, so the order does not really matter. Sure, you could use pairwise summation (which is what sum implements, IIRC) instead of linear summation. So actually the explicit loop adding elements one by one will have the worst roundoff error possible; thus I think @simd cannot make matters worse.

The thing with floating-point re-association has nothing to do with rounding error / accuracy compared to real numbers; it is rather about faithfulness to the code as written.

For example, x -> (x + c) - c can be used as a kind of rounding operation, unless somebody decides to reassociate the floating-point operations (or somebody decides to use autodiff). Cf. e.g. julia/base/special/exp.jl at d269d7d375827a0279dc1fee7bb24c9418f06f03 · JuliaLang/julia · GitHub and the linked discussion history of Make math-mode=fast no op, it's too dangerous by PallHaraldsson · Pull Request #41638 · JuliaLang/julia · GitHub
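
A minimal sketch of that trick (the magic constant is 2^52 + 2^51; the function name is made up for illustration):

# Round x to the nearest integer by adding and subtracting a constant large
# enough that the fractional bits fall off the end of the Float64 mantissa.
# Only valid for |x| well below 2^51, and only if the compiler honors the
# parentheses; reassociating (x + c) - c into x silently removes the rounding.
magic_round(x::Float64) = (x + 6755399441055744.0) - 6755399441055744.0

magic_round(2.3)    # 2.0
magic_round(2.7)    # 3.0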

I think I would like some kind of @volatile_math annotation that informs the compiler, the reader, and tooling that something definitely cannot be re-associated / is known to be ill-conditioned / that Int wrapping is intended. That is probably a hard requirement for enabling any kind of tooling to help detect such problems (if you want to gradually roll out a new safety feature, you must always start by defining how to opt out of it, and then annotate old / standard libraries with the opt-out!).

Not sure what you mean by this. Floating-point numbers, randomly chosen or not, are just rational numbers, so there is a well-defined exact answer (for any given set of inputs) that you can compare the results of a floating-point algorithm to.

It’s true that if the inputs are random, then the mean error growth will not change for any permutation of the inputs. Building up a single accumulator in sequence (i.e. s += element repeatedly, in any order of the elements) will have rms error growth O(\sqrt{n}) for n inputs. (In Float64 precision, this is often fine until n becomes huge.)

However, pairwise summation does not correspond to a single accumulator — it is a tree of accumulators. That’s why its rms error growth is much better, O(\sqrt{\log n}).
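
If it helps to see that tree of accumulators in code, here is a toy version of the recursion (not Base's actual implementation, which uses a much larger base case and a fast inner loop; this just shows the shape of the idea):

# Toy pairwise summation: split the index range in half, sum each half
# recursively, then add the two partial sums.
function pairwise_sum(v, lo = firstindex(v), hi = lastindex(v))
    hi - lo < 8 && return sum(@view v[lo:hi])   # small base case; the cutoff is arbitrary here
    mid = (lo + hi) ÷ 2
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid + 1, hi)
end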

A @simd loop will have the same O(\sqrt{n}) error growth rate as a regular loop, up to constant factors — but its constant factors are probably slightly better because it uses multiple accumulators. (Effectively, using 4 accumulators should be similar to doing only 2 recursive steps of pairwise summation.)
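
To make the multiple-accumulator point concrete, this is the kind of reassociation @simd permits, written out by hand with four scalar accumulators (a sketch only; it assumes the length is a multiple of 4, and the real vectorized code uses SIMD lanes rather than named variables):

# Four running sums, interleaved over the input, combined at the end.
# Each accumulator only sees n/4 additions, so its sequential error is
# smaller; @simd achieves a similar effect with vector registers.
function sum4(v)
    @assert length(v) % 4 == 0   # simplifying assumption for the sketch
    s1 = s2 = s3 = s4 = zero(eltype(v))
    for i in 1:4:length(v)
        s1 += v[i]
        s2 += v[i + 1]
        s3 += v[i + 2]
        s4 += v[i + 3]
    end
    return (s1 + s2) + (s3 + s4)
end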

As @foobar_lv2 wrote, the reason why @simd is not the default is that it changes the answer compared to the loop you wrote. The compiler is not trying to minimize the error compared to the exact sum, it’s trying to express the algorithm you wrote, and isn’t allowed to change this unless you give it permission.


Yes, that's what I meant, though badly phrased due to the concepts being a bit fuzzy in my head :slight_smile: So thanks for taking the time to spell it out once more. Besides, I always enjoy reading your explanations of numerical concepts that I probably should know better!

Just to demonstrate that the performance issue is SIMD-related:

function mysum(r)
    s = zero(eltype(r))
    for i in eachindex(r)
        s += r[i]
    end
    return s
end

function mysum_simd(r)
    s = zero(eltype(r))
    @simd for i in eachindex(r)
        s += r[i]
    end
    return s
end

Benchmark:

julia> r = rand(100,100);

julia> @btime sum($r)
  854.386 ns (0 allocations: 0 bytes)
5025.939581928298

julia> @btime mysum($r)
  8.900 μs (0 allocations: 0 bytes)
5025.939581928294

julia> @btime mysum_simd($r)
  568.108 ns (0 allocations: 0 bytes)
5025.939581928298

Note that the result is also more accurate for the SIMD sum, in this case at least (after some more tests, I see this varies a bit, but the SIMD version is mostly more accurate):

julia> sum(big, r) - mysum(r)
4.055866753560621873475611209869384765625e-12

julia> sum(big, r) - mysum_simd(r)
-4.91606755304019316099584102630615234375e-13