Understanding multi-dimensional array indexing performance

Hi there,

My understanding is that Julia uses column-major ordering to store a multi-dimensional array in linear memory. I was testing the performance of some hand-written functions that sum over the elements of a matrix using this fact, and I do not understand why the performance is worse when I use linear indexing. In particular, consider the following three functions:

function myfunc(matrix)
    s = 0.0
    for col in eachcol(matrix)
        s += sum(col)
    end
    return s
end

function myfunc2(matrix)
    s = 0.0
    for row in eachrow(matrix)
        s += sum(row)
    end
    return s
end

function myfunc3(matrix)
    s = 0.0
    for i in eachindex(matrix)
        s += matrix[i]
    end
    return s
end

With the following benchmarks:

julia> @benchmark myfunc($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.786 μs … 796.010 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.073 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.164 μs ±  12.309 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
julia> @benchmark myfunc2($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.208 μs …  1.323 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.887 μs ± 24.696 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
julia> @benchmark myfunc3($(rand(100,100)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.625 μs …  13.180 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.959 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.510 μs ± 137.969 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Since the Julia manual states:
"When exactly one index i is provided, that index no longer represents a location in a particular dimension of the array. Instead, it selects the ith element using the column-major iteration order that linearly spans the entire array. This is known as linear indexing. "

I was expecting myfunc() and myfunc3() to perform similarly.

Without @simd, the loop in myfunc3 must be executed serially so that the rounding is consistent with the expression as written. With @simd it would be fast but would have excessive rounding error. The sum function is not a simple loop; it balances speed and good rounding behavior, but it is slowed down by large strides, as you showed.

I don’t think that’s right. SIMD should improve rounding errors in most cases.


Imo, adding random floats does not have well-defined accuracy anyway, so the order does not really matter. Sure, you could use pairwise summation (which is what sum implements, IIRC) instead of linear summation. So actually the explicit loop adding elements one by one will have the worst roundoff error possible; thus I think @simd cannot make matters worse.

The thing with floating-point re-association has nothing to do with rounding error / accuracy compared to real numbers; it is rather about faithfulness to the code as written.

For example, x -> (x + c) - c can be used as a kind of rounding operation, unless somebody decides to reassociate the floating-point operations (or somebody decides to use autodiff). Cf. e.g. julia/base/special/exp.jl at d269d7d375827a0279dc1fee7bb24c9418f06f03 · JuliaLang/julia · GitHub and the linked discussion history of Make math-mode=fast no op, it's too dangerous by PallHaraldsson · Pull Request #41638 · JuliaLang/julia · GitHub
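
A minimal sketch of that trick (the magic constant is 2^52 + 2^51; the function name is made up for illustration):

# Round x to the nearest integer by adding and subtracting a constant large
# enough that the fractional bits fall off the end of the Float64 mantissa.
# Only valid for |x| well below 2^51, and only if the compiler honors the
# parentheses; reassociating (x + c) - c into x silently removes the rounding.
magic_round(x::Float64) = (x + 6755399441055744.0) - 6755399441055744.0

magic_round(2.3)    # 2.0
magic_round(2.7)    # 3.0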

I think I would like some kind of @volatile_math annotation that informs the compiler, the reader, and tooling that something definitely cannot be re-associated / is known to be ill-conditioned / that Int wrapping is intended. That is probably a hard requirement for enabling any kind of tooling to help detect such problems (if you want to gradually roll out a new safety feature, you must always start by defining how to opt out of it, and then annotate old / standard libraries with the opt-out!).

Not sure what you mean by this. Floating-point numbers, randomly chosen or not, are just rational numbers, so there is a well-defined exact answer (for any given set of inputs) that you can compare the results of a floating-point algorithm to.

It’s true that if the inputs are random, then the mean error growth will not change for any permutation of the inputs. Building up a single accumulator in sequence (i.e. s += element repeatedly, in any order of the elements) will have rms error growth O(\sqrt{n}) for n inputs. (In Float64 precision, this is often fine until n becomes huge.)

However, pairwise summation does not correspond to a single accumulator — it is a tree of accumulators. That’s why its rms error growth is much better, O(\sqrt{\log n}).
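
If it helps to see that tree of accumulators in code, here is a toy version of the recursion (not Base's actual implementation, which uses a much larger base case and a fast inner loop; this just shows the shape of the idea):

# Toy pairwise summation: split the index range in half, sum each half
# recursively, then add the two partial sums.
function pairwise_sum(v, lo = firstindex(v), hi = lastindex(v))
    hi - lo < 8 && return sum(@view v[lo:hi])   # small base case; the cutoff is arbitrary here
    mid = (lo + hi) ÷ 2
    return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid + 1, hi)
end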

A @simd loop will have the same O(\sqrt{n}) error growth rate as a regular loop, up to constant factors — but its constant factors are probably slightly better because it uses multiple accumulators. (Effectively, using 4 accumulators should be similar to doing only 2 recursive steps of pairwise summation.)
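
To make the multiple-accumulator point concrete, this is the kind of reassociation @simd permits, written out by hand with four scalar accumulators (a sketch only; it assumes the length is a multiple of 4, and the real vectorized code uses SIMD lanes rather than named variables):

# Four running sums, interleaved over the input, combined at the end.
# Each accumulator only sees n/4 additions, so its sequential error is
# smaller; @simd achieves a similar effect with vector registers.
function sum4(v)
    @assert length(v) % 4 == 0   # simplifying assumption for the sketch
    s1 = s2 = s3 = s4 = zero(eltype(v))
    for i in 1:4:length(v)
        s1 += v[i]
        s2 += v[i + 1]
        s3 += v[i + 2]
        s4 += v[i + 3]
    end
    return (s1 + s2) + (s3 + s4)
end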

As @foobar_lv2 wrote, the reason why @simd is not the default is that it changes the answer compared to the loop you wrote. The compiler is not trying to minimize the error compared to the exact sum, it’s trying to express the algorithm you wrote, and isn’t allowed to change this unless you give it permission.


Yes, that's what I meant, though badly phrased due to the concepts being a bit fuzzy in my head :slight_smile: So thanks for taking the time to spell it out once more. Besides, I always enjoy reading your explanations of numerical concepts that I probably should know better!

Just to demonstrate that the performance issue is SIMD-related:

function mysum(r)
    s = zero(eltype(r))
    for i in eachindex(r)
        s += r[i]
    end
    return s
end

function mysum_simd(r)
    s = zero(eltype(r))
    @simd for i in eachindex(r)
        s += r[i]
    end
    return s
end

Benchmark:

julia> r = rand(100,100);

julia> @btime sum($r)
  854.386 ns (0 allocations: 0 bytes)
5025.939581928298

julia> @btime mysum($r)
  8.900 μs (0 allocations: 0 bytes)
5025.939581928294

julia> @btime mysum_simd($r)
  568.108 ns (0 allocations: 0 bytes)
5025.939581928298

Note that the result is also more accurate for the SIMD sum, in this case at least (after some more tests, I see this varies a bit, but the SIMD version is mostly more accurate):

julia> sum(big, r) - mysum(r)
4.055866753560621873475611209869384765625e-12

julia> sum(big, r) - mysum_simd(r)
-4.91606755304019316099584102630615234375e-13