My understanding is that Julia uses column-major ordering to store a multi-dimensional array in linear memory. I was testing the performance of my own functions that sum the elements of a matrix using this fact, and I do not understand why the performance is worse when I use linear indexing. In particular, consider the following three functions:
function myfunc(matrix)
    s = 0.0
    for col in eachcol(matrix)
        s += sum(col)
    end
    return s
end

function myfunc2(matrix)
    s = 0.0
    for row in eachrow(matrix)
        s += sum(row)
    end
    return s
end

function myfunc3(matrix)
    s = 0.0
    for i in eachindex(matrix)
        s += matrix[i]
    end
    return s
end
Since the Julia manual states:
"When exactly one index i is provided, that index no longer represents a location in a particular dimension of the array. Instead, it selects the ith element using the column-major iteration order that linearly spans the entire array. This is known as linear indexing. "
I was expecting myfunc() and myfunc3() to perform similarly.
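One way to time such a comparison (a sketch assuming BenchmarkTools.jl is available; the matrix size is arbitrary):

using BenchmarkTools

A = rand(2000, 2000)
@btime myfunc($A)    # column views: contiguous memory
@btime myfunc2($A)   # row views: strided memory
@btime myfunc3($A)   # linear indexing, explicit accumulation loop
@btime sum($A)       # built-in pairwise sum, for reference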
Without @simd, the loop in myfunc3 must be executed serially so that the rounding is consistent with the expression as written. With @simd it would be fast, but the rounding would no longer match the loop as written. The sum function is not a simple loop; it balances speed and good rounding behavior, but it is slowed by large strides, as you showed.
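To illustrate the stride point (a small sketch, not from the measurements above): in a column-major matrix, a row view is strided, so summing it touches memory much less favorably than summing a contiguous column view.

A = rand(1000, 1000)
sum(@view A[1, :])   # row view: stride of 1000 elements between consecutive reads
sum(@view A[:, 1])   # column view: contiguous, stride 1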
Imo, adding random floats does not have a well-defined accuracy anyway, so the order does not really matter. Sure, you could use pairwise summation (which is what sum implements, IIRC) instead of linear summation. So the explicit loop adding elements one by one will actually have the worst roundoff error possible. Thus I think @simd cannot make matters worse.
The thing with floating-point re-association has nothing to do with rounding error / accuracy compared to real numbers; it is rather about faithfulness to the code as written.
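The classic illustration: floating-point addition is not associative, so regrouping a sum changes the result even though neither grouping is "wrong" relative to the real-number answer.

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false in Float64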
I think I would like some kind of @volatile_math annotation that informs the compiler, the reader, and tooling that something definitely cannot be re-associated / is known to be ill-conditioned / that Int wrapping is intended. That is probably a hard requirement for any kind of tooling that helps detect such problems (if you want to gradually roll out a new safety feature, you must always start by defining how to opt out of it, and then annotate old / standard libraries with the opt-out!).
Not sure what you mean by this. Floating-point numbers, randomly chosen or not, are just rational numbers, so there is a well-defined exact answer (for any given set of inputs) that you can compare the results of a floating-point algorithm to.
It’s true that if the inputs are random, then the mean error growth will not change for any permutation of the inputs. Building up a single accumulator in sequence (i.e. s += element repeatedly, in any order of the elements) will have rms error growth O(\sqrt{n}) for n inputs. (In Float64 precision, this is often fine until n becomes huge.)
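To make the comparison against the exact answer concrete, here is a small sketch (my own example, using Float32 so the error growth is easy to see at moderate n) that measures the error of a single sequential accumulator against a reference computed in extended precision:

# Sequential accumulation with a single accumulator
function seq_sum(x)
    s = zero(eltype(x))
    for v in x
        s += v              # one accumulator, updated element by element
    end
    return s
end

x = rand(Float32, 10^6)
ref = sum(big.(x))                                  # reference sum in BigFloat precision
println("sequential error:    ", abs(big(seq_sum(x)) - ref))
println("pairwise sum(x) error: ", abs(big(sum(x)) - ref))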
However, pairwise summation does not correspond to a single accumulator — it is a tree of accumulators. That’s why its rms error growth is much better, O(\sqrt{\log n}).
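For illustration, a minimal recursive form of that tree of accumulators could look like the following sketch (Base's sum uses a more optimized, blocked variant):

# Recursive pairwise summation: split the range in half, sum each half, add the two partial sums
function pairwise_sum(x, lo=firstindex(x), hi=lastindex(x))
    if hi - lo < 128                      # small base case: plain sequential loop
        s = zero(eltype(x))
        for i in lo:hi
            s += x[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1
        return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
    end
end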
A @simd loop will have the same O(\sqrt{n}) error growth rate as a regular loop, up to constant factors — but its constant factors are probably slightly better because it uses multiple accumulators. (Effectively, using 4 accumulators should be similar to doing only 2 recursive steps of pairwise summation.)
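As a rough picture of that reassociation, here is a hand-written sketch with four scalar accumulators (the compiler actually does this with vector registers, and may use more lanes):

function sum4acc(x)
    s1 = s2 = s3 = s4 = zero(eltype(x))
    i = firstindex(x)
    while i + 3 <= lastindex(x)
        s1 += x[i]; s2 += x[i+1]; s3 += x[i+2]; s4 += x[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)   # combining the accumulators looks like two levels of a pairwise tree
    while i <= lastindex(x)     # handle the remainder
        s += x[i]
        i += 1
    end
    return s
end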
As @foobar_lv2 wrote, the reason why @simd is not the default is that it changes the answer compared to the loop you wrote. The compiler is not trying to minimize the error compared to the exact sum, it’s trying to express the algorithm you wrote, and isn’t allowed to change this unless you give it permission.
Yes, that's what I meant, though it was badly phrased because the concepts are a bit fuzzy in my head, so thanks for taking the time to spell it out once more. Besides, I always enjoy reading your explanations of numerical concepts that I probably should know better!
Just to demonstrate that the performance issue is SIMD-related:
function mysum(r)
    s = zero(eltype(r))
    for i in eachindex(r)
        s += r[i]
    end
    return s
end

function mysum_simd(r)
    s = zero(eltype(r))
    @simd for i in eachindex(r)
        s += r[i]
    end
    return s
end
Note that the result is also more accurate for the SIMD sum, in this case at least (after some more tests, I see this varies a bit, but the SIMD version is mostly more accurate):