We’re running the same code on machine A and machine B (same Manifest file), and the results are not the same. We’re doing fairly straight-forward linear algebra on StaticArrays. Is that a bug, or is it generally accepted that the implementation of these algorithms (LAPACK etc.) can vary from one machine to the next?
What kind of machines are they?
- OS
- CPU architecture
- Julia version
- …
I’m not sure; I’m asking more from a point of principle, to figure out if I want to spend time looking for the difference. Is julia generally guaranteeing that f(x)
will return exactly the same value across architecture/OS, for its floating point operations? Is there any function in Base where that’s not true? There is the Int = Int32/Int64
split which can cause a difference, and obviously multithreading can vary if we didn’t write our code right, but for single-thread code is there anything else?
EDIT: to be clear, I’m talking about a 1e-8 difference in the final result of a calculation. It is probably a very small floating point difference, that was compounded over the large number of operations we perform.
It won’t generally be true if you have @simd
or @fastmath
.
Optimal performance for some operations (e.g. reductions) will require different orderings on different machines, and this can change rounding.
For example, on a machine with AVX512, starting Julia normally (this will use 256 bit vectors, as preferring them is the default):
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.893130742026266
Starting Julia with -C"native,-prefer-256-bit"
(this means “do not prefer 256 bit vectors”, which will cause it to use 512 bit vectors):
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.89313074202627
Starting with -C"native,prefer-128-bit"
(do prefer 128 bit vectors):
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.89313074202625
Aside from vector width, the presence/absence of fused multiply add instructions will also make a difference. These instructions return the correctly rounded answer (i.e., not rounding the intermediate product and then rounding again after adding).
muladd(a,b,c)
will normally (but not always) use fma instructions when available, and when unavailable will instead compute the less accurate a * b + c
.
Base.exp
is much less accurate on systems without FMA instructions. As an aside, VectorizationBase/LoopVectorization’s should be similarly accurate whether or not FMA is available, as it switches to double-double arithmetic in that case to compensate; this means that in practice FMA instructions can sometimes provide a far larger performance than just the 2x implied by just combining an addition and subtraction. This would be even more extreme if using Base.fma
instead of double-double arithmetic, as Base.fma
imposes an even larger performance penalty.