We’re running the same code on machine A and machine B (same Manifest file), and the results are not the same. We’re doing fairly straightforward linear algebra on StaticArrays. Is that a bug, or is it generally accepted that the implementation of these algorithms (LAPACK etc.) can vary from one machine to the next?

What kind of machines are they?

- OS
- CPU architecture
- Julia version
- …

I’m not sure; I’m asking more as a point of principle, to figure out whether I want to spend time looking for the difference. Does Julia generally guarantee that `f(x)` will return exactly the same value across architectures and OSes for its floating-point operations? Is there any function in Base where that’s not true? There is the `Int = Int32/Int64` split, which can cause a difference, and obviously multithreading can vary if we didn’t write our code right, but for single-threaded code is there anything else?

EDIT: to be clear, I’m talking about a 1e-8 difference in the final result of a calculation. It is probably a very small floating-point difference that was compounded over the large number of operations we perform.
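As a minimal illustration of how such last-bit differences arise (a generic example, not our actual code): floating-point addition is not associative, so any change in evaluation order can perturb the low bits, and those perturbations can then compound:

```
julia> (0.1 + 0.2) + 0.3
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)
0.6
```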

It won’t generally be true if you have `@simd` or `@fastmath`.

Optimal performance for some operations (e.g. reductions) will require different orderings on different machines, and this can change rounding.
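You can reproduce this reordering effect on a single machine (a sketch; `foldl` forces strict left-to-right summation, while `sum` uses a pairwise, SIMD-friendly order, and the exact values depend on the random data):

```
julia> x = rand(1000);

julia> foldl(+, x) == sum(x)  # frequently false: the two traversal orders round differently
```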

For example, on a machine with AVX512, starting Julia normally (this will use 256-bit vectors, as preferring them is the default):

```
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.893130742026266
```

Starting Julia with `-C"native,-prefer-256-bit"` (this means “do not prefer 256-bit vectors”, which will cause it to use 512-bit vectors):

```
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.89313074202627
```

Starting with `-C"native,prefer-128-bit"` (do prefer 128-bit vectors):

```
julia> using Random; Random.seed!(2);
julia> sum(rand(128))
59.89313074202625
```

Aside from vector width, the presence or absence of fused multiply-add (FMA) instructions will also make a difference. These instructions return the correctly rounded answer, rounding only once after the add rather than rounding the intermediate product and then rounding again after adding.

`muladd(a,b,c)` will normally (but not always) use FMA instructions when they are available; when they are unavailable it will instead compute the less accurate `a * b + c`, which rounds twice.
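The double-rounding difference is easy to see directly; in this sketch `2.0^-29` is chosen so that the low bits of the product are lost when it is rounded on its own:

```
julia> x = 1 + 2.0^-29;

julia> x * x - 1                      # x*x is rounded to 1 + 2^-28 before subtracting
3.725290298461914e-9

julia> fma(x, x, -1.0) == x * x - 1   # false: fma keeps the 2^-58 term of the exact product
false
```

Whether `muladd(x, x, -1.0)` matches the first or the second answer depends on the hardware it runs on, which is exactly the kind of cross-machine difference discussed above.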

`Base.exp` is much less accurate on systems without FMA instructions. As an aside, VectorizationBase/LoopVectorization’s `exp` should be similarly accurate whether or not FMA is available, as it switches to double-double arithmetic in that case to compensate; this means that in practice FMA instructions can sometimes provide a far larger performance gain than the 2x implied by just combining a multiplication and an addition. The gap would be even more extreme if using `Base.fma` instead of double-double arithmetic, as `Base.fma` imposes an even larger performance penalty when there is no hardware instruction to lower it to.
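For reference, the error-free product transform at the heart of double-double arithmetic can be sketched in a couple of lines (a simplified illustration, not the actual VectorizationBase implementation):

```
julia> function two_prod(a, b)
           hi = a * b
           lo = fma(a, b, -hi)  # exact rounding error of the product
           hi, lo               # hi + lo equals a*b exactly
       end;

julia> hi, lo = two_prod(1 + 2.0^-29, 1 + 2.0^-29);  # lo recovers the 2^-58 term lost to rounding
```

Without a hardware FMA instruction, that `lo` term is what a software `fma` has to reconstruct by other means, which is where the large penalty comes from.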