How to achieve numerical reproducibility across platforms for a scientific replication package

I am preparing the replication package for an academic publication, and I am running into an issue: my code gives different results on my local Windows 11 machine and on my university's Unix servers. I am using the same version of Julia (1.9.2) and the exact same environment (see below).

The issue is that my model computation function sometimes produces results that are very slightly different on Windows vs. Unix. I am estimating the parameters of the model (using curve_fit from LsqFit.jl), and these initial differences seem to propagate and eventually lead to different results. (These do not seem to be major differences that would change any substantive conclusion, but they are numerically noticeable: at the 1e-1 or 1e-2 level for parameters on the order of 1e2.)

I hesitate to post my code here because it is long and complicated, but the key parts seem mostly package-free (they compute many exp, log, and sums, plus some iterations to a given precision). I do use NonNegLeastSquares.jl at some point, although I checked that the problem also exists without it.

I have read (with horror) that this happens in certain linear algebra and other libraries.

Is there a way to ensure numerically exact results cross-platform in Julia?
Or, more specifically in my case, is there any advice on how I might go about making my code reproducible?

Here is my environment:

  [fbb218c0] BSON v0.3.7
  [6e4b80f9] BenchmarkTools v1.3.2
  [b4567568] CRC32 v1.0.1
  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.6.1
⌃ [31c24e10] Distributions v0.25.99
⌃ [26cc04aa] FiniteDifferences v0.12.29
  [9d5cd8c9] FixedEffectModels v1.9.4
⌃ [f6369f11] ForwardDiff v0.10.35
⌃ [da1fdf0e] FreqTables v0.4.5
⌃ [38e38edf] GLM v1.8.3
  [c27321d9] Glob v1.3.1
  [842dd82b] InlineStrings v1.4.0
  [a98d9a8b] Interpolations v0.14.7
  [682c06a0] JSON v0.21.4
  [d3d80556] LineSearches v7.2.0
⌃ [4345ca2d] Loess v0.6.1
  [2fda8390] LsqFit v0.13.0 `https://github.com/JuliaNLSolvers/LsqFit.jl#master`
  [b7351bd1] NonNegLeastSquares v0.4.0
⌃ [429524aa] Optim v1.7.6
⌃ [91a5bcdd] Plots v1.38.17
⌃ [f2b01f46] Roots v2.0.17
⌃ [276daf66] SpecialFunctions v2.3.0
  [1463e38c] StatFiles v0.8.0
⌅ [2913bbd2] StatsBase v0.33.21
  [40c74d1a] TableView v0.7.2
  [ade2ca70] Dates
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [10745b16] Statistics v1.9.0

Any randomness? Do you only use IEEE arithmetic?

Might as well post the code.

Sure, I am attaching the key code file.

functions-gmm-full.jl (36.7 KB)
functions-model.jl (60.2 KB)
functions-model-definitions.jl (17.2 KB)
gmm_wrappers.jl (16.6 KB)

  • To estimate parameters, I call curve_fit from LsqFit.jl in the file gmm_wrappers.jl. See below the first round of results, which are very slightly different. These differences later grow.
  • The key function that computes the model is model_dt_dynamic() in functions-model-definitions.jl.
  • I've used checksums to verify that DATA and PARAMS are identical on Windows/Linux.
  • Unless LsqFit is generating slightly different theta inputs, those should be identical as well.
  • I tried a couple of times to use the values of theta below directly, and I do seem to get the same result on Windows/Linux. I did not do this systematically.
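A minimal sketch of the kind of checksum comparison meant above, using Julia's SHA standard library (the file path shown is hypothetical; any input file works):

```julia
# Compare cryptographic hashes of the raw bytes on each machine;
# identical hex strings mean byte-identical inputs.
using SHA  # Julia standard library

file_digest(path::AbstractString) = bytes2hex(sha256(read(path)))

# e.g. file_digest("DATA.csv") on both Windows and Linux (hypothetical path)

# The same idea on in-memory data:
digest = bytes2hex(sha256(Vector{UInt8}("some input data")))
println(digest)   # run on both machines and compare this string verbatim
```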
Results on local machine (Windows):

[1st stage] iter 1: starting
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718475638708116
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718475638708116
theta [102.12199794490081, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718536449015974  <----------
theta [102.1207611621758, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571841482194174
theta [102.1213795535383, 108.68133385621667, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718635328641316
theta [102.1213795535383, 108.68001763445301, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571831594289727    <----------
theta [102.1213795535383, 108.68067574533484, 6.0554544523933395e-6, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571847411729412
theta [102.1213795535383, 108.68067574533484, -6.0554544523933395e-6, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.0457184771601207
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54554413899048, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718362519457996   <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.5451257635858, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571858875926351     <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.283430119697144, 22.264706885783603, 13.620450125437278] 0.04571846316407449
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28280903450759, 22.264706885783603, 13.620450125437278] 0.045718488113449605   <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264841708702047, 13.620450125437278] 0.04571843169465825
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.26457206286516, 13.620450125437278] 0.0457185195832849
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620532603452634] 0.04571842651225944
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620367647421922] 0.04571852475945836
     0     4.571848e-02              NaN
 * lambda: 10.0


Results on server (Unix):

[1st stage] iter 1: starting
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718475638708116
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718475638708116
theta [102.12199794490081, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571853644901597   <----------
theta [102.1207611621758, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571841482194174
theta [102.1213795535383, 108.68133385621667, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718635328641316
theta [102.1213795535383, 108.68001763445301, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.045718315942897284   <----------
theta [102.1213795535383, 108.68067574533484, 6.0554544523933395e-6, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571847411729412
theta [102.1213795535383, 108.68067574533484, -6.0554544523933395e-6, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.0457184771601207
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54554413899048, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.04571836251945799    <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.5451257635858, 51.28311957710237, 22.264706885783603, 13.620450125437278] 0.0457185887592635      <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.283430119697144, 22.264706885783603, 13.620450125437278] 0.04571846316407449
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28280903450759, 22.264706885783603, 13.620450125437278] 0.0457184881134496     <----------
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264841708702047, 13.620450125437278] 0.04571843169465825
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.26457206286516, 13.620450125437278] 0.0457185195832849
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620532603452634] 0.04571842651225944
theta [102.1213795535383, 108.68067574533484, 0.0, 34.54533495128814, 51.28311957710237, 22.264706885783603, 13.620367647421922] 0.04571852475945836
     0     4.571848e-02              NaN
 * lambda: 10.0

No randomness that I could find.
I don’t know exactly what FiniteDiff.jl does. LsqFit.jl uses this to compute gradients numerically.
In any case, I set Random.seed!(123) just before I start the optimization. So my guess is that this is not what’s going on.
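Incidentally, the ±6.0554544523933395e-6 perturbations visible in the logs above are exactly cbrt(eps(Float64)), the classic step size for central differences. A minimal sketch of the idea (this is the textbook scheme, not FiniteDiff.jl's actual implementation):

```julia
# Textbook central difference; not FiniteDiff.jl's actual code.
h = cbrt(eps(Float64))              # ≈ 6.0554544523933395e-6, matching the logs
central_diff(f, x) = (f(x + h) - f(x - h)) / (2h)

g = central_diff(sin, 0.0)          # true derivative is cos(0.0) = 1.0
```

Each f(x ± h) evaluation here is a full model run, so any platform-dependent last-bit difference in those runs feeds straight into the computed gradient.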

I suspect that differences in hardware may be more relevant here than the difference in operating system. What processors do you have on the respective systems?


I’ve read in the docs that, at least across Julia versions, random number generation is not guaranteed to be exactly reproducible. I’m not sure whether the same applies across platforms. My guess would be that the numbers are the same on both platforms anyway, but just to be sure you could check.
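A quick way to do that check, assuming the seed mentioned earlier in the thread: print a few draws on each machine and diff the output verbatim:

```julia
using Random

Random.seed!(123)   # same seed as used before the optimization
draws = rand(5)
println(draws)      # run on both machines and compare the printed vector
```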

On Hardware, Windows 11 Enterprise on:

Processor	12th Gen Intel(R) Core(TM) i9-12900KF   3.19 GHz
Installed RAM	128 GB (128 GB usable)
System type	64-bit operating system, x64-based processor

vs for the university server, all I can easily find is this

Intel "Cascade Lake"

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Stepping:            7
CPU MHz:             2900.000
CPU max MHz:         3900.0000
CPU min MHz:         1200.0000
BogoMIPS:            5800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities


I generate (many) random parameter initial conditions and they look identical on the two systems.


I feel like this might be a dick comment, but it’s important to raise nonetheless: I’d be more worried about why machine-precision differences in intermediate calculations cause relative differences of 1e-4 in the final result.

Is the algorithm stable? If you perturb your initial data by machine precision (or even expected experimental error), does the result vary by a similar magnitude?

I haven’t used any in anger, but there are various Monte Carlo arithmetic packages you could try or, if computationally feasible, you could try interval arithmetic.
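The perturbation test suggested above can be sketched in a few lines; the model function here is a hypothetical stand-in, not the poster's actual code:

```julia
using Random
Random.seed!(1)

f(x) = sum(exp.(log.(x)))                       # hypothetical stand-in model
x0 = 1.0 .+ rand(100)
x1 = x0 .* (1 .+ eps() .* randn(length(x0)))    # relative perturbation ~ 2.2e-16

rel_change = abs(f(x1) - f(x0)) / abs(f(x0))
# A stable computation keeps rel_change near machine precision,
# far below the ~1e-4 relative differences reported in the thread.
```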


If there is randomness in the process, then trying different seeds and obtaining a distribution of answers would be a ‘scientific’ way forward. Even if the algorithm is deterministic, adding noise to the input measurements (which in most cases is realistic) and obtaining a distribution of results is, again, ‘scientific’.
Once there is a distribution, rounding the mean result to an acceptable number of significant digits would make it identical across implementations.
Finally, adding a small measure of uncertainty would not detract from the results.
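A toy version of this idea, with a hypothetical one-parameter "estimator" g standing in for the full estimation routine:

```julia
using Random, Statistics
Random.seed!(0)

g(x) = x^2 + 1.0                                   # hypothetical estimator
noisy = [g(2.0 + 1e-8 * randn()) for _ in 1:1000]  # input noise at the 1e-8 level

reported = round(mean(noisy); digits=6)            # keep only the stable digits
# reported comes out as 5.0, even though the individual noisy results
# differ in their trailing digits.
```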


I also experienced pretty different results depending on the OS in the unit tests of my package KiteModels.jl (https://github.com/ufechner7/KiteModels.jl, kite and tether models for the simulation of kite power systems).

I just made the allowed error margins very large, but I don’t think that is the correct way to handle these issues…


To put it more bluntly: the quest for exact numerical reproducibility with floating-point numbers is often very difficult, and in most cases not worth the effort. If small differences in input, or differences that should be inconsequential (BLAS library, OS, architecture, etc.), cause large differences in output, then either there’s a bug in one of the libraries or your algorithm is unstable.


This is fair.

Exact reproducibility seems simplest and most desirable for an academic replication package. Sure, it’s possible to argue that the results are “very similar,” but if they are exactly the same the check is much simpler.

My guess is that with the optimization of complicated functions, it is perhaps not surprising that over dozens or hundreds of iterations, small differences in inputs amplify (possibly even leading to different local optima).
In my application, I run the entire estimation (optimization) routine from many random initial conditions and then take the best result. So, in some sense, I am already following the spirit of the advice here. The 1e-4 error is for a single set of initial conditions.

I was just not aware of these issues with floating-point operations on different hardware.


There are lots of things that may turn out differently on different hardware, or sometimes even on the same hardware. To give just two examples:

  1. The muladd function computes either (x * y) + z with normal floating-point rounding or a fused multiply-add without intermediate rounding, depending on which is fastest on the hardware.
  2. Parallel reductions with multiple threads can give different results depending on how the computations get distributed between the threads.
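Both effects can be seen directly in the REPL. Here fma forces a single rounding (like a fused muladd), and foldl forces strict left-to-right summation (standing in for one particular thread schedule); the specific values are just illustrative:

```julia
# 1. Fused vs. unfused multiply-add: fma does a single rounding.
x = 0.1
p = x * x                  # product rounded to Float64
r = fma(x, x, -p)          # recovers the exact rounding error of that product
# r != 0, so muladd(x, x, z) can differ from x * x + z in the last bit.

# 2. Summation order: pairwise sum vs. strict left-to-right fold.
using Random
Random.seed!(42)
xs = rand(10_000)
s_pairwise = sum(xs)       # Base.sum uses pairwise summation
s_left     = foldl(+, xs)  # sequential left-to-right
# The two agree only up to rounding; a threaded reduction is yet
# another association of the same additions.
```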

In this case it’s LsqFit.jl. That library just shouldn’t be used. I have been meaning to build a new curve-fitting library on top of NonlinearSolve.jl because of how often this comes up.


Can you say a bit more about what goes wrong with LsqFit.jl for numerical stability?

It fails in a lot of cases where BFGS works fine. Its Levenberg-Marquardt implementation doesn’t seem tuned right.


Has that problem been reported? I could not find a matching issue in the LsqFit tracker.
Would it be preferable to use CMPFit.jl instead?

Regarding reproducibility: I recently compiled and ran a 20+-year-old Fortran Sturm-Liouville solver on a modern machine and compiler, and could reproduce the results reported in the associated paper almost perfectly. Most results matched in all ~13-14 digits; I think some differed only in the last ones.
I was impressed. 🙂


I’ve mentioned it on Discourse and Slack more than a few times. I don’t think it needs more ink, just a solution, so I’ll go make one.
