How to maximize replicability of a package

I’m developing a new package for a broad audience with prospective users valuing replicability/reproducibility of results.

On my laptop and desktop which have different versions of Ubuntu, different hardware (AMD versus Intel), different versions of Julia, different versions of included packages, etcetera, the results are identical.

But the github tests (nightly and 1.8) produce numbers that are identical to each other and fairly close, but not identical, to mine. Those machines also run Ubuntu, one runs 1.8, and I haven’t checked all package versions.

How can I maximize the likelihood that results can be replicated by others? (within reason)

Thanks!

1 Like

Here are the main things that come to my mind:

4 Likes

Thanks @gdalle. I’ll try StableRNGs (currently using Random123 on one thread with a specified seed if replicability is turned on). As for committing the manifest, I want other people to have results that are replicable, but not to constrain them.

Also check that the results do not depend on parallel execution (threads?).
For instance parallel loops may produce results with minor differences that would
depend on the number of threads used.

1 Like

Thanks @PetrKryslUCSD .

  1. it’s not the rng
  2. it’s not the number of Julia threads, even though I run nested dynamic threaded for loops
  3. the number of BLAS threads appears to matter, but that’s not the only issue

It can depend on your workload whether perfect reproducibility can be achieved reasonably well. Even if you have something as simple as matrix vector products of floating point numbers, the exact results of optimized implementations will depend on the CPU architecture you use, e.g., the type of SIMD instructions available. Thus, if you want to achive perfect reproducibility, you need to restrict everything to the least common denominator - do not use BLAS but plain loops without SIMD instructions, do not use any of the great performace packages such as LoopVectorization.jl etc. I am not sure if this is really what you want. For me, that wouldn’t be an option. That’s why I am happy with reproducibility up to floating point differences.
However, the setting can of course be different if you are only concerned about integer arithmetic.

1 Like

Thanks @ranocha. It’s not what I want, but what journals require. Several journals in my discipline now require replicability. LoopVectorization, Tullio, etc, are wonderful tools but not for that purpose. For one thing, they turn on fastmath.

I’m curious to see the precise wording of these policies.

Cross-platform bit-for-bit replicability is extraordinarily hard to achieve in modern scientific code because it would disallow the use of practically all high-performance libraries. Even a library function as simple as sum (in Julia, Numpy, and probably many other languages/libraries) will give slightly different results depending on whether the CPU has SIMD instructions.

4 Likes

Thanks @stevengj !

I don’t typically have to deal with such issues myself, but users of the package would. Here’s an example of the workflow at one prominent organization that runs several journals: Step by step guidance - Office of the AEA Data Editor They have a data editor who checks whether s/he can reproduce the results that you are reporting.

I looked through these pages, and it doesn’t seem like it gives a precise definition of “replicating” results. However, it says to “identify all outputs as per the README and the list of tables/figures/in-text numbers” and then “compare the output to the tables/figures/numbers in the manuscript”. Note that tables and numbers in the manuscript are very unlikely to be given to more than a few significant digits, and comparing to “figures” is even more imprecise.

So my conclusion from this policy is that they don’t require bit reproducibility, only reproducibility to the level of precision claimed in the manuscript.

4 Likes

True, that’s what I want to ensure.

One wants to make sure that a manuscript only reports accurate digits, i.e. digits that are insensitive to roundoff and other errors. But this is not a matter of bit reproducibility at all — it’s a question of numerical analysis.

(Even if floating-point roundoff errors — and other errors, such as truncation/discretization errors, etcetera — are exactly bit-reproducible, that doesn’t mean that all of the digits are actually significant/converged with respect to the underlying problem being solved!)

For the package author, helping users to do this would mean documenting methods/suggestions to estimate error bars on your return values. (e.g. you may directly return some error estimate, similar to packages like QuadGK, or you may suggest a procedure like repeating the calculation at different precisions or with different tolerance parameters or different random samples.)

4 Likes

It’s totally normal to see slightly different results on different hardware. The AEA policies are definitely NOT asking for digit-by-digit replicability. If in the paper, the author reports the effect of increasing X on Y is 0.512, for example, then nobody will consider there is a replicability issue if someone gets 0.51201 on their computer. No economist would even bother to ask for these extra digits.

Also, if you try Stata, which is widely used among economists, you will see that the numbers are not completely identical either. This is not something special to Julia.

Having a stored Manifest is fundamental to getting well defined versions of all dependencies and really something you want for reproducibility. And I don’t see how it’s constraining anyone. If someone thinks that the reproducibility isn’t all that important, all they have to do is Pkg.update and it’s like you didn’t provide that Manifest.

2 Likes

Of course. But are you implying that small numerical inconsistencies cannot accumulate to something substantial? That whenever e.g. fastmath and ieee math produce large numerical differences the end result produced by ieee math is unreliable anyway? I disagree.

Btw: I never mentioned bit-for-bit replicability; didn’t intend that.

Regardless, I know enough now. Thanks all.

Of course they can — the accumulation/growth/propagation of error is a major topic of study in numerical analysis. The “error bars” on the results of a numerical calculation are normally much larger than machine precision.

But it has almost nothing to do with whether different hardware etc. produces the same result — just because digits are reproducible doesn’t mean they are accurate! That’s why I think that most of this discussion has centered on a red herring.

2 Likes

Thanks @stevengj. Sounds like I’ve done the max I can reasonably do in this regard already.