Unit tests with random seed: local vs Travis

I am working on a library for Hamiltonian Monte Carlo methods (I am aware of other libraries, but I want to experiment with some new methods). It has some test where

  1. it samples from a distribution,
  2. then obtains statistics which should be standard normal if things are working correctly,
  3. the largest of these statistics is compared to a threshold.

As usual, the trade-off for these stochastic tests is between false positives and false negatives. My strategy is that I run locally, examine results when tests fail, but want CI tests on Travis to run smoothly if they run locally with the same random seed (yes, I like that green badge).

So my runtests.jl file begins with

"RNG for consistent test environment"
const RNG = srand(UInt32[0x23ef614d, 0x8332e05c, 0x3c574111, 0x121aa2f4])

where the argument is a “random” value.

My problem is the following: for the same version of Julia, I get different test outcomes on Travis, compared to local tests. Also, frequently tests pass locally, and fail on Travis (but this could be selection bias, since when they fail locally I don’t upload them).

I thought that setting the random seed would provide a consistent environment. Is this an unreasonable expectation? What can I do to ensure test consistency?

PS: locally

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

on Travis

Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
1 Like

I think that with a given seed, your tests should be reproducible. In your example, RNG becomes equal (===) to Base.Random.GLOBAL_RNG: is it possible that the global RNG is used somewhere else that you didn’t expect? Why not set const RNG = MersenneTwister(UInt32[...])?

1 Like

Yes, that’s what I thought, so I tried printing a hash of the rng. Not using Base.hash, since that just uses object ID, but

RNG = srand(UInt32[0x23ef614d, 0x8332e05c, 0x3c574111, 0x121aa2f4])
rand(RNG, 10000);
fingerprint(x) = hex(foldr(hash, zero(UInt64), getfield.(x, fieldnames(x))))

So here is the catch: the random numbers are the same, but the hash is different after each run. The way to reproduce it is to put it in a file, and run from scratch (eg julia script.jl) separately. Just pasting it into the REPL will always give the same result. But a new process will give a different result. I wonder what is going on.

So it seems to come from the fact that Base.dSFMT.DSFMT_state is hashed according to its object_id (i.e. the default). A solution would be to specialize hash(::Base.dSFMT.DSFMT_state, ::UInt64), and even hash(::MersenneTwister, ::UInt64) for that matter.

I have been debugging this all morning, printing out values of

peekrand(rng) = println("rng next $(hex(rand(copy(RNG), UInt64))")

and I am now suspecting that this is not related to the RNG at all, but subtle differences in floating point calculations, which diverge and at some point the difference accumulates to take the calculations to different branches, resulting in different number of rand calls, and thus the divergence.

  1. Is this even possible? That is to say, if two machines run the same version of Julia and packages, and both are 64-bit CPUs, can floating point still give different results?

  2. how can I verify that this is the case?

  3. any way to control this?

I have only ever seen such differences between vendors, i.e. I got slightly different results on AMD CPUs than on Intel CPUs. But if the CPU model is the same, and everything in your algorithm (including the RNG) is deterministic, then you should get identical results.

Highly speculative question: Is Travis using ECC memory?

For hash concerns, I just made a PR.

I have narrowed it down to an MWE, which I put in a repo.

It is only ~10 loc. Apparently when I construct a Distributions.MvNormal, the resulting object is different on different machines. It has nothing to do with random seeds or numbers.

Any help debugging this would be welcome, I am kind of stuck.

Even more specific MWE in the repo: cholfact gives a different result on my machine and Travis, even if versioninfo() is the same.

This is expected. BLAS, LAPACK, SuiteSparse, FFTW, can all produce slightly different results on different machines. Same with julia with @fastmath, muladd, @simd, etc.

@yuyichao: Thanks for the info. Could you elaborate where these differences are coming from? My guess is a) run-time detection of CPU features, so that the JIT chooses different instruction sets, b) run-time detection of cache sizes, so that BLAS et al choose different block sizes for their algorithms.

One of the goals of Julia is to facilitate reproducible science. By having Julia, all packages, and all dependencies under version control, we should be able to exactly reproduce the code which was used to produce some research result. In this spirit, could we have some tuning knobs which can make the low-level parts behave predictably, even between different machines?

1 Like

All of above. All the features/libraries I mentioned above make calculation that are “mathematically correct” but may not be the same as IEEE standard. FWIW, I don’t think libm functions are completely reproducible either though few (none?) currently have code producing different result depending on runtime env.

Or I guess I should say there’s no standard saying what they should return so they can be as accurate/inaccurate as they want. If you are doing simple arithmetic though, returning a result one ulp off would be wrong.

Totally agree deterministic runs are definitely something to strive for (but alas this holy grail is often out of reach without a LOT of work). Two ideas which I think are useful:
(1) To always use a non-global RNG for all the work in the package. This protects against the sins of other packages which might be less deterministic or create side-effects on the global RNG.

(2a) When doing Floating-point you essentially forgo determinism. The intricacies and delicacies are too much.

(2b) If you still need determinism restored, you need to do some rounding (and praying). Rounding can for example be with:

trunceps(x) = trunc(x/sqrt(eps(typeof(x))))*sqrt(eps(typeof(x)))

This definition can be broadcast to Arrays and uses the same default as isapprox. Of course, a proper definition would have an eps argument etc.
With this definition:

using Base.Test
trunceps(x) = trunc(x/sqrt(eps(typeof(x))))*sqrt(eps(typeof(x)))

A = rand(10,10)
B = asin.(sin.(A))
@test A != B
@test trunceps.(A)==trunceps.(B)

With a reasonable eps the numerical errors can be smoothed out for good algorithms.

Thanks, this is useful. My problem is that the result of my algorithm is nonlinear, in the sense that as errors accumulate, different branches get executed once in a while (after about 10⁷-10¹⁰ floating point operations). This is compounded by the fact that I am testing a stochastic algorithm, and balancing between false negatives and positives.

Thanks for the explanations, I learned a lot from this topic.