Same code, very different results with Julia 1.4.2 and Julia 1.5.2

Hi.

I am running a very long piece of code with finite elements, Monte Carlo simulation (a parallel loop) and topology optimization. After moving to Julia 1.5.2, I observed a drastic change in all my solutions. Just to exemplify, this is the result obtained with julia 1.5.2 -O3 --check-bounds=no, 10 processes

[image: result with Julia 1.5.2, -O3 --check-bounds=no, 10 processes]

and this is the result obtained with julia 1.4.2, -O3 --check-bounds=no, 10 processes as well

I imagine that something changed between 1.4.2 and 1.5.2 that impacts floating-point computations. I know that can happen, but I really did not expect such a large difference.

I am not using any external package to perform MCS and FEA, and the linear system of equations is solved by LU, with a sparse and complex stiffness matrix.

Could it be a change in LLVM, OpenBLAS or UMFPACK between Julia’s releases? Is there any test I can perform to hunt it down?
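For reference, the toolchain and library versions that each install ships with can be compared with (a sketch):

using InteractiveUtils, LinearAlgebra

versioninfo(verbose = true)   # Julia, LLVM, OS and CPU details of this install
BLAS.vendor()                 # :openblas64 on the stock binaries (Julia 1.4/1.5)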

Thank you very much.


Could the change be in how the numbers correspond to colors? Another possibility is that the random number generator is different, so if your simulations are sensitive to the specific numbers being generated, that could be it.


No… the visualization is also performed by me. And the sampling is done at a very large number of points, so it should not be related to the random numbers. I checked the histograms.

Thanks!

Are you using the same versions of your libraries?


Could a small change, like a change in the order of summation/accumulation or a difference in the details of a library, have sent you to a different local minimum? Are the values of the objective function close for the two optimizations?

Have both simulations converged (whatever that means for you)? The 1.5.2 result looks like an unconverged version of the 1.4 result.

@Oscar_Smith’s comment could be spot on. Are the random number generators the same? Can you fix it so that the random number generators are the same via a 3rd source. You’d likely get a third solution, but would identify the problem. That has happened to me with randomized methods before. My solution was to live with it and tell people that it was a feature of the method.


@ctkelley and @Oscar_Smith. I was checking that right now. Both are fresh installs with

[31c24e10] Distributions v0.23.12
[92933f4c] ProgressMeter v1.4.0
[90137ffa] StaticArrays v0.12.4

which are all the packages I need. I am also using the same “random” vector to generate 1e6 realizations, and I am pre-processing it the same way (it is the same program).

Regarding convergence: it is a very nonlinear and nonconvex problem, so I was also wondering about that. It really seems to be a different solution path caused by the non-associativity of floating-point arithmetic, since everything else is the same (packages, code, hardware).

Thanks for all the suggestions. I will keep trying to figure out what is going on.

Do you seed the RNG? Perhaps try it on both 1.4 and 1.5 with StableRNGs.jl instead of the builtin RNG (whose generated values have changed and will continue to change between releases). That seems more likely than floating-point associativity effects.
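For example (a sketch, assuming the realizations come from ordinary rand calls; the seed value is arbitrary):

using StableRNGs   # ] add StableRNGs

rng = StableRNG(1234)        # seeded generator, stable across Julia versions
x = rand(rng, 1_000_000)     # pass rng explicitly everywhere randomness is used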


@mbauman
OK… it got even weirder. I tested on a Windows machine with an Intel processor (i5), Julia 1.4.2 and 1.5.2, builtin RNG. Both results are correct. The different results are obtained on a Linux machine (Fedora 32) with a Threadripper 3990X processor. So it does not seem to be related to the RNG.


Are you using the same BLAS / LAPACK libraries, or are you using MKL on the Intel chips?

Unfortunately there are also hardware differences… Do your Intel chips have AVX-512? See https://github.com/scikit-learn/scikit-learn/issues/16097#issuecomment-606834582 for example.


Plain vanilla Julia, downloaded directly from the site. No MKL or any other external library. The Intel processor (Windows machine) is an old i5, so no AVX-512 either.

Thanks for the link.


Is the problem instance small enough that you can quickly try things out? Otherwise you probably want to reproduce the issue at a smaller scale.

With a small problem you could start dumping all intermediate results and check where things start to diverge, as sketched below. Of course, make sure that you are using the exact same random sequence.
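A rough sketch of such a dump (the checkpoint helper and its tags are hypothetical):

using Printf, Serialization

# print a quick scalar fingerprint and dump the full object for later bitwise diffing
function checkpoint(tag::AbstractString, x)
    @printf("%-24s sum(abs) = %.17g\n", tag, sum(abs, x))
    serialize("ckpt_$(tag).jls", x)   # reload with deserialize and compare across runs
    return x
end

# usage (hypothetical variable names): K = checkpoint("stiffness", K); u = checkpoint("solution", u)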


That is possibly not deterministic in general, since parallel hardware isn’t (I’m not sure how carefully Julia guarantees determinism; it does for some things, but programmers can usually use unsafe constructs, and I’m not sure OpenBLAS gives any guarantees either). Can you try running it on both versions serially?

Also try with OpenBLAS limited to a single thread (I needed this once to call Octave, which uses OpenBLAS and otherwise crashed on inv):

ENV["OPENBLAS_NUM_THREADS"] = "1"
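Note that the environment variable is typically read when OpenBLAS initializes, so the safest option is to set it in the shell before launching Julia. A runtime alternative (a sketch) is:

using LinearAlgebra
BLAS.set_num_threads(1)   # force single-threaded BLAS for a deterministic comparison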


Thanks @lostella.
Indeed, I am trying to isolate the problem. Unfortunately, the code is very large and takes some time to run. What is bugging me is the difference between Windows/Intel and Linux/AMD. I also have to perform some tests with parallel vs. serial runs, as suggested by @Palli, in both combinations.

Thank you very much for all the suggestions.

A good clue may be comparison of the Windows results with the Linux results. Do any of these pairs agree?


Hi @PetrKryslUCSD

The 1.4.2 results match exactly on Linux (AMD) and on Windows (Intel), using -O3 --check-bounds=no and parallel processing (4 threads on the Windows machine and 10 on the Threadripper).

The Windows 1.5.2 results also match the expected result. So far the only problem is Linux/AMD with 1.5.2.

Is your code generic w.r.t. the type of numbers used for computations? I.e., would it be feasible to double the precision of all FP numbers and see what happens? (Or to switch to “exotic” FP number implementations such as StochasticArithmetic.jl?)

Another suggestion: so far we have advised you to make sure the random numbers are exactly the same in all cases. But I don’t think we have explored the other end of the spectrum: what happens if you change the random seed for all calculations? Do they converge to the same solution?
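On the first point, here is a minimal sketch (with a hypothetical kernel, not your actual code) of what “generic in the element type” looks like, so the precision can be doubled without touching the algorithm:

# hypothetical accumulation kernel, generic in the real type T
function accumulate_energy(c::AbstractVector{Complex{T}}, a::AbstractVector{T}) where {T<:Real}
    s = zero(T)
    for i in eachindex(c, a)
        s += a[i] * abs2(c[i])
    end
    return s
end

c = rand(ComplexF64, 10_000); a = rand(10_000)
accumulate_energy(c, a)                                  # Float64 run
accumulate_energy(Complex{BigFloat}.(c), BigFloat.(a))   # same code, higher precision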


For a large number of samples in the MCS we obtain the “same” result regardless of the number of runs, for both 1.4.2 and 1.5.2 on the Windows/Intel machines, and likewise for 1.4.2 on the Linux/AMD machine.
The “same” result here means the same topology and very close values for the expected value and the standard deviation of the solution.

All the code is strongly typed for Float64, ComplexF64 and Int64. Right now I am trying to write a similar, smaller code to help me understand the difference. But there is nothing in the underlying math that could justify the large difference we are observing between 1.4.2 and 1.5.2 on the Linux/AMD machine, even considering the use of MCS (given the large number of samples and the careful analysis of the input histograms). Thanks for your help.

Thanks! I’d say this rules out issues regarding the pRNG, no?


I can only encourage you to make your code generic w.r.t. the precision of numbers: this can help in a number of situations, including this one.

In the meantime, if you still suspect FP issues, you could try using a FP diagnosis tool such as Verrou (disclaimer: I’m one of the two main authors of this tool). Verrou is a Valgrind-based tool, which makes its basic features completely language-agnostic: it should work seamlessly for your Julia program.

In the simplest approach, you’d just need to invoke julia within the context of verrou, asking it to perturb all FP operations in such a way that each intermediate result is randomly rounded either upwards or downwards (instead of always rounding to the nearest FP value, like what is done by default). See the example session below to get a grasp of what this does: the same calculation performed several times does not produce the same result each time (and each one of these results is likely a bit different from the one obtained under standard IEEE-754 arithmetic).

> valgrind --tool=verrou --rounding-mode=average julia -q
==53062== Verrou, Check floating-point rounding errors
==53062== Copyright (C) 2014-2016, F. Fevotte & B. Lathuiliere.
==53062== Using Valgrind-3.16.1+verrou-dev and LibVEX; rerun with -h for copyright info
==53062== Command: julia -q
==53062== 
==53062== First seed : 257308
[...]
==53062== Backend verrou simulating AVERAGE rounding mode

julia> sum(0.1*i for i in 1:1000)
50050.00000000016

julia> sum(0.1*i for i in 1:1000)
50050.00000000009

julia> sum(0.1*i for i in 1:1000)
50050.00000000016

julia> exit()
==53062== 
==53062==  ---------------------------------------------------------------------
==53062==  Operation                            Instruction count
==53062==   `- Precision
==53062==       `- Vectorization          Total             Instrumented
==53062==  ---------------------------------------------------------------------
==53062==  add                     8087                     8087          (100%)
==53062==   `- flt                     3804                     3804      (100%)
==53062==       `- llo                     3804                     3804  (100%)
==53062==   `- dbl                     4283                     4283      (100%)
==53062==       `- llo                     4283                     4283  (100%)
==53062==  ---------------------------------------------------------------------
[...]
==53062== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If your calculation (say, on Linux + Julia 1.4.2) produces “approximately the same” results with Verrou’s random rounding mode as when run with standard FP arithmetic, then you can more or less safely assume that the errors do not come from the propagation of FP round-offs.

Less drastic perturbations can sometimes be achieved by replacing standard rounding mode (round to nearest) with a directed rounding: this can be performed for example using valgrind --tool=verrou --rounding-mode=downward.

And it is always nice to perform a few safety checks first:

  1. valgrind --tool=none will run your computation in the context of Valgrind, but without changing anything except that your threads will be “sequentialized” (i.e. they will run one after the other, not concurrently). If this changes your results, then such changes are likely due to the way threads are handled in your application (either that or you’ve bumped into a Valgrind bug).

  2. valgrind --tool=verrou --rounding-mode=nearest will run your calculation in the context of Verrou, but without changing the rounding mode. If this dramatically changes your results, then you’ve probably bumped into a bug in Verrou.

Finally, be aware that Valgrind+Verrou adds quite a bit of overhead to your computation times. You can generally expect slowdowns between 8x and 40x w.r.t. sequential run times.


Awesome! I will try it. Thank you for the suggestion!

Hi there!

After careful evaluation of the problem, I could finally track the error down to sums involving the P norm of a complex array c and its derivative with respect to another parameter. The expressions are something like

N = ( sum_i (conj(c_i) a_i c_i)^(P/2) )^(1/P)

and

D = sum_i (conj(c_i) a_i c_i)^(P/2 - 1)

( a is a real vector, but its expression is not important here). For large P and large differences in magnitude between the real and imaginary parts of c, the non-associativity of floating-point arithmetic has a large influence. If we replace

conj(c_i)*c_i with abs2(c_i) = real(c_i)*real(c_i) + imag(c_i)*imag(c_i)

the result changes in the 4th decimal place in some situations!
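Here is a minimal sketch with made-up values (not from the actual model) comparing the two formulations; depending on the hardware and compiler the difference may be zero or nonzero, and any nonzero difference comes purely from floating-point rounding and operation ordering:

P = 12
c = ComplexF64[1e8 + 1e-6im, 2e-7 + 3e5im, 4.0 + 5.0im]   # large real/imag magnitude gaps
a = [0.3, 1.7, 0.9]

# original form: conj(c_i) * a_i * c_i, taking the real part before the power
N_conj = sum(real(conj(ci) * ai * ci)^(P / 2) for (ci, ai) in zip(c, a))^(1 / P)

# alternative form: a_i * abs2(c_i), which never leaves real arithmetic
N_abs2 = sum((ai * abs2(ci))^(P / 2) for (ci, ai) in zip(c, a))^(1 / P)

N_conj - N_abs2   # any difference is a rounding/ordering artifact, not mathematics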

An interesting fact is that abs(c::Complex) uses hypot, which takes care of overflow and underflow.

The definition is

abs(z::Complex) = hypot(real(z), imag(z))

but

abs2(z::Complex) = real(z)*real(z) + imag(z)*imag(z)

with no special treatment for large differences between its real and imaginary parts (I really don’t know if that is needed, just thinking out loud).
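As an illustration (a sketch, not from the original code), the overflow protection in hypot is easy to see:

z = 1e200 + 1e200im
abs(z)          # hypot-based: ≈ 1.414e200, no overflow
sqrt(abs2(z))   # abs2 overflows to Inf, so this is Inf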

The funny thing is the different behavior of Julia 1.4.2 and 1.5.2 and also the influence of the manufacturer (AMD/Intel).

So I guess it is a matter of how different LLVM versions arrange the operations (order of evaluation) on different processors, combined with a clear tendency of our computations to run into this unfortunate situation, of course.

Thus, the use of MCS, the large number of iterations in the optimization AND this “particularity” are, together, the reason for the problem. We are working on some modifications of the original formulation to avoid such differences in magnitude and/or to apply proper scaling. In parallel, I am writing a short version of this code to inspect the @code_llvm and @code_native output, so I can learn more about this sort of situation.
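For a small reduced kernel, that inspection can be done directly on both Julia versions (a sketch; the kernel f is hypothetical):

f(c::ComplexF64, a::Float64) = real(conj(c) * a * c)

@code_llvm debuginfo=:none f(1.0 + 2.0im, 0.5)   # compare fmul/fadd vs. contracted forms
@code_native f(1.0 + 2.0im, 0.5)                 # vfmadd... instructions indicate FMA use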

I would like to thank you all for the feedback @Palli, @ctkelley, @ffevotte, @tbenst, @Oscar_Smith, @lostella, @mbauman, @PetrKryslUCSD .
