Identical code produces different results. sometimes

floswald · May 11, 2017, 8:56pm

I have a very strange bug in my program. The same code run in different sessions produces different results. I cannot produce an MWE because I cannot reproduce this behaviour in a small example. It only happens in the real application, and that is way too long to share here. I apologise for this, but there is nothing I can do about it and I’m really lost here. Thanks for having a look.

Test

Here is the core of my test:

# I have a module called bk that contains the code.
function testrun()
    p = bk.Param()     # create a parameter type
    m = bk.Model(p)    # create a model type
    bk.solve!(m, p)    # solve the model
    # get some objects from that solved model
    # those are Array{Float64}
    # not SharedArray or something fancy.
    vown   = copy(m.Owner.DW_V)
    vbar   = copy(m.Owner.vbar)
    vbar13 = copy(m.Owner13.vbar)
    ccp13  = copy(m.Owner13.CCP)
    vrent  = copy(m.Renter.DW_V)
    vown7  = copy(m.Owner7.DW_V)
    vown13 = copy(m.Owner13.DW_V)
    evown  = copy(m.Owner.EV)

    # completely erase that model
    m = 0
    gc()

    println("")
    info("run 2")
    println("")

    # create a new model. same parameter value.
    m2 = bk.Model(p)
    bk.solve!(m2, p)
    # see whether the result is the same
    dd = maxabs(vown[:] - m2.Owner.DW_V[:])
    dr = maxabs(vrent[:] - m2.Renter.DW_V[:])
    # success?
    succ = (dd == 0.0) && (dr == 0.0)

    m2 = 0
    gc()
    return succ
end

There is no rand anywhere in this program. It is a deterministic solution: a given p should imply a unique solution.
I start a first Julia session and run this test in a loop many times. It succeeds.
I start a second Julia session and the test fails right away: the two models m and m2 are not identical. The error is very large.
This is very erratic: sometimes the test fails, sometimes it doesn’t.
The memory footprint of each Julia session is about 2.5 GB.
I can suppress this behaviour by commenting out a certain section of my code. I can’t find anything wrong with that section; it is very similar to several other parts of the code. It seems to work in session number one.

Questions

Is there anything non-deterministic in the way julia generates code across different sessions?
Could there be some strange numerical error, overflow/underflow for example, that could only occur when my computer is in a certain state? like some part of memory is empty? or some kind of process runs during compilation?
I just ran this test successfully 100 times. I exit Julia, run it again, and it fails on the first run. I call the test again (in the same session) and now it runs fine. How is this possible?
Could/Should I use valgrind to track this down? how?
thanks.

yuyichao · May 11, 2017, 9:05pm

Is there anything non-deterministic in the way julia generates code across different sessions?

Yes. All of the pointers will be different.

Could there be some strange numerical error, overflow/underflow for example, that could only occur when my computer is in a certain state? like some part of memory is empty? or some kind of process runs during compilation?

If you didn’t initialize an array, yes.

Could/Should I use valgrind to track this down? how?

I don’t think valgrind will be helpful (unless you are calling buggy C code). You can track where the result starts to deviate.
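
One way to follow this advice is to snapshot intermediate arrays at the same stages of each run and report the first one that differs. The sketch below uses Julia 1.x syntax (the thread itself is on 0.6). The helper `first_divergence` and the stage names are hypothetical and not part of `bk`.

```julia
# Hypothetical helper: compare named snapshots from two runs, in order,
# and return the name of the first array that differs (or nothing).
function first_divergence(run1, run2)
    for ((name, a), (_, b)) in zip(run1, run2)
        # isequal treats NaN as equal to NaN, unlike ==
        if !isequal(a, b)
            return name
        end
    end
    return nothing
end

# Stand-in snapshots; in a real program these would be copies of
# intermediate results taken at successive stages of the solver.
run1 = ["stage1" => [1.0, 2.0], "stage2" => [3.0, 4.0], "stage3" => [5.0, 6.0]]
run2 = ["stage1" => [1.0, 2.0], "stage2" => [3.0, 4.5], "stage3" => [5.0, 7.0]]

println(first_divergence(run1, run2))  # prints stage2
```

Once the first diverging stage is known, the search can be repeated inside that stage with finer-grained snapshots.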

ChrisRackauckas · May 12, 2017, 1:02am

Change every memory allocation to zeros to make sure the arrays are zeroed. Do you still have this problem?

floswald · May 12, 2017, 6:30am

You mean it’s not enough to set the type that holds the arrays to zero as I do, but rather go inside the type and zero out each array individually? I thought doing what I do destroys the entire object. I’ll try!

Tamas_Papp · May 12, 2017, 7:01am

It is not clear what you mean here: how do you set a type to 0? What @ChrisRackauckas meant is that arrays that are meant to start out as zeros, and are not otherwise initialized, should be created with zeros, because constructors like Array{Float64}(...) leave the memory uninitialized, so the arrays contain arbitrary values.
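
The difference can be sketched as follows, assuming Julia 1.x syntax, where uninitialized allocation is spelled with `undef` (in 0.6 it was `Array{Float64}(n)`). The variable names are illustrative only.

```julia
# Julia 1.x syntax: `undef` requests uninitialized memory.
a = Array{Float64}(undef, 5)              # contents are arbitrary; never read before writing
b = zeros(Float64, 5)                     # every element is guaranteed to be 0.0
c = fill!(Array{Float64}(undef, 5), 0.0)  # explicit fill, same result as zeros

println(b == c)          # prints true
println(all(iszero, b))  # prints true
```

Nothing can be assumed about the contents of `a`: it may happen to print as zeros in a fresh session and as leftover data later.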

floswald · May 12, 2017, 7:26am

I thought

type m
     x :: Array
end

mm = m(rand(10))

mm = 0  #erases the array x?

That last line is what I meant.

Tamas_Papp · May 12, 2017, 8:04am

You are mistaken: it just rebinds mm to the integer 0. Use m(zeros(10)) in the above example.

Also, if you ask questions here, investing effort into a MWE pays off.
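
The rebinding point can be illustrated with a small sketch in Julia 1.x syntax, where the thread's `type` keyword becomes `mutable struct`. The names `Holder`, `alias` and `fresh` are made up for this example.

```julia
mutable struct Holder      # `type` in Julia 0.6 is `mutable struct` in 1.x
    x::Vector{Float64}
end

h = Holder([1.0, 2.0, 3.0])
alias = h.x        # second reference to the same array

h = 0              # rebinds the name h; the array itself is untouched
println(alias)     # prints [1.0, 2.0, 3.0]

fresh = Holder(zeros(3))   # a new object whose array is explicitly zeroed
println(fresh.x)   # prints [0.0, 0.0, 0.0]
```

Assigning 0 to a name only makes the old object eligible for garbage collection once nothing else refers to it. It does not zero any memory.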

floswald · May 12, 2017, 8:42am

Thanks for that, I didn’t know that!
Apologies again for the lack of an MWE; it is impossible under my current constraints. However, if you looked at my example, what I find is equivalent to

type myT
    x :: Array
end

t1 = myT(ones(3))
t2 = myT(ones(3))

maxabs(t1.x[:] - t2.x[:]) == 0.0  #is false

with the difference that my code assigns the values to array x, instead of the constructor, as here. Again, the weird thing is that sometimes this test passes if I repeat it 100 times, and sometimes it does not. Whether I copy x to a separate object and erase t1 before doing the test shouldn’t matter: t1 and t2 should always refer to different objects?

Tamas_Papp · May 12, 2017, 8:56am

Sorry, but lacking an MWE, we only have your claim for this equivalence. If your example indeed gave false, that would be a serious bug in Julia. An error on your part is much more likely.

Nope, an MWE is never impossible, you are just unwilling to invest time in making one. Yet you expect others to help you.

pkofod · May 12, 2017, 11:55am

Nine out of ten times this has happened to me, it has been array initialization. Very often you’ll see very small values, but there’s no guarantee. See how far the code gets before diverging. Then strip away the unnecessary parts (what can you remove without the divergence disappearing), and post whatever steps your code goes through up till that point. That’s your MWE.

Per · May 12, 2017, 12:04pm

Initially, any new memory that you allocate will have been zeroed out by the operating system. After the first garbage collection, new memory that you allocate might contain old data.

If you run the exact same code several times, with explicit garbage collection in between iterations, then you are likely to get objects that are initialized to their old values! This may effectively hide bugs where you read data before you have written it.

Example: (This is what happened for me. Your results would be different.)

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 0.6.0-rc1.0 (2017-05-07 00:00 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin13.4.0

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

julia> m=0; gc()

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 5.28715e242 
 7.34851e-307
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         

julia> m[1] = 42
42

julia> m = 0; gc()

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 42.0         
  7.34851e-307
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0

Moral of the story is, make sure your objects are always properly initialized.

(There’s already an issue on this: https://github.com/JuliaLang/julia/issues/9147 )
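
A common debugging aid, not mentioned in this thread, is to fill freshly allocated arrays with NaN while hunting for the bug, so that any element read before it is written contaminates the result visibly instead of silently. A minimal sketch in Julia 1.x syntax; `nan_similar` is a hypothetical helper, not a Base function.

```julia
# Hypothetical helper: allocate like `similar`, but poison the contents with NaN
# so that any element read before being written shows up in the results.
nan_similar(x::AbstractArray{<:AbstractFloat}) = fill!(similar(x), NaN)

x = [1.0, 2.0, 3.0]
y = nan_similar(x)

y[1] = 2 * x[1]
y[2] = 2 * x[2]
# y[3] is deliberately never assigned

total = sum(y)
println(isnan(total))        # prints true: the forgotten element is exposed
println(findall(isnan, y))   # prints [3]
```

Unlike zero-filling, which can mask a read-before-write bug by making it deterministic, NaN-filling makes the bug reproducible and easy to spot.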

floswald · May 12, 2017, 1:22pm

this is a great example @Per, thanks!

floswald · May 12, 2017, 1:23pm

The culprit was indeed: an array initialization with y=similar(x).
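
For readers hitting the same problem: `similar(x)` allocates an array with the same element type and size as `x` but does not initialize its contents. The sketch below, in Julia 1.x syntax with illustrative variable names, shows some initialized alternatives. Which one is right depends on what the code expects the array to hold.

```julia
x = [1.0, 2.0, 3.0]

y_bad  = similar(x)              # same type and size as x, contents uninitialized
y_zero = zero(x)                 # same shape, every element 0.0
y_fill = fill!(similar(x), 0.0)  # allocate, then overwrite every element
y_copy = copy(x)                 # same shape, contents copied from x

println(y_zero == y_fill)   # prints true
println(y_copy == x)        # prints true
```

Using `similar` is fine when every element is assigned before it is read; the bug arises only when some element is read first.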
