Identical code produces different results, sometimes

question

#1

I have a very strange bug in my program. The same code run in different sessions produces different results. It is impossible for me to produce an MWE because I cannot reproduce this behaviour in a small example. It only happens in the real application, and that is way too long to share here. I apologise for this, but there is nothing I can do about it and I’m really lost here. Thanks for having a look.

##Test

Here is the core of my test:

# I have a module called bk that contains the code.
function testrun()
    p = bk.Param()     # create a parameter type
    m = bk.Model(p);   # create a model type
    bk.solve!(m,p);    # solve the model
    # get some objects from that solved model
    # those are Array{Float64}
    # not SharedArray or something fancy.
    vown   = copy(m.Owner.DW_V)
    vbar   = copy(m.Owner.vbar)
    vbar13 = copy(m.Owner13.vbar)
    ccp13  = copy(m.Owner13.CCP)
    vrent  = copy(m.Renter.DW_V)
    vown7  = copy(m.Owner7.DW_V)
    vown13 = copy(m.Owner13.DW_V)
    evown  = copy(m.Owner.EV)

    # drop the reference to that model and force a collection
    m = 0
    gc()

    println("")
    info("run 2")
    println("")

    # create a new model. same parameter value.
    m2 = bk.Model(p);
    bk.solve!(m2,p);
    # see whether the result is the same
    dd = maxabs(vown[:] - m2.Owner.DW_V[:])
    dr = maxabs(vrent[:] - m2.Renter.DW_V[:])
    # success?
    succ = (dd == 0.0) && (dr == 0.0)

    m2 = 0
    gc()
    return succ
end
  • There is no rand anywhere in this program. It is a deterministic solution: a given p should imply a unique solution.
  • I start a first Julia session and run this test in a loop many times. It succeeds.
  • I start a second Julia session and the test fails right away: the two models m and m2 are not identical. The error is very large.
  • This is very erratic: sometimes the test fails, sometimes it doesn’t.
  • The memory footprint of each Julia session is about 2.5 GB.
  • I can suppress this behaviour by commenting out a certain section of my code. I can’t find anything wrong with that section; it is very similar to several other parts of the code. It seems to work in session number one.

Questions

  • Is there anything non-deterministic in the way julia generates code across different sessions?
  • Could there be some strange numerical error, overflow/underflow for example, that could only occur when my computer is in a certain state? like some part of memory is empty? or some kind of process runs during compilation?
  • I just ran this test successfully 100 times. I exit Julia, run it again, and it fails on the first run. I call the test again (in the same session) and now it runs fine. How is this possible?
  • Could/Should I use valgrind to track this down? how?
  • thanks.

#2

Is there anything non-deterministic in the way julia generates code across different sessions?

Yes. All of the pointers will be different.

Could there be some strange numerical error, overflow/underflow for example, that could only occur when my computer is in a certain state? like some part of memory is empty? or some kind of process runs during compilation?

If you didn’t initialize an array, yes.

Could/Should I use valgrind to track this down? how?

I don’t think valgrind will be helpful (unless you are calling buggy C code). You can track where the result starts to deviate.
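
One way to find where two runs start to deviate is to save named intermediate arrays at the same points in each run and compare them in order. A minimal sketch, assuming Julia 1.x; first_divergence and the stage names are hypothetical and not part of the bk module:

```julia
# Hypothetical helper: given two ordered lists of (name => array) snapshots
# taken at the same points in two runs, return the name of the first stage
# whose arrays differ, or `nothing` if every stage matches.
function first_divergence(run1, run2)
    for ((name1, a), (name2, b)) in zip(run1, run2)
        name1 == name2 || error("snapshot order differs: $name1 vs $name2")
        # isequal treats NaN as equal to NaN, so identical garbage still matches
        isequal(a, b) || return name1
    end
    return nothing
end

# Toy snapshots standing in for intermediate results of two solves
run1 = ["grid" => [1.0, 2.0], "vbar" => [0.5, 0.5], "EV" => [3.0, 4.0]]
run2 = ["grid" => [1.0, 2.0], "vbar" => [0.5, 0.7], "EV" => [3.0, 9.0]]

println(first_divergence(run1, run2))
```

The earliest stage reported is the place to start looking; later stages usually differ only because they consume the earlier one.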


#3

Change every memory allocation to zeros to make sure the arrays are zeroed. Do you still have this problem?


#4

You mean it’s not enough to set the type that holds the arrays to zero as I do, but rather go inside the type and zero out each array individually? I thought doing what I do destroys the entire object. I’ll try!


#5

It is not clear what you mean here. How do you set a type to 0? What @ChrisRackauckas meant is that any array that is supposed to start out as zeros should be created with zeros, because constructors like Array{Float64}(...) do not initialize their contents and can hold arbitrary values.


#6

i thought

type m
     x :: Array
end

mm = m(rand(10))

mm = 0  #erases the array x?

That last line is what I meant.


#7

You are mistaken: it just rebinds mm to the integer 0 and does not zero the array. Use m(zeros(10)) in the above example.
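
The difference between rebinding a name and overwriting an array can be sketched as follows. This assumes Julia 1.x syntax (in the 0.6 release used in this thread, mutable struct was written type); Holder is a hypothetical stand-in for the type in the example above:

```julia
# Hypothetical stand-in for the `m` type from the post above
mutable struct Holder
    x::Vector{Float64}
end

data = ones(3)
h = Holder(data)

h = 0                      # rebinds the name `h` to the integer 0
println(data)              # the array itself is untouched

h2 = Holder(data)
fill!(h2.x, 0.0)           # this actually overwrites the array's contents
println(data)              # all zeros now: h2.x and data are the same array
```

Assigning 0 only makes the old object unreachable through that name; the memory is reclaimed later by the garbage collector and is never zeroed as a side effect.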

Also, if you ask questions here, investing effort into a MWE pays off.


#8

thanks for that - I didn’t know that!
Apologies again for a non MWE - impossible under my current constraints. However, if you looked at my example, what I find is equivalent to

type myT
    x :: Array
end

t1 = myT(ones(3))
t2 = myT(ones(3))

maxabs(t1.x[:] - t2.x[:]) == 0.0  #is false

with the difference that my code assigns the values to array x afterwards, instead of in the constructor as here. Again, the weird thing is that sometimes this test passes if I repeat it 100 times, and sometimes it does not. Whether I copy x to a separate object and erase t1 before doing the test shouldn’t matter. t1 and t2 should always refer to different objects?


#9

Sorry, but lacking an MWE, we only have your claim for this equivalence. If your example indeed gave false, that would be a serious bug in Julia. An error on your part is much more likely.

Nope, an MWE is never impossible, you are just unwilling to invest time in making one. Yet you expect others to help you.


#10

Nine out of ten times this has happened to me, it has been array initialization. Very often you’ll see very small values, but there’s no guarantee. See how far the code gets before diverging. Then strip away the unnecessary parts (what can you remove without the divergence disappearing), and post whatever steps your code goes through up till that point. That’s your MWE.


#11

Initially, any new memory that you allocate will have been zeroed out by the operating system. After the first garbage collection, new memory that you allocate might contain old data.

If you run the exact same code several times, with explicit garbage collection in between iterations, then you are likely to get objects that are initialized to their old values! This may effectively hide bugs where you read data before you have written it.

Example: (This is what happened for me. Your results would be different.)

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 0.6.0-rc1.0 (2017-05-07 00:00 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin13.4.0

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

julia> m=0; gc()

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 5.28715e242 
 7.34851e-307
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         
 0.0         

julia> m[1] = 42
42

julia> m = 0; gc()

julia> m = Array{Float64}(10)
10-element Array{Float64,1}:
 42.0         
  7.34851e-307
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         
  0.0         

Moral of the story is, make sure your objects are always properly initialized.

(There’s already an issue on this: https://github.com/JuliaLang/julia/issues/9147 )
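
While hunting for this kind of bug, one option is to fill freshly allocated arrays with NaN, so that an element that is never written shows up clearly instead of being hidden by stale memory. A minimal sketch, assuming Julia 1.x; poisoned is a hypothetical helper, not a Base function:

```julia
# Hypothetical debugging helper: allocate like `similar`, but fill with NaN
# so that any element that is never written stays visibly invalid.
poisoned(x::AbstractArray{<:AbstractFloat}) = fill!(similar(x), NaN)

x = [1.0, 2.0, 3.0]
y = poisoned(x)

# Intended to write every element, but the loop bound is off by one.
for i in 1:length(x)-1
    y[i] = 2 * x[i]
end

println(any(isnan, y))      # true: at least one element was never written
println(findall(isnan, y))  # indices of the elements that were never written
```

This only works for floating-point arrays, and the NaN fill should be swapped for a proper initial value once the missing write has been found.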


#12

this is a great example @Per, thanks!


#13

The culprit was indeed an array initialization: y = similar(x).
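
For reference, similar(x) allocates an array with the same element type and shape as x but does not initialize it. The original code is not shown in this thread, so the following is only a sketch (Julia 1.x) of a few ways such a line could be replaced when the array is supposed to start at zero:

```julia
x = [1.0, 2.0, 3.0]

y_uninit = similar(x)                 # same type and shape, contents arbitrary
y_zero1  = zeros(eltype(x), size(x))  # explicitly zeroed
y_zero2  = fill!(similar(x), 0)       # allocate, then overwrite every element
y_zero3  = zero(x)                    # zero array with the shape and eltype of x

println(y_zero1 == y_zero2 == y_zero3)
```

Keeping similar is fine when every element is guaranteed to be written before it is read; the zeroed variants matter when the code accumulates into the array or reads it first.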