`objectid` and reproducibility

I just spent half a day debugging very weird non-deterministic behaviour in my simulation. The culprit as it turns out was one little piece of code where I used sorting an array of objects by objectid as a way to make getting rid of duplicates more efficient. Apparently objectid is not derministic, though, so the order of objects varied between simulation runs leading to (numerically) different results even with the same random seed.

It’s not difficult to avoid the issue, once you know that it exists, but finding it was a bit of a hassle…

ETA: issue

I’ve run into a related issue: If you define hash for some new type T and that hash involves objectid(T), then dictionaries with key type T are displayed non-deterministically. If such a dictionary shows up in Julia output inside some docstring, then Documenter.jl will complain all the time that the docstring is not accurate anymore.

Python switched its dicts to ordered dicts by default, and solving those non-determinacy problems is a major benefit. I use OrderedDict by default nowadays (from OrderedCollections.jl), unless I have a very good reason not to.

1 Like

I’d try Base.IdSet for this. It lacks a convenient constructor, but you can do

set = union!(Base.IdSet(), arr)

Sets have undefined order, so for reproducibility you probably want something like

ordered_and_unique = sort!(collect(set); by=<something>)

Caveat: Base.IdSet isn’t documented. I don’t know why; it’s built on IdDict the same way Set is built on Dict, and IdDict is well-documented and exported.

Fwiw IdSet uses objectid as it’s hashing function.

As people pointed out, objectid is simply the (slightly mangled) address for addressable objects (i.e. for objects where pointer_from_objref is doesn’t throw).

This fact looks unlikely to go away soon. The current implementation implicitly guarantees the following very nice property:

If you have two addressable objects that are not identical / egal, then their objectids will be different.

This subtle undocumented fact is internally relied on. For example, the current implementation of ScopedValues uses a persistent dict / HAMT that just doesn’t handle hash-collisions. This is fine, because hash-collisions are fundamentally impossible as long as you remember never to feed in non-addressable objects (objectids can be re-used after the old object dies; that’s the responsibility of allocator/GC)!

So, I guess you should consider asking this question to your operating system / libc, because the exact addresses of virtual memory handed out there is probably the ultimate source of your non-reproducibility.

(yes, this is another complication in moving GC)

I very much second this recommendation. Introducing very-hard-to-reproduce unintentional randomization into a software project sucks so hard!

At work we had one incident where somebody computed a checksum over some data by processing entries in the order they fell out of a java HashMap iterator, to be used for some caching. Later that was upgraded to a sha checksum, and another team thought “these don’t collide, let’s use that as a key/identity for our database”.

That bug laid dormant for years until it exploded into our face, and left us with years of subtly broken/inconsistent history that was impossible to clean up :scream:. Upshot was a soft policy of “unordered collections are just as suspect as sun.misc.Unsafe – don’t do it without at least a paragraph of justifying source-code comment”.

2 Likes

That’s simply a doc test bug (your bug). Use [...] in the doc string where the dictionary is shown: Writing Documentation.

Another solution, adopted by the Go language, is to randomize the iteration order every time.

2 Likes

That doesn’t work if I want the output to appear in the docstring. Think of a function that returns a dictionary. If I want to illustrate what such a function does, then omitting the output doesn’t make sense.

Then use julia-repl instead of jldoctest, in addition to unit tests.

Following on Neven’s suggestion, you could also show the dictionary entries explicitly:

    """
        f(x)

    Returns a dict.

    # Examples
    ```jldoctest
    julia> d = f(1);

    julia> d["a"], d["b"]
    (1, 2)
    ```
    """
    function f(x)
        return Dict("a"=>x, "b"=>1+x)
    end

That way, you don’t (incorrectly) assume any ordering.

2 Likes