How to isolate & report a non-deterministic bug

I have a large (15k LOC) calculation that works fine in 1.10.9 but gives an error in 1.11.4 (nothing else changed, not even the manifest).

I isolated the function where this happens and I can reproduce it (within my codebase). So I put some @show statements inside the code to figure out what is going on… and the bug disappears.

I am at a loss as to how to debug this further. The code does not involve threading or async. The error is a counter being off by 1, as if a particular line was elided by the compiler (but how can I check this?).

But reproducing requires about 7k LOC, which I don’t want to make public at this point. Any attempts to reduce this to an MWE make the bug disappear.

You can attempt to use creduce to (somewhat automatically) try to delete stuff that doesn’t influence the bug, but chances are that this is an issue due to either effects (which @show would influence negatively) or autovectorization (in which case deleting some lines may also have the same issue as @show). Are there any uses of @simd or other vectorization libraries in your code? Is unsafe_* used internally and some bounds check is mistakenly removed with @inbounds? Can you reproduce the issue with --check-bounds=true?

@show might also influence things like similar by changing what’s present in memory before you make the first write. If your algorithm happens to read it (which would be a coding mistake), that might result in a heisenbug.
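
A minimal sketch of the kind of mistake described here, using made-up function names (`sum_columns_buggy` and `sum_columns_fixed` are illustrative and not from the original code). The buggy version accumulates into a `similar` buffer without initializing it, so its result depends on whatever happened to be in memory:

```julia
# Buggy pattern: `similar` allocates a buffer but does not initialize it.
function sum_columns_buggy(A::Matrix{Float64})
    out = similar(A, size(A, 2))      # contents are arbitrary at this point
    for j in axes(A, 2), i in axes(A, 1)
        out[j] += A[i, j]             # first `+=` reads uninitialized memory
    end
    return out
end

# Fixed pattern: start from an explicitly zeroed buffer.
function sum_columns_fixed(A::Matrix{Float64})
    out = zeros(eltype(A), size(A, 2))
    for j in axes(A, 2), i in axes(A, 1)
        out[j] += A[i, j]
    end
    return out
end
```

The buggy version can appear to work for a long time if freshly allocated memory happens to be zero, which is why unrelated changes such as an extra `@show` can make the symptom come and go.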

Thanks @Sukera and @gdalle. It is not a bounds issue, I am not using unsafe_ or @simd.

The error happens in a loop not unlike

while true
    (; τ, log_mass, periods_left) = get_current_state(MLI)
    pma = period_mass_aggregates(model_approximations, group_information, P, τ, log_mass)
    mass_aggregates = add_mass_aggregates!(mass_aggregates, pma)
    if periods_left > 0
        advance_state!(MLI)
    else
        break
    end
end

The key parts are get_current_state and advance_state!; the latter checks internally for periods_left >= 0 and throws an error if the check fails.

@code_typed tells me that both functions are inlined, and the function that contains the loop becomes an 800-line monster. Only one instance of periods_left survives (the check), so the compiler must think it can elide the others.
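
One way to check whether a particular field write survives optimization is to search the printed typed IR with and without the optimizer. This is only a sketch: `Counter`, `bump!` and `mentions_setfield` are hypothetical names for illustration, and matching on printed IR is a rough heuristic rather than a robust analysis.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

# Hypothetical mutable state, standing in for the real iterator type.
mutable struct Counter
    n::Int
end

# A function whose only side effect is a field write.
bump!(c::Counter) = (setfield!(c, :n, getfield(c, :n) + 1); c)

# Return true if the printed IR of `f` for `argtypes` mentions `setfield!`.
function mentions_setfield(f, argtypes; optimize::Bool)
    ci, _ = only(code_typed(f, argtypes; optimize = optimize))
    return occursin("setfield!", sprint(show, ci))
end
```

Comparing `mentions_setfield(f, types; optimize = false)` with `optimize = true` for the function that contains the loop gives a quick hint about whether a write was dropped by the optimizer.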

@noinline for both functions seems to fix the issue.
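
For readers who want to try the same workaround, here is a toy version of the loop with `@noinline` on both helpers. `ToyState` and `count_steps` are invented for this sketch, and the bodies are guesses at the general shape, not the actual implementation.

```julia
# Hypothetical stand-in for the real state object (MLI in the post).
mutable struct ToyState
    periods_left::Int
end

# @noinline on the definitions asks the compiler to keep these as real calls.
@noinline get_current_state(s::ToyState) = (; periods_left = s.periods_left)

@noinline function advance_state!(s::ToyState)
    s.periods_left -= 1
    s.periods_left >= 0 || error("advanced past the last period")
    return s
end

# Same control flow as the loop above, counting iterations instead of
# aggregating masses.
function count_steps(s::ToyState)
    steps = 0
    while true
        (; periods_left) = get_current_state(s)
        steps += 1
        if periods_left > 0
            advance_state!(s)
        else
            break
        end
    end
    return steps
end
```

With 3 periods left, the loop body runs 4 times (for 3, 2, 1 and 0 periods left), which gives a simple invariant to assert when comparing Julia versions.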

I think this is a compiler bug. I am still not sure whether it is fixed in master, as I have trouble loading some packages in the reproducer code; I will check that and report back.

The fact that @noinline fixes the issue is indeed a good indicator of this being a compiler bug. Does running with -O0 also fix this? Does the error reproduce on nightly?
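
When comparing runs with flags such as -O0 or --check-bounds=true, it can help to print the settings that are actually in effect at the top of the reproducer. The sketch below relies on `Base.JLOptions()`, which is an internal (non-public) interface, so treat the field names as an assumption that may change between versions; `run_settings` is a made-up helper name.

```julia
# Collect the settings relevant to this kind of bug hunt.
# Base.JLOptions() is internal API; the fields below exist in current
# releases but are not guaranteed to be stable.
function run_settings()
    opts = Base.JLOptions()
    return (
        version = VERSION,
        opt_level = Int(opts.opt_level),        # value passed via -O
        check_bounds = Int(opts.check_bounds),  # 0 = default, 1 = yes, 2 = no
    )
end

println(run_settings())
```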

Unfortunately, without a reproducer you can share with core devs it’s unlikely that a specific cause can be identified :confused:

I have a reproducer that I can share privately. Should I open an issue on Github?

Unfortunately I cannot tell because I cannot load JLD2, see

when that is fixed I will check.

Is JLD2 core to your code, or can you replace the data loading with some hardcoded/generated data?
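
A sketch of what generated data could look like, assuming the inputs can be approximated by a plain matrix (`make_inputs` is a hypothetical helper, and the constants are arbitrary). It avoids the random number generator on purpose, since random streams are not guaranteed to be identical across Julia versions, which matters when the same reproducer must run on several builds.

```julia
# Deterministic pseudo-data in [0, 1) from integer arithmetic only, so the
# values are the same on every Julia version.
function make_inputs(n::Int, m::Int)
    return [mod(7919 * i + 104729 * j, 1000) / 1000 for i in 1:n, j in 1:m]
end
```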

Yes, but it will take about half day; I will look into it.

I managed to make a LWE (large working example :wink:) without JLD2, and the bug is not present on nightly.

I am planning to bisect Julia from 1.11.4 to nightly to see what it was. Is make cleanall; make sufficient between steps?

Yes, I think so (I use that because I’ve seen it recommended somewhere).

EDIT no, it was a path issue, my problem is fixed on master. Stay tuned for the bisection…

I tried bisecting and narrowed it down to a handful (~15) of commits, of which

is the most likely suspect (for fixing it). Evidence for this is circumstantial (see below), but I think it is pretty likely: it was OK in 1.10, broken in 1.11, fixed in master, and indeed a setfield! form is missing from the lowered expression.

On a related note, I found bisecting Julia with a nontrivial reproducer (that uses a lot of packages) quite challenging. I am running into issues with resolving packages (in theory, one can always go up and keep the Manifest.toml, but if some packages impose lower bounds then this breaks), and I am running into issues like

which eventually prevented me from pinning this down.
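
For automated bisection, `git bisect run` interprets the exit code of a test script: 0 marks the commit as good, 125 means "cannot test, skip", and other codes from 1 to 127 mark it as bad. A sketch of a helper that separates "packages failed to load" from "the bug is present" might look like this (`bisect_exit_code`, `setup` and `check` are hypothetical names):

```julia
# Map the outcome of a reproducer to an exit code for `git bisect run`.
# `setup` loads packages/data; if it throws, the commit is untestable (125).
# `check` returns true when the result is correct; false or an exception
# counts as the bug being present (1).
function bisect_exit_code(setup, check)
    state = try
        setup()
    catch
        return 125
    end
    ok = try
        check(state) === true
    catch
        false
    end
    return ok ? 0 : 1
end

# A script used with `git bisect run` would end with something like:
# exit(bisect_exit_code(load_everything, result_is_correct))
```

When hunting for the commit that fixed a bug rather than the one that introduced it, the roles of good and bad are swapped, so the mapping needs to be inverted accordingly.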

Follow-up: with the help I got in the above topic I was able to bisect, which confirmed the guess above: it was #57141, fixed by #57201.

A subtle and hard-to-debug correctness error like this seems quite serious. The fix in #57201 doesn’t seem to be marked for backporting to 1.11, and it seems worth asking for that so we get this fix in the next patch release of 1.11 instead of only with the (potentially far in the future) 1.12 release.

The two MWEs in #57141 also appear to work in v1.11.4 (not sure because an expected result wasn’t provided), so maybe #57201 fixed two things and we’re good?

This comment does say that “things work as intended” in Julia 1.11.3. (And the MWEs run without error on my 1.11.4 too.)

But that issue thread points to this commit as the bisect result for the origin of the problem, which got merged in Aug 2024 before 1.11’s release.

And the problem in the original post here occurred under 1.11.4.

At this point, I don’t know what to think anymore :smile: @Tamas_Papp I know you’ve already put a lot of work into this, but are you confident that that is the issue that caused your problem too, and the PR that fixes it? The MWEs there don’t cause errors in 1.11.4, where your issue occurred.

It seems plausible that the same underlying bug was expressed differently in 1.11 (in your code) and 1.12 (in the MWEs), and so was fixed by the same PR, but that seems worth verifying before any backport attempts. Would it be possible for you to merge just the changes in #57201 into the v1.11.4 tag of the julia repo and see if your code returns the right result in that build?