Multi-threading Heisenbug: code is faster when profiled

I have some trivially parallelizable for loops that I wanted to speed up using multi-threading, namely using Folds.jl's Folds.foreach(). I started Julia with julia -t 8 to get one interactive thread and 8 worker threads, and observed that profiling the code makes it faster:

# Without any profiling
julia> @time expect(args...; n_chunks=1)
  8.118154 seconds (5.13 k allocations: 50.581 MiB, 0.21% gc time)

julia> @time expect(args...; n_chunks=8)
 16.847068 seconds (64.31 k allocations: 55.753 MiB, 0.05% gc time)


# Profiling with @profile
julia> @time @profile expect(args...; n_chunks=1)
  9.119578 seconds (5.13 k allocations: 50.581 MiB, 0.13% gc time)

julia> @time @profile expect(args...; n_chunks=8)
  4.647983 seconds (64.16 k allocations: 55.751 MiB, 1.67% gc time)


# Profiling with @profview
julia> @time @profview expect(args...; n_chunks=1)
  9.244527 seconds (87.70 k allocations: 79.798 MiB, 0.33% gc time, 0.01% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)

julia> @time @profview expect(args...; n_chunks=8)
  2.939202 seconds (266.93 k allocations: 86.765 MiB, 3.42% gc time, 0.02% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)

Furthermore, the n_chunks parameter controls how many parallelizable batches the data gets split into. Without any profiling, increasing the number of batches slows the code down, even though it should in principle let more threads work in parallel. With profiling, on the other hand, increasing the number of batches does speed up the computation (although not as close to linearly as I had hoped).
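
For reference, the chunked loop inside expect follows roughly the pattern sketched below; the names (process_in_chunks, f, data) are placeholders, not my actual code.

using Folds

# Rough sketch of the chunking pattern (placeholder names, not the real code base):
# the work is split into n_chunks batches and Folds.foreach runs one task per batch.
function process_in_chunks(f, data; n_chunks=Threads.nthreads())
    chunks = collect(Iterators.partition(data, cld(length(data), n_chunks)))
    Folds.foreach(chunks) do chunk
        foreach(f, chunk)   # each batch is processed serially within its own task
    end
end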

Also, when watching htop while running the above code, I can see that all 8 CPU cores spike in usage when the code is profiled, while without profiling they don’t. Overall, I am under the impression that without profiling the code isn’t actually multi-threading.
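
One way to make that observation more concrete would be to record which thread each batch runs on and compare the profiled and unprofiled runs; something like this sketch (not part of my actual code):

using Folds

# Diagnostic sketch (not from the actual code base): record the thread id each
# batch runs on. Tasks may migrate between threads, so treat this only as a
# rough indicator of how many threads participate.
function observed_thread_ids(work, chunks)
    ids = zeros(Int, length(chunks))
    Folds.foreach(eachindex(chunks)) do i
        work(chunks[i])               # do the per-batch work
        ids[i] = Threads.threadid()   # note which thread finished this batch
    end
    return sort!(unique(ids))
end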

Did anyone have similar experiences before and would know what could cause multi-threaded code to run faster when profiled?

Unfortunately, I can’t provide an actual MWE because it is a pretty big code base, but any pointers on what to look out for would be very welcome.

Other things I have tried
Using a manual Threads.@threads loop instead of Folds.foreach() ended up being slightly slower in all cases, and the same Heisenbug prevailed.
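
That variant looked roughly like the following sketch (again with placeholder names, not my actual code):

# Rough sketch of the manual Threads.@threads variant (placeholder names):
function process_in_chunks_threaded(f, data; n_chunks=Threads.nthreads())
    chunks = collect(Iterators.partition(data, cld(length(data), n_chunks)))
    Threads.@threads for chunk in chunks
        foreach(f, chunk)   # each batch handled serially inside its thread
    end
end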

I also replaced Folds.foreach with a simple foreach as a sanity check and got

julia> @time expect(args...; n_chunks=1)
  8.003162 seconds (5.13 k allocations: 50.581 MiB, 0.24% gc time)

julia> @time expect(args...; n_chunks=8)
  8.873845 seconds (6.22 k allocations: 51.221 MiB, 0.11% gc time)

julia> @time @profview expect(args...; n_chunks=8)
 10.364693 seconds (86.40 k allocations: 87.422 MiB, 0.21% gc time, 0.01% compilation time)

which is as expected. Increasing the number of batches makes the code only marginally slower, because the batches are now processed one after the other anyway, and profiling makes it a bit slower still because of the profiling overhead.

After a long time of debugging, I managed to reduce my code to a reproducible, minimal working example. Running the following code on Julia 1.12 with julia -t 8

using Folds
using ProfileView

"""
Inserts all pairs in all buffer vectors in the i-th row of `buffer_matrix`
into the i-th dict in `dicts`
"""
function merge_buffers_into_dicts(dicts::AbstractVector{Dict{K,V}},
                                  buffer_matrix::AbstractMatrix{Vector{Pair{K,V}}}) where {K, V}
    Folds.foreach(dicts, eachrow(buffer_matrix)) do dict, buffers
        foreach(buffers) do buffer
            foreach(buffer) do kv
                k, v = kv
                dict[k] = v
            end
            empty!(buffer)
        end
    end
    return dicts, buffer_matrix
end

"""
Create a vector of Dict{Int, Float64} as well as a matrix of Vector{Pair{Int, Float64}}
"""
function setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
    dicts = map(1:n_dicts) do _
        Dict(rand(1:n_dicts * terms_per_dict, terms_per_dict) .=> rand(terms_per_dict))
    end

    buffer_matrix = [
        [rand(1:n_dicts * terms_per_dict) .=> rand() for _ in 1:terms_per_buffer]
        for _ in 1:n_dicts, _ in 1:n_dicts
    ]
    return dicts, buffer_matrix
end

produces

# These values showed the issue particularly well in my case
terms_per_dict = 2000
terms_per_buffer = 120
n_dicts = 8     # set this to the number of threads
n_iterations = 100

dicts, buffer_matrix = setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
@time begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#  1.059728 seconds (45.60 k allocations: 94.277 MiB, 2.64% gc time)


@time @profview begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#  0.162798 seconds (86.32 k allocations: 98.100 MiB, 15.90% gc time, 16.94% compilation time)

where we can clearly see that wrapping the timing block in @profview as well makes the code almost 6 times faster.

For completeness, I should also say that doing the same job with only one “chunk” but the same amount of data is a lot faster:

terms_per_dict = 8 * 2000
terms_per_buffer = 64 * 120
n_dicts = 1     # one chunk instead of one per thread
n_iterations = 100

dicts, buffer_matrix = setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
@time begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#    0.068958 seconds (3.10 k allocations: 64.984 MiB, 43.24% gc time)


@time @profview begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#   0.119501 seconds (20.53 k allocations: 66.667 MiB, 17.89% gc time)

However, in my actual application there are cases where the multi-chunk, multi-threaded code is faster than the single-chunk version when profiling is on, and slower when profiling is off.