Multi-threading Heisenbug: code is faster when profiled

I have some trivially parallelizable for loops that I wanted to speed up using multi-threading, namely using Folds.jl's Folds.foreach(). I started Julia with julia -t 8 to get one interactive thread and 8 worker threads and observed that profiling the code makes it faster:

# Without any profiling
julia> @time expect(args...; n_chunks=1)
  8.118154 seconds (5.13 k allocations: 50.581 MiB, 0.21% gc time)

julia> @time expect(args...; n_chunks=8)
 16.847068 seconds (64.31 k allocations: 55.753 MiB, 0.05% gc time)


# Profiling with @profile
julia> @time @profile expect(args...; n_chunks=1)
  9.119578 seconds (5.13 k allocations: 50.581 MiB, 0.13% gc time)

julia> @time @profile expect(args...; n_chunks=8)
  4.647983 seconds (64.16 k allocations: 55.751 MiB, 1.67% gc time)


# Profiling with @profview
julia> @time @profview expect(args...; n_chunks=1)
  9.244527 seconds (87.70 k allocations: 79.798 MiB, 0.33% gc time, 0.01% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)

julia> @time @profview expect(args...; n_chunks=8)
  2.939202 seconds (266.93 k allocations: 86.765 MiB, 3.42% gc time, 0.02% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)

Furthermore, the n_chunks parameter controls how many parallelizable batches the data gets split into. Without any profiling, increasing the number of batches slows the code down, even though it should in principle enable more threads to work in parallel. With profiling, on the other hand, increasing the number of batches does speed up the computation (although not as linearly as I had hoped).
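
For reference, the overall structure is roughly the following. This is a hypothetical sketch, not the real code: expect lives in a much larger code base, and process_chunk! here is just a stand-in for the per-chunk work.

using Folds

# Stand-in for the real per-chunk work
process_chunk!(chunk) = sum(abs2, chunk)

# Hypothetical sketch of the chunking pattern: split the data into
# n_chunks batches and let Folds process the batches in parallel.
function expect_sketch(data; n_chunks=1)
    chunks = collect(Iterators.partition(data, cld(length(data), n_chunks)))
    Folds.foreach(process_chunk!, chunks)
end

expect_sketch(rand(10^6); n_chunks=8)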

Also, when watching htop while running the above code, I can see that with profiling all 8 CPU cores spike in usage, while without profiling they don't. Overall, I am under the impression that without profiling the code isn't actually multi-threading.
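
A quick way to check from inside Julia whether the work actually lands on different threads is to log the thread id per chunk (a small sketch):

using Folds

# With working multi-threading, several distinct thread ids should show up.
Folds.foreach(1:8) do i
    println("chunk $i running on thread $(Threads.threadid())")
end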

Has anyone had similar experiences before and would know what could cause multi-threaded code to run faster when profiled?

Unfortunately, I can’t provide an actual MWE because it is a pretty big code base, but any pointers on what to look out for would be very welcome.

Other things I have tried
Using manual Threads.@threads instead of Folds.foreach() ended up being slightly slower in all cases, and the same Heisenbug persisted.
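
For reference, the @threads variant looked roughly like this (again a sketch, reusing the hypothetical chunks and process_chunk! names from above):

# Manual threading over the same chunks; each iteration may be scheduled
# on a different thread from the default pool.
Threads.@threads for chunk in chunks
    process_chunk!(chunk)
end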

I also replaced Folds.foreach with a simple foreach as a sanity check and got

julia> @time expect(args...; n_chunks=1)
  8.003162 seconds (5.13 k allocations: 50.581 MiB, 0.24% gc time)

julia> @time expect(args...; n_chunks=8)
  8.873845 seconds (6.22 k allocations: 51.221 MiB, 0.11% gc time)

julia> @time @profview expect(args...; n_chunks=8)
 10.364693 seconds (86.40 k allocations: 87.422 MiB, 0.21% gc time, 0.01% compilation time)

which is as expected: increasing the number of batches makes the code marginally slower, because they are now all processed sequentially, and profiling also makes it a bit slower because of the profiling overhead.
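
Concretely, the sanity check just swaps the parallel call for the sequential Base version (sketch):

# Base.foreach processes the chunks one after another on a single thread.
foreach(process_chunk!, chunks)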


After a long time of debugging, I managed to reduce my code to a reproducible, minimal working example. Running the following code on julia 1.12 with julia -t 8

using Folds
using ProfileView

"""
Inserts all pairs in all buffer vectors in the i-th row of `buffer_matrix`
into the i-th dict in `dicts`
"""
function merge_buffers_into_dicts(dicts::AbstractVector{Dict{K,V}},
                                  buffer_matrix::AbstractMatrix{Vector{Pair{K,V}}}) where {K, V}
    Folds.foreach(dicts, eachrow(buffer_matrix)) do dict, buffers
        foreach(buffers) do buffer
            foreach(buffer) do kv
                k, v = kv
                dict[k] = v
            end
            empty!(buffer)
        end
    end
    return dicts, buffer_matrix
end

"""
Create a vector of Dict{Int, Float64} as well as a matrix of Vector{Pair{Int, Float64}}
"""
function setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
    dicts = map(1:n_dicts) do _
        Dict(rand(1:n_dicts * terms_per_dict, terms_per_dict) .=> rand(terms_per_dict))
    end

    buffer_matrix = [
        [rand(1:n_dicts * terms_per_dict) .=> rand() for _ in 1:terms_per_buffer]
        for _ in 1:n_dicts, _ in 1:n_dicts
    ]
    return dicts, buffer_matrix
end

produces

# These values showed the issue particularly well in my case
terms_per_dict = 2000
terms_per_buffer = 120
n_dicts = 8     # set this to the number of threads
n_iterations = 100

dicts, buffer_matrix = setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
@time begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#  1.059728 seconds (45.60 k allocations: 94.277 MiB, 2.64% gc time)


@time @profview begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#  0.162798 seconds (86.32 k allocations: 98.100 MiB, 15.90% gc time, 16.94% compilation time)

where we can clearly see that additionally wrapping the timed block in @profview makes the code almost 6 times faster.

For completeness, I should also say that doing the same job with only one “chunk” but the same amount of data is a lot faster:

terms_per_dict = 8 * 2000
terms_per_buffer = 64 * 120
n_dicts = 1     # a single chunk this time
n_iterations = 100

dicts, buffer_matrix = setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
@time begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#    0.068958 seconds (3.10 k allocations: 64.984 MiB, 43.24% gc time)


@time @profview begin
    for _ in 1:n_iterations
        local_dicts = deepcopy(dicts)
        local_buffer_matrix = deepcopy(buffer_matrix)
        merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
    end
end
#   0.119501 seconds (20.53 k allocations: 66.667 MiB, 17.89% gc time)

However, in my actual application I have cases where the multi-chunk, multi-threaded code is faster than the single-chunk version when profiling is on, and slower when it is off.

Wrapping everything in functions definitely rules out global-variable effects, but it does not help here:

using ProfileView
using Folds
using BenchmarkTools

"""
Inserts all pairs in all buffer vectors in the i-th row of `buffer_matrix`
into the i-th dict in `dicts`
"""
function merge_buffers_into_dicts(dicts::AbstractVector{Dict{K,V}},
                                  buffer_matrix::AbstractMatrix{Vector{Pair{K,V}}}) where {K, V}
    Folds.foreach(dicts, eachrow(buffer_matrix)) do dict, buffers
        foreach(buffers) do buffer
            foreach(buffer) do kv
                k, v = kv
                dict[k] = v
            end
            empty!(buffer)
        end
    end
    return dicts, buffer_matrix
end

"""
Create a vector of Dict{Int, Float64} as well as a matrix of Vector{Pair{Int, Float64}}
"""
function setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
    dicts = map(1:n_dicts) do _
        Dict(rand(1:n_dicts * terms_per_dict, terms_per_dict) .=> rand(terms_per_dict))
    end

    buffer_matrix = [
        [rand(1:n_dicts * terms_per_dict) .=> rand() for _ in 1:terms_per_buffer]
        for _ in 1:n_dicts, _ in 1:n_dicts
    ]
    return dicts, buffer_matrix
end


function main()

    # These values showed the issue particularly well in my case
    terms_per_dict = 2000
    terms_per_buffer = 120
    n_dicts = 8     # set this to the number of threads
    n_iterations = 100
    
    dicts, buffer_matrix = setup_dicts_and_buffers(terms_per_dict, terms_per_buffer, n_dicts)
    @time begin
        for _ in 1:n_iterations
            local_dicts = deepcopy(dicts)
            local_buffer_matrix = deepcopy(buffer_matrix)
            merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
        end
    end
    #  1.059728 seconds (45.60 k allocations: 94.277 MiB, 2.64% gc time)
    
    
    @time @profview begin
        for _ in 1:n_iterations
            local_dicts = deepcopy(dicts)
            local_buffer_matrix = deepcopy(buffer_matrix)
            merge_buffers_into_dicts(local_dicts, local_buffer_matrix)
        end
    end
    #  0.162798 seconds (86.32 k allocations: 98.100 MiB, 15.90% gc time, 16.94% compilation time)
end

I get the same results:

julia> main()
  1.004347 seconds (45.97 k allocations: 67.841 MiB, 2.40% gc time)
  0.317008 seconds (81.47 k allocations: 71.202 MiB, 6.29% gc time, 0.07% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)

julia> main()
  1.010886 seconds (47.70 k allocations: 67.858 MiB, 2.04% gc time)
  0.271998 seconds (77.55 k allocations: 70.705 MiB, 4.44% gc time, 0.08% compilation time)
Gtk4.GtkWindowLeaf(accessible-role=GTK_ACCESSIBLE_ROLE_WINDOW, ...)



I can reproduce this:

julia> main();
  1.115940 seconds (96.33 k allocations: 96.611 MiB, 1.25% gc time, 5.49% compilation time)
  0.145159 seconds (128.10 k allocations: 100.102 MiB, 7.14% gc time, 29.09% compilation time)

However,

julia> Threads.@spawn main();
  0.121922 seconds (47.72 k allocations: 120.872 MiB, 41.39% gc time)
  0.211645 seconds (164.70 k allocations: 129.274 MiB, 4.31% gc time, 28.05% compilation time)

I’m not sure whether @profview works properly on anything other than the main thread, but in any case it appears that there’s something going on here, possibly with threads and Folds.jl? I don’t know.
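
To separate the sampling profiler itself from the GUI, one could collect samples with the Profile standard library and skip the window entirely (a sketch, reusing dicts and buffer_matrix from setup_dicts_and_buffers above):

using Profile

Profile.clear()
@profile merge_buffers_into_dicts(deepcopy(dicts), deepcopy(buffer_matrix))
Profile.print(mincount=10)  # plain-text report; no window involved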


After a lot more debugging I found the culprit: it is ProfileView itself! See `using ProfileView` kills multi-threading · Issue #252 · timholy/ProfileView.jl · GitHub
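
Based on the issue title, the trigger can be checked in a fresh julia -t 8 session by timing the threaded kernel before and after merely loading ProfileView (a sketch, using the MWE definitions from above):

# With the MWE definitions loaded, but before `using ProfileView`:
dicts, buffer_matrix = setup_dicts_and_buffers(2000, 120, 8)
@time merge_buffers_into_dicts(deepcopy(dicts), deepcopy(buffer_matrix))

using ProfileView  # loading alone is the reported trigger
@time merge_buffers_into_dicts(deepcopy(dicts), deepcopy(buffer_matrix))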


FWIW, it appears to be a Gtk4.jl issue (cf. the issue tracker; I posted details there).
