I have a directory of compressed CSVs that I'm reading in parallel. The code looks roughly like this, simplifying away domain-specific details not relevant to the question:
using CSV, CodecZstd, DataFrames

function munge_csv(stream)
    df = CSV.read(stream, DataFrame)
    # ...munge df; domain-specific details omitted...
    return df
end

function read_directory(dir)
    tasks = map(readdir(dir; join = true)) do filepath
        Threads.@spawn begin
            df = open(ZstdDecompressorStream, filepath) do stream
                return munge_csv(stream)
            end
            # this next line produces a data race if included, wat?
            @info "hi"
            return df
        end
    end
    dfs = fetch.(tasks)
    @info "Finished fetching" n_dfs = length(dfs) df_sizes = sort(nrow.(dfs))
    df = vcat(dfs...)
    return df
end
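For anyone who wants to poke at this, here's a throwaway fixture along the lines of what I'm running. The file count, row counts, and columns are made up, and I haven't verified that the race reproduces on synthetic data this small, but the call pattern is identical:

using CSV, CodecZstd, DataFrames

# Hypothetical fixture: write a handful of zstd-compressed CSVs into a temp dir.
function make_fixture(dir; nfiles = 8, nrows = 10_000)
    for i in 1:nfiles
        df = DataFrame(a = rand(nrows), b = rand(1:100, nrows))
        open(ZstdCompressorStream, joinpath(dir, "file_$i.csv.zst"), "w") do io
            CSV.write(io, df)
        end
    end
    return dir
end

dir = make_fixture(mktempdir())
df = read_directory(dir)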
With the flagged @info "hi" line present, I seem to have a race condition: the final results are inconsistent from run to run, and the sizes of the per-task dataframes are inconsistent too. Without that logging line, there's either no data race, or at least it's rare enough that I haven't seen it in a few dozen repeat trials.
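Concretely, by "inconsistent" I mean that repeat runs over the same directory disagree. My check is roughly this sketch (the trial count is arbitrary):

# Run the whole pipeline repeatedly and compare final row counts.
runs = [read_directory(dir) for _ in 1:30]
@info "Distinct row counts across runs" unique(nrow.(runs))
# With the @info "hi" line present I see more than one distinct count
# (and the logged df_sizes vary too); without it, a single count every time.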
What’s going on here? Is the logging call actually creating a threading problem, or is that a red herring? Is something else about this construct inherently not threadsafe, and the logging just makes it worse?