Why does logging (@info) create a race condition?

I have a directory with compressed CSVs that I’m reading in parallel. Roughly like so, simplifying away domain-specific details not relevant to the question:

using CSV, CodecZstd, DataFrames

function munge_csv(stream)
    # ...read a CSV with CSV.read, munge it, return a DataFrame...
end

function read_directory(dir)
    tasks = map(readdir(dir; join = true)) do filepath
        Threads.@spawn begin
            df = open(ZstdDecompressorStream, filepath) do stream
                return munge_csv(stream)
            end
            # this next line produces a data race if included, wat?
            @info "hi"
            return df
        end
    end
    dfs = fetch.(tasks)
    @info "Finished fetching" n_dfs = length(dfs) df_sizes = sort(nrow.(dfs))
    df = vcat(dfs...)
    return df
end

With the flagged @info "hi" line included, I seem to have a race condition: the final results are inconsistent, and so are the sizes of the task-specific dataframes. Without that logging line, there’s either no data race, or at least it’s rare enough that I haven’t seen it in a few dozen repeat trials.

What’s going on here? Is the logging call actually creating a threading problem, or is that a red herring? Is something else about this construct inherently not thread-safe, with the logging just making it more likely to show up?

I would change:

df = open(ZstdDecompressorStream, filepath) do stream
    return munge_csv(stream)
end

to:

local df = open(ZstdDecompressorStream, filepath) do stream
    return munge_csv(stream)
end

to make sure df isn’t getting captured from the enclosing scope. Because you later write df = vcat(dfs...), df is a local of the whole function body, so the assignment inside each spawned task writes to that one shared variable and the tasks race on it. The @info itself is most likely a red herring: the I/O and locking it does just perturb the task interleaving enough to make the latent race visible.
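
Here’s a minimal sketch of the same capture bug, independent of the CSV reading (capture_bug and its numbers are made up purely for illustration):

function capture_bug()
    tasks = map(1:8) do i
        Threads.@spawn begin
            x = 100 * i        # without local, this assigns the single
            sleep(rand() / 10) # function-level x shared by every task...
            return x           # ...so another task may have clobbered it
        end
    end
    vals = fetch.(tasks)
    x = sum(vals)  # this later assignment is what makes x a local of the
    return x       # whole function body, captured by all the closures
end

Repeated calls should return varying sums; writing local x = 100 * i inside the @spawn block gives each task its own variable, and the result is the expected 3600 every time.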

Ah, an embarrassing facepalm, thank you! I got got by df being assigned later in the hard local scope than the do-block closures that captured it.
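
For anyone else who trips over this, here’s a tiny illustration of the scoping rule at work (scope_demo is a toy example, not from the code above):

function scope_demo()
    f = () -> (y = 1)  # assigns the function-level y below, because the
                       # later assignment already makes y a local here
    y = 2
    f()
    return y           # returns 1, not 2: the closure overwrote it
end

Declaring local y = 1 inside the closure would give it its own variable instead, and scope_demo would return 2.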