I have a directory of compressed CSVs that I'm reading in parallel. The code looks roughly like this, simplifying away domain-specific details not relevant to the question:
using CSV, CodecZstd, DataFrames

function munge_csv(stream)
    df = CSV.read(stream, DataFrame)
    # ...munge df; domain-specific details omitted...
    return df
end

function read_directory(dir)
    tasks = map(readdir(dir; join = true)) do filepath
        Threads.@spawn begin
            df = open(ZstdDecompressorStream, filepath) do stream
                return munge_csv(stream)
            end
            # this next line produces a data race if included, wat?
            @info "hi"
            return df
        end
    end
    dfs = fetch.(tasks)
    @info "Finished fetching" n_dfs = length(dfs) df_sizes = sort(nrow.(dfs))
    df = vcat(dfs...)
    return df
end
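For anyone who wants to poke at this, here's a throwaway fixture along the lines of what I'm running. The file count, row counts, and columns are made up, and I haven't verified that the race reproduces on synthetic data this small, but the call pattern is identical:

using CSV, CodecZstd, DataFrames

# Hypothetical fixture: write a handful of zstd-compressed CSVs into a temp dir.
function make_fixture(dir; nfiles = 8, nrows = 10_000)
    for i in 1:nfiles
        df = DataFrame(a = rand(nrows), b = rand(1:100, nrows))
        open(ZstdCompressorStream, joinpath(dir, "file_$i.csv.zst"), "w") do io
            CSV.write(io, df)
        end
    end
    return dir
end

dir = make_fixture(mktempdir())
df = read_directory(dir)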
With the flagged @info "hi" line present, I seem to have a race condition: the final results are inconsistent from run to run, and the sizes of the per-task dataframes are inconsistent too. Without that logging line, there's either no data race, or at least it's rare enough that I haven't seen it in a few dozen repeat trials.
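Concretely, by "inconsistent" I mean that repeat runs over the same directory disagree. My check is roughly this sketch (the trial count is arbitrary):

# Run the whole pipeline repeatedly and compare final row counts.
runs = [read_directory(dir) for _ in 1:30]
@info "Distinct row counts across runs" unique(nrow.(runs))
# With the @info "hi" line present I see more than one distinct count
# (and the logged df_sizes vary too); without it, a single count every time.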
What’s going on here? Is the logging call actually creating a threading problem, or is that a red herring? Is something else about this construct inherently not threadsafe, and the logging just makes it worse?