Error when using a DataFrame loaded from a CSV with nthreads > 1


I get an error when I load a CSV file into a DataFrame and call stack on it. I narrowed it down: it only occurs when Julia runs with more than one thread, e.g. started with --threads=auto (Threads.nthreads() > 1).

I was able to reproduce the error in this MWE:

using CSV
using DataFrames

N = 100_000
df = DataFrame(index=1:N, parameter=rand(N))
CSV.write("test.csv", df)
df2 = CSV.read("test.csv", DataFrame)

stack(df, "index", Not("index"))
# this works
stack(df2, "index", Not("index"))
# it works for nthreads == 1

With more than one thread I get:

# ERROR: ArgumentError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
  [1] reduce_empty(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, #unused#::Type{Union{}})
    @ Base .\reduce.jl:356
  [2] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{}, #unused#::Base.HasEltype)
    @ Base .\reduce.jl:379
  [3] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{})
    @ Base .\reduce.jl:378
  [4] foldl_impl(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, nt::Base._InitialValue, itr::Tuple{})
    @ Base .\reduce.jl:49
  [5] mapfoldl_impl(f::SentinelArrays.var"#14#16", op::typeof(Base.add_sum), nt::Base._InitialValue, itr::Tuple{})
    @ Base .\reduce.jl:44
  [6] mapfoldl(f::Function, op::Function, itr::Tuple{}; init::Base._InitialValue)
    @ Base .\reduce.jl:170
  [7] mapfoldl
    @ .\reduce.jl:170 [inlined]
  [8] #mapreduce#292
    @ .\reduce.jl:302 [inlined]
  [9] mapreduce
    @ .\reduce.jl:302 [inlined]
 [10] #sum#295
    @ .\reduce.jl:530 [inlined]
 [11] sum(f::Function, a::Tuple{})
    @ Base .\reduce.jl:530
 [12] vcat(::SentinelArrays.ChainedVector{Int64, Vector{Int64}})
    @ SentinelArrays C:\Users\mario\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:632
 [13] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String}; variable_name::Symbol, value_name::Symbol, view::Bool, variable_eltype::Type)
    @ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:172
 [14] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String})
    @ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:136
 [15] top-level scope
    @ d:\WIP\Git\joulia.jl\run2.jl:10

With larger DataFrames (10,000 rows or more) the error occurs regularly. With smaller ones (around 1,000 rows) it occurs only rarely.

I am testing on:

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 16 on 16 virtual cores


  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0

Is this a bug in DataFrames.jl or CSV.jl? Or is it related to something else?

This looks like a SentinelArrays bug, cc @quinnj.
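
Frame [12] of the stack trace shows vcat being called on a SentinelArrays.ChainedVector, so a quick check is to print the storage type of each column after reading. This is only a diagnostic sketch: it recreates the file from the MWE in a temporary directory (the `path` and `coltypes` names are mine, not from the original post), and which type you see will likely depend on the thread count and file size.

```julia
using CSV
using DataFrames

# Recreate the file from the MWE so the snippet runs on its own.
N = 100_000
path = joinpath(mktempdir(), "test.csv")
CSV.write(path, DataFrame(index=1:N, parameter=rand(N)))

df2 = CSV.read(path, DataFrame)

# Collect and print the concrete storage type of each column.
# A ChainedVector here (instead of a plain Vector) matches the failing frame.
coltypes = Dict(name => typeof(df2[!, name]) for name in names(df2))
for (name, T) in coltypes
    println(name, " => ", T)
end
```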

If you collect the columns, it doesn’t happen (one of the keywords in CSV.File may let you turn this off directly during parsing; have a look at the docs):

julia> df2.index = collect(df2.index); df2.parameter = collect(df2.parameter);
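
The same workaround can be written for all columns at once, which helps when the file has more than two columns. The sketch below also tries reading with a single parsing task. I am assuming that `ntasks` is the CSV.jl keyword meant above and that `ntasks=1` avoids the chunked column type, so please check this against the CSV.File docs for your version. The names `path`, `df3`, `long2` and `long3` are only for illustration.

```julia
using CSV
using DataFrames

# Recreate the file from the MWE so the snippet runs on its own.
N = 100_000
path = joinpath(mktempdir(), "test.csv")
CSV.write(path, DataFrame(index=1:N, parameter=rand(N)))

# Option 1: read as usual, then copy every column into a plain Vector.
df2 = CSV.read(path, DataFrame)
for name in names(df2)
    df2[!, name] = collect(df2[!, name])
end

# Option 2 (assumption: `ntasks=1` disables multithreaded parsing,
# so the columns are not stored as chained chunks in the first place).
df3 = CSV.read(path, DataFrame; ntasks=1)

# Same call as in the MWE, now on the materialized data.
long2 = stack(df2, "index", Not("index"))
long3 = stack(df3, "index", Not("index"))
```

Option 1 costs one extra copy of the data. Option 2 avoids the copy but gives up parallel parsing, which may matter for large files.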