Hi,
I encountered an error when I load a CSV file into a DataFrame and then call stack on it. I narrowed it down: the error only occurs when I run Julia with --threads=auto (Threads.nthreads() > 1).
I was able to reproduce the error in this MWE:
using CSV
using DataFrames
N = 100_000
df = DataFrame(index=1:N, parameter=rand(N))
CSV.write("test.csv", df)
df2 = CSV.read("test.csv", DataFrame)
stack(df, "index", Not("index"))   # this works
stack(df2, "index", Not("index"))  # this works only for nthreads == 1
With more than one thread I get:
# ERROR: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
[1] reduce_empty(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, #unused#::Type{Union{}})
@ Base .\reduce.jl:356
[2] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{}, #unused#::Base.HasEltype)
@ Base .\reduce.jl:379
[3] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{})
@ Base .\reduce.jl:378
[4] foldl_impl(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, nt::Base._InitialValue, itr::Tuple{})
@ Base .\reduce.jl:49
[5] mapfoldl_impl(f::SentinelArrays.var"#14#16", op::typeof(Base.add_sum), nt::Base._InitialValue, itr::Tuple{})
@ Base .\reduce.jl:44
[6] mapfoldl(f::Function, op::Function, itr::Tuple{}; init::Base._InitialValue)
@ Base .\reduce.jl:170
[7] mapfoldl
@ .\reduce.jl:170 [inlined]
[8] #mapreduce#292
@ .\reduce.jl:302 [inlined]
[9] mapreduce
@ .\reduce.jl:302 [inlined]
[10] #sum#295
@ .\reduce.jl:530 [inlined]
[11] sum(f::Function, a::Tuple{})
@ Base .\reduce.jl:530
[12] vcat(::SentinelArrays.ChainedVector{Int64, Vector{Int64}})
@ SentinelArrays C:\Users\mario\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:632
[13] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String}; variable_name::Symbol, value_name::Symbol, view::Bool, variable_eltype::Type)
@ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:172
[14] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String})
@ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:136
[15] top-level scope
@ d:\WIP\Git\joulia.jl\run2.jl:10
The error occurs regularly for larger DataFrames (N >= 10,000). For smaller sizes (e.g. N = 1000) it occurs only rarely.
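In case it helps with diagnosis, here is a minimal sketch of how the two cases could be compared. It rests on an assumption taken from frame [12] of the stack trace: that with several threads CSV.read returns columns as SentinelArrays.ChainedVector rather than plain Vector. The ntasks keyword of CSV.read is used to force a single parsing task. That this avoids the error is my guess, not something I have confirmed, and df3 is just an illustrative name.

```julia
using CSV
using DataFrames

N = 100_000
df = DataFrame(index=1:N, parameter=rand(N))
CSV.write("test.csv", df)

# Default read: with nthreads > 1 the file may be parsed in several tasks.
df2 = CSV.read("test.csv", DataFrame)
@show typeof(df2.index)   # inspect whether this is a ChainedVector or a Vector

# Assumption: a single parsing task yields plain Vector columns.
df3 = CSV.read("test.csv", DataFrame; ntasks=1)
@show typeof(df3.index)

# If the column type is the cause, this call should not hit the vcat method
# in SentinelArrays that appears in the stack trace.
stack(df3, "index", Not("index"))
```

If the types printed for df2 and df3 differ, that would point towards the interaction between stack and the ChainedVector column type rather than towards the CSV contents.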
I am testing on:
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 16 on 16 virtual cores
Environment:
JULIA_EDITOR = code
Pkg.status:
[336ed68f] CSV v0.10.11
[a93c6f00] DataFrames v1.5.0
Is this a bug in DataFrames.jl or CSV.jl? Or is it related to something else?