Error when using a df loaded from a CSV and nthreads > 1

mariok90 · July 6, 2023, 10:43am

Hi,

I encounter an error when I load a CSV file into a DataFrame and call stack on it. I narrowed it down: it only occurs if I run Julia with --threads=auto (Threads.nthreads() > 1).

I was able to reproduce the error in this MWE:

using CSV
using DataFrames

N = 100_000
df = DataFrame(index=1:N, parameter=rand(N))
CSV.write("test.csv", df)
df2 = CSV.read("test.csv", DataFrame)

stack(df, "index", Not("index"))
# this works
stack(df2, "index", Not("index"))
# this works only for nthreads == 1

With more than one thread I get:

# ERROR: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
  [1] reduce_empty(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, #unused#::Type{Union{}})
    @ Base .\reduce.jl:356
  [2] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{}, #unused#::Base.HasEltype)
    @ Base .\reduce.jl:379
  [3] reduce_empty_iter(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, itr::Tuple{})
    @ Base .\reduce.jl:378
  [4] foldl_impl(op::Base.MappingRF{SentinelArrays.var"#14#16", Base.BottomRF{typeof(Base.add_sum)}}, nt::Base._InitialValue, itr::Tuple{})
    @ Base .\reduce.jl:49
  [5] mapfoldl_impl(f::SentinelArrays.var"#14#16", op::typeof(Base.add_sum), nt::Base._InitialValue, itr::Tuple{})
    @ Base .\reduce.jl:44
  [6] mapfoldl(f::Function, op::Function, itr::Tuple{}; init::Base._InitialValue)
    @ Base .\reduce.jl:170
  [7] mapfoldl
    @ .\reduce.jl:170 [inlined]
  [8] #mapreduce#292
    @ .\reduce.jl:302 [inlined]
  [9] mapreduce
    @ .\reduce.jl:302 [inlined]
 [10] #sum#295
    @ .\reduce.jl:530 [inlined]
 [11] sum(f::Function, a::Tuple{})
    @ Base .\reduce.jl:530
 [12] vcat(::SentinelArrays.ChainedVector{Int64, Vector{Int64}})
    @ SentinelArrays C:\Users\mario\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:632
 [13] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String}; variable_name::Symbol, value_name::Symbol, view::Bool, variable_eltype::Type)
    @ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:172
 [14] stack(df::DataFrame, measure_vars::String, id_vars::InvertedIndex{String})
    @ DataFrames C:\Users\mario\.julia\packages\DataFrames\LteEl\src\abstractdataframe\reshape.jl:136
 [15] top-level scope
    @ d:\WIP\Git\joulia.jl\run2.jl:10

For larger DataFrames (N >= 10,000) the error occurs regularly. For smaller ones (e.g. N = 1000) it occurs only rarely.

I am testing on:

Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_EDITOR = code

Pkg.status:

  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0

Is this a bug in DataFrames.jl or CSV.jl? Or is it related to something else?

nilshg · July 6, 2023, 12:28pm

This looks like a SentinelArrays bug, cc @quinnj.

If you collect the columns it doesn’t happen (maybe one of the keywords in CSV.File turns this off directly during parsing; have a look at the docs):

julia> df2.index = collect(df2.index); df2.parameter = collect(df2.parameter);
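
For reference, here is a minimal sketch of two possible workarounds, reusing the file name and column names from the MWE above. The first generalizes the collect approach to every column. The second relies on an assumption not confirmed in this thread: that passing the CSV.jl keyword ntasks=1 disables chunked multithreaded parsing, so the columns never come back as ChainedVector. Neither has been checked against the exact package versions listed above.

```julia
using CSV
using DataFrames

# Recreate the test file from the MWE.
N = 100_000
CSV.write("test.csv", DataFrame(index=1:N, parameter=rand(N)))

# Workaround 1: replace every column with a plain Vector after reading.
df2 = CSV.read("test.csv", DataFrame)
for name in names(df2)
    df2[!, name] = collect(df2[!, name])
end
long2 = stack(df2, "index", Not("index"))

# Workaround 2 (assumption): parse with a single task so that
# CSV.jl does not split the file into chunks.
df3 = CSV.read("test.csv", DataFrame; ntasks=1)
long3 = stack(df3, "index", Not("index"))
```

Running typeof(df2.index) before and after the loop should show whether the column was a SentinelArrays.ChainedVector to begin with.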
