CSV.jl skipto and limit keyword arguments

I have another problem with CSV.jl and DataFrames.jl. Again, I figured out that the problem is due to the skipto/limit keywords in CSV.jl. Since I encounter a problem a day (I haven’t reported all of them, but I will from now on), my actual question is: "Am I the only user of these packages, or is anyone else using them? :wink: "

This is the error message for the latest one. I think it’s a bug, but I’m not sure. The data has more than 300K rows.

julia> df = CSV.read("data.csv", DataFrame, skipto = 100000, limit = 5000)
ERROR: InexactError: trunc(Int64, NaN)
Stacktrace:
 [1] trunc
   @ ./float.jl:716 [inlined]
 [2] ceil(#unused#::Type{Int64}, x::Float64)
   @ Base ./float.jl:295
 [3] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
   @ CSV ~/.julia/packages/CSV/9LsxT/src/context.jl:631
 [4] #File#25
   @ ~/.julia/packages/CSV/9LsxT/src/file.jl:220 [inlined]
 [5] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol, Symbol}, NamedTuple{(:skipto, :limit), Tuple{Int64, Int64}}})
   @ CSV ~/.julia/packages/CSV/9LsxT/src/CSV.jl:91
 [6] top-level scope
   @ REPL[11]:1

Looks like a CSV.jl type-detection bug, where it has determined that a column should be Int64 and then runs into a cell with NaN:
https://github.com/JuliaData/CSV.jl/issues/705
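
For what it's worth, the top two frames of the trace above can be reproduced with Base alone. This is only a sketch of how the error message arises: it does not show where the NaN comes from inside `CSV.Context` (frame [3]), and the `0.0 / 0.0` below is just a hypothetical way to produce a NaN, not something taken from the CSV.jl source.

```julia
# Reproduce frames [1] and [2] of the trace in isolation (Base only, no CSV.jl).
# Assumption: 0.0 / 0.0 is used here purely as a convenient way to get a NaN.
x = 0.0 / 0.0
@show isnan(x)

# ceil(Int64, x) is the call shown in frame [2]; capture the exception
# instead of letting it abort the script.
err = try
    ceil(Int64, x)
    nothing
catch e
    e
end

@show err isa InexactError
```

So any code path that rounds a NaN to an integer would fail this way, whatever the origin of the NaN is.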

What version of CSV.jl are you running?

I assure you people use CSV.jl every day.

In your last question about CSV.jl I encouraged you to debug in a new environment to help narrow down the error. Have you done this?


This is an interesting one (is it a bug? Again, I’m not sure). It was a little difficult to figure out at first (there were many lines of code between the read and allowmissing!), but I finally found that it’s related to CSV.jl:

julia> using DataFrames
julia> using CSV
julia> df = CSV.read("data.csv", DataFrame, skipto = 1000, limit = 10000, pool = false)
julia> allowmissing!(df)
ERROR: DimensionMismatch("axes must agree, got (Base.OneTo(8886),) and (Base.OneTo(10000),)")
Stacktrace:
  [1] (::Base.var"#checkaxs#111")(axd::Tuple{Base.OneTo{Int64}}, axs::Tuple{Base.OneTo{Int64}})
    @ Base ./abstractarray.jl:1053
  [2] copyto_axcheck!
    @ ./abstractarray.jl:1055 [inlined]
  [3] AbstractVector{Union{Missing, String31}}(A::SentinelArrays.ChainedVector{String31, Vector{String31}})
    @ Base ./array.jl:541
  [4] AbstractArray
    @ ./boot.jl:475 [inlined]
  [5] convert
    @ ./abstractarray.jl:15 [inlined]
  [6] allowmissing(x::SentinelArrays.ChainedVector{String31, Vector{String31}})
    @ Missings ~/.julia/packages/Missings/r1STI/src/Missings.jl:34
  [7] allowmissing!(df::DataFrame, col::Int64)
    @ DataFrames ~/.julia/packages/DataFrames/BM4OQ/src/dataframe/dataframe.jl:1151
  [8] allowmissing!
    @ ~/.julia/packages/DataFrames/BM4OQ/src/dataframe/dataframe.jl:1157 [inlined]
  [9] allowmissing! (repeats 2 times)
    @ ~/.julia/packages/DataFrames/BM4OQ/src/dataframe/dataframe.jl:1173 [inlined]
 [10] top-level scope
    @ REPL[6]:1

Yes, it was a bug, and an issue was filed for it. BTW, I am not very good at debugging :smiley:

My version of CSV.jl is 0.9.11.

Do

] activate --temp
] add CSV
] add DataFrames

and then retry your code to see if it’s fixed. This creates a new temporary environment with the latest package versions, so you can confirm which versions your code works on.


I did this; now CSV.jl is version 0.10.1 and DataFrames.jl is 1.3.1.

Both problems reported in this thread still happen in this new environment.


@xinchin:

  1. Can you please open an issue in CSV.jl reporting the bug? I showed you how to do it last time.
  2. Your problems are connected with the skipto/limit kwargs in combination with reading data using multiple threads. At least in my experience, such a combination is not very common (most of the time people read in the whole file without limiting it). This is likely the reason why the issue was not encountered earlier.
  3. In general, the CSV.jl reader supports 29 kwargs that control how data reading is done. Assuming each of them could take only two values (in practice they take more), you have 2^29 = 536 870 912 possible combinations of parser configuration. Most of them are probably never used. For this reason, any bug reports are highly appreciated.
  4. For the reason described in point 3 above, CSV.jl has not hit a 1.0 release yet: the number of available options is so large that the CSV.jl maintainers know it requires a lot of testing, which is signaled by the version number being below 1.0 (it does not mean that the reader does not work; for “standard” configurations I have been using it for years and it works very well).
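
If point 2 is right, one way to check it is to repeat the same read on a single task. The sketch below is an assumption-laden test, not a confirmed fix: it builds a small stand-in file (the original "data.csv" is not available, and the `id`/`name` columns are made up), and it passes `ntasks = 1`, a keyword that appears in the `CSV.Context` signature in the stack trace above, to ask for single-task parsing.

```julia
using CSV, DataFrames

# Build a stand-in CSV file; the column names and contents are hypothetical.
path = tempname() * ".csv"
open(path, "w") do io
    println(io, "id,name")
    for i in 1:20_000
        println(io, i, ",row", i)
    end
end

# Same skipto/limit combination as in the thread, but restricted to one task.
# Assumption: with a single task the multithreaded code path is not used.
df = CSV.read(path, DataFrame; skipto = 1000, limit = 10_000, ntasks = 1)

# The call that failed in the thread; it should be checked on the real file too.
allowmissing!(df)

@show size(df)
```

If the errors disappear on the real file with `ntasks = 1` and come back without it, that would be useful information to include in the issue.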

CC @quinnj
