Errors in Parsers.PosLen loading dataframe from large CSV file (80M rows)

I’m brand new to Julia and am unable to create a DataFrame from a test file that I’ve been using successfully (albeit slowly) with Pandas. The input is a single 5 GB, space-delimited file with 80M lines. There are a few missing values, but they seem to parse fine.

My code to read the CSV is:

using CSV, DataFrames
df = DataFrame(CSV.File(input_path, header=["project", "url", "count", "bytes"], delim=" ", limit=100000))

If I keep the “limit” parameter, the DataFrame is created without problems. But if I remove it or increase it substantially, I get the following error:

ERROR: LoadError: ArgumentError: length argument to Parsers.PosLen (16340010) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/a3jNK/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/strings.jl:262
  [4] xparse
    @ ~/.julia/packages/Parsers/a3jNK/src/strings.jl:3 [inlined]
  [5] parsevalue!(#unused#::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
    @ CSV ~/.julia/packages/CSV/9LsxT/src/file.jl:817
  [6] parserow
    @ ~/.julia/packages/CSV/9LsxT/src/file.jl:688 [inlined]

I’m not sure how to tell what is going wrong here, as this is my first foray into Julia. I have some theories, but I don’t know how to test them.

Is it a limitation in the number of rows that can be read at one time? Is the parser getting tripped up on a particular row that isn’t parsing correctly in Julia? The Pandas dataframe fits in memory, but is it possible Julia has tighter memory requirements? Is this just the wrong Julia API for loading large datasets?

Any suggestions on how to proceed would be appreciated. Thanks!


Judging by the first line of the error message, this looks like an integer overflow.
I would file a bug report on GitHub.
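
Another possibility worth ruling out before filing: as far as I can tell from the message, the number in parentheses is the byte length of a single string field, and the limit 1048575 is 2^20 - 1, so the parser apparently saw one field of about 15.6 MiB. Here is a minimal sketch, using only Base Julia, that lists the lines longer than that limit. The names `MAX_FIELD_BYTES` and `long_lines` are made up for this illustration and are not part of CSV.jl or Parsers.jl.

```julia
# Largest length Parsers.PosLen accepts, taken from the error message (2^20 - 1).
const MAX_FIELD_BYTES = 2^20 - 1

# Return (line_number, byte_length) for every line longer than `limit` bytes.
# A field cannot be longer than the line that contains it, so finding no long
# lines would point towards a field that spans several lines instead.
function long_lines(io::IO; limit::Integer = MAX_FIELD_BYTES)
    hits = Tuple{Int,Int}[]
    for (n, line) in enumerate(eachline(io))
        nbytes = ncodeunits(line)   # length in bytes, not characters
        nbytes > limit && push!(hits, (n, nbytes))
    end
    return hits
end

# Convenience method (hypothetical helper): open the file at `path` and scan it.
long_lines(path::AbstractString; kwargs...) = open(io -> long_lines(io; kwargs...), path)
```

Calling `long_lines(input_path)` on the real file returns an empty vector if no single line is that long. In that case the oversized field presumably spans several lines, for example because of a stray quote character.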

I’m getting pretty much exactly the same error when running this code:

df = CSV.read("file.csv", DataFrame; ignoreemptyrows=true, missingstring="None", select=["host", "id", "created_at", "in_reply_to_id", "url", "uri"], ntasks=1)

The file seems quite similar to others I’ve run this command on without any problem. Here is the error output:

ERROR: ArgumentError: length argument to Parsers.PosLen (1749599) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/4bHKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/strings.jl:288
  [4] xparse
    @ ~/.julia/packages/Parsers/4bHKe/src/strings.jl:3 [inlined]
  [5] parsevalue!
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:763 [inlined]
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:596 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:ignoreemptyrows, :missingstring, :select, :ntasks), Tuple{Bool, String, Vector{String}, Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[7]:1

Don’t even know where to start with this one.
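
One possible starting point, offered as a sketch rather than a confirmed diagnosis: if a quote character is opened and never closed, a parser that honours quotes may treat everything up to the next quote as one field, which would be one way to end up with a field of 1749599 bytes. The Base-only function below lists the lines with an odd number of quote characters. The name `unbalanced_quote_lines` is hypothetical, and the check assumes that quoted fields never contain line breaks and that quotes inside a field are escaped by doubling them.

```julia
# Return the numbers of the lines that contain an odd number of `quotechar`
# characters. Assumption: a well-formed line always has an even number of
# quotes, because quoted fields do not span lines and embedded quotes are doubled.
function unbalanced_quote_lines(io::IO; quotechar::Char = '"')
    bad = Int[]
    for (n, line) in enumerate(eachline(io))
        isodd(count(==(quotechar), line)) && push!(bad, n)
    end
    return bad
end
```

For a file on disk this can be run as `open(unbalanced_quote_lines, "file.csv")`. Any line numbers it reports are only candidates to inspect by hand, since a file that legitimately contains multi-line quoted fields will also trigger the check.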