Errors in Parsers.PosLen when loading a DataFrame from a large CSV file (80M rows)

I’m brand new to Julia and am unable to create a DataFrame from a test file that I’ve been using successfully (if slowly) with Pandas. The input is a single 5 GB space-delimited file with 80M lines. There are a few missing values, but they seem to parse fine.

My code to read the CSV is:

df = DataFrame(CSV.File(input_path, header=["project", "url", "count", "bytes"], delim=" ", limit=100000))

If I leave the “limit” parameter in place, the DataFrame is created fine. But if I remove it, or increase it very much, I get the following error:

ERROR: LoadError: ArgumentError: length argument to Parsers.PosLen (16340010) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/a3jNK/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/strings.jl:262
  [4] xparse
    @ ~/.julia/packages/Parsers/a3jNK/src/strings.jl:3 [inlined]
  [5] parsevalue!(#unused#::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
    @ CSV ~/.julia/packages/CSV/9LsxT/src/file.jl:817
  [6] parserow
    @ ~/.julia/packages/CSV/9LsxT/src/file.jl:688 [inlined]

Since this is my first foray into Julia, I’m not sure how to tell what is going wrong here; I have some theories, but I’m not sure how to chase them down.

Is it a limitation in the number of rows that can be read at one time? Is the parser getting tripped up on a particular row that isn’t parsing correctly in Julia? The Pandas dataframe fits in memory, but is it possible Julia has tighter memory requirements? Is this just the wrong Julia API for loading large datasets?
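If it is a particular bad row, one thing I guess I could try is bisecting with CSV.jl’s skipto and limit keywords to narrow down where parsing first fails. A rough sketch (the 1M-row window size is arbitrary, and it re-reads from disk once per window, so it would be slow):

# parse the file in windows of 1,000,000 rows and report the first window that errors
window = 1_000_000
for start in 1:window:80_000_000
    try
        CSV.File(input_path; header=["project", "url", "count", "bytes"], delim=" ", skipto=start, limit=window)
    catch err
        println("parsing fails somewhere in rows ", start, "-", start + window - 1)
        println(err)
        break
    end
end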

Any suggestions on how to proceed would be appreciated. Thanks!


Looks like an integer overflow, judging by the first line of the error.
I would file a bug report on GitHub.
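For what it’s worth, 1048575 is 2^20 - 1, which suggests the length of each parsed field is stored in a 20-bit slot and overflows once a single value runs past roughly 1 MB:

2^20 - 1   # == 1048575, the "max length allowed" in the error message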

Getting pretty much exactly the same thing running this code:

df = CSV.read("file.csv", ignoreemptyrows=true, missingstring="None", select=["host", "id", "created_at", "in_reply_to_id", "url", "uri"], DataFrame, ntasks=1)

The file seems quite similar to others I’ve run this command on with no problem. Here is the error output:

ERROR: ArgumentError: length argument to Parsers.PosLen (1749599) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/4bHKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/strings.jl:288
  [4] xparse
    @ ~/.julia/packages/Parsers/4bHKe/src/strings.jl:3 [inlined]
  [5] parsevalue!
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:763 [inlined]
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:596 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:ignoreemptyrows, :missingstring, :select, :ntasks), Tuple{Bool, String, Vector{String}, Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[7]:1

Don’t even know where to start with this one.

I’ve got the same thing going on, and in this case it’s with a public dataset so if someone wants to reproduce it:

grab this:
https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip

unzip it, and try to read the second csv file:

df = CSV.read("psam_husb.csv",DataFrame)

ERROR: TaskFailedException

    nested task error: ArgumentError: length argument to Parsers.PosLen (1048576) is too large; max length allowed is 1048575
    Stacktrace:
     [1] lentoolarge(len::Int64)
       @ Parsers /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/utils.jl:302
     [2] PosLen
       @ /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
     [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
       @ Parsers /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/strings.jl:289
     [4] xparse
       @ /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
     [5] parserow
       @ ~/.julia/packages/CSV/jFiCn/src/file.jl:655 [inlined]
     [6] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
       @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
     [7] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
       @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:361
     [8] (::CSV.var"#27#32"{CSV.Context, Vector{Int64}, Vector{Vector{CSV.Column}}, ReentrantLock, Int64, Int64})()
       @ CSV ./threadingconstructs.jl:178

Any ideas?
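In case it makes reproducing this easier, here’s the whole thing scripted end to end (it assumes a system unzip binary is on the PATH):

using Downloads, CSV, DataFrames

# download the ACS PUMS zip, unpack it, and read the second CSV
url = "https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip"
zippath = Downloads.download(url, "csv_hus.zip")
run(`unzip -o $zippath`)    # relies on the external unzip command
df = CSV.read("psam_husb.csv", DataFrame)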

Publicly available datasets are good for this. Could you open an issue at the CSV.jl repo referencing this thread?

I’ve tried to reproduce this, but it seems to work for me. It looks like we’re using the same package version, but I’m on macOS with Julia 1.7.1.
I wonder if it’s related to how the CSV is unzipped…
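If it is the unzip step, comparing a checksum of the unzipped file would at least tell us whether we’re parsing the same bytes. A quick sketch using the SHA standard library:

using SHA

# hash the unzipped CSV so the bytes can be compared across machines
open("psam_husb.csv") do io
    println(bytes2hex(sha256(io)))
end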

Reported here: Parsing fails with long strings · Issue #1009 · JuliaData/CSV.jl · GitHub

OK, thanks to the people who tried it. I deleted the zip file and re-downloaded it, and now it works. Sure enough, there was a stray quote character at the end of line 14 of my unzipped file… weird, huh? Bit flips happen, I guess.
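In hindsight, that one stray quote explains the error: presumably the parser treats everything from the stray quote to the next one as a single quoted field, and once that field passes the 2^20 - 1 byte cap, Parsers.PosLen gives up. A quick way to hunt for this kind of corruption is to scan for lines with an odd number of quote characters (legitimately quoted multi-line fields would also show up, so treat it as a heuristic):

# print the line numbers whose count of double-quote characters is odd
for (i, line) in enumerate(eachline("psam_husb.csv"))
    if isodd(count(==('"'), line))
        println(i, ": ", first(line, 80))    # show just the start of the suspect line
    end
end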