Errors in Parsers.PosLen when loading a DataFrame from a large CSV file (80M rows)

I’m brand new to Julia and am unable to create a DataFrame from a test file that I’ve been using successfully (if slowly) with Pandas. The input is a single 5 GB space-delimited file with 80M lines. There are a few missing values, but they seem to parse fine.

My code to read the CSV is:

df = DataFrame(CSV.File(input_path, header=["project", "url", "count", "bytes"], delim=" ", limit=100000))

If I leave the “limit” parameter in place, the DataFrame is created fine. But if I remove it, or increase it very much, I get the following error:

ERROR: LoadError: ArgumentError: length argument to Parsers.PosLen (16340010) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/a3jNK/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/a3jNK/src/strings.jl:262
  [4] xparse
    @ ~/.julia/packages/Parsers/a3jNK/src/strings.jl:3 [inlined]
  [5] parsevalue!(#unused#::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
    @ CSV ~/.julia/packages/CSV/9LsxT/src/file.jl:817
  [6] parserow
    @ ~/.julia/packages/CSV/9LsxT/src/file.jl:688 [inlined]

Since this is my first foray into Julia, I’m not sure how to tell what is going wrong here; I have some theories, but I’m not sure how to chase them down.

Is it a limitation in the number of rows that can be read at one time? Is the parser getting tripped up on a particular row that isn’t parsing correctly in Julia? The Pandas dataframe fits in memory, but is it possible Julia has tighter memory requirements? Is this just the wrong Julia API for loading large datasets?
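If it is a particular bad row, one thing I guess I could try is bisecting with CSV.jl’s skipto and limit keywords to narrow down where parsing first fails. A rough sketch (the 1M-row window size is arbitrary, and it re-reads from disk once per window, so it would be slow):

# parse the file in windows of 1,000,000 rows and report the first window that errors
window = 1_000_000
for start in 1:window:80_000_000
    try
        CSV.File(input_path; header=["project", "url", "count", "bytes"], delim=" ", skipto=start, limit=window)
    catch err
        println("parsing fails somewhere in rows ", start, "-", start + window - 1)
        println(err)
        break
    end
end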

Any suggestions on how to proceed would be appreciated. Thanks!


Looks like an integer overflow, judging by the first line of the error.
I would file a bug report on GitHub.
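For what it’s worth, 1048575 is 2^20 - 1, which suggests the length of each parsed field is stored in a 20-bit slot and overflows once a single value runs past roughly 1 MB:

2^20 - 1   # == 1048575, the "max length allowed" in the error message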

Getting pretty much exactly the same thing running this code:

df = CSV.read("file.csv", ignoreemptyrows=true, missingstring="None", select=["host", "id", "created_at", "in_reply_to_id", "url", "uri"], DataFrame, ntasks=1)

The file seems quite similar to others I’ve run this command on with no problem. Here is the error output:

ERROR: ArgumentError: length argument to Parsers.PosLen (1749599) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/4bHKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/4bHKe/src/strings.jl:288
  [4] xparse
    @ ~/.julia/packages/Parsers/4bHKe/src/strings.jl:3 [inlined]
  [5] parsevalue!
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:763 [inlined]
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:596 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:ignoreemptyrows, :missingstring, :select, :ntasks), Tuple{Bool, String, Vector{String}, Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[7]:1

Don’t even know where to start with this one.

I’ve got the same thing going on, and in this case it’s with a public dataset so if someone wants to reproduce it:

grab this:
https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip

unzip it, and try to read the second csv file:

df = CSV.read("psam_husb.csv",DataFrame)

ERROR: TaskFailedException

    nested task error: ArgumentError: length argument to Parsers.PosLen (1048576) is too large; max length allowed is 1048575
    Stacktrace:
     [1] lentoolarge(len::Int64)
       @ Parsers /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/utils.jl:302
     [2] PosLen
       @ /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
     [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
       @ Parsers /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/strings.jl:289
     [4] xparse
       @ /var/local/dlakelan/dotjulia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
     [5] parserow
       @ ~/.julia/packages/CSV/jFiCn/src/file.jl:655 [inlined]
     [6] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
       @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
     [7] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
       @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:361
     [8] (::CSV.var"#27#32"{CSV.Context, Vector{Int64}, Vector{Vector{CSV.Column}}, ReentrantLock, Int64, Int64})()
       @ CSV ./threadingconstructs.jl:178

Any ideas?
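In case it makes reproducing this easier, here’s the whole thing scripted end to end (it assumes a system unzip binary is on the PATH):

using Downloads, CSV, DataFrames

# download the ACS PUMS zip, unpack it, and read the second CSV
url = "https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip"
zippath = Downloads.download(url, "csv_hus.zip")
run(`unzip -o $zippath`)    # relies on the external unzip command
df = CSV.read("psam_husb.csv", DataFrame)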

Publicly available datasets are good for this. Could you open an issue at the CSV.jl repo referencing this thread?

I’ve tried to reproduce this, but it seems to work for me. It looks like we’re using the same package version, but I’m on macOS with Julia 1.7.1.
I wonder if it’s related to how the CSV is unzipped…
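If it is the unzip step, comparing a checksum of the unzipped file would at least tell us whether we’re parsing the same bytes. A quick sketch using the SHA standard library:

using SHA

# hash the unzipped CSV so the bytes can be compared across machines
open("psam_husb.csv") do io
    println(bytes2hex(sha256(io)))
end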

Reported here: Parsing fails with long strings · Issue #1009 · JuliaData/CSV.jl · GitHub

OK, thanks to the people who tried it. I deleted the zip file and re-downloaded it, and now it works. Sure enough, there was a stray quote character at the end of line 14 of my unzipped file… weird, huh? Bit flips happen, I guess.
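In hindsight, that one stray quote explains the error: presumably the parser treats everything from the stray quote to the next one as a single quoted field, and once that field passes the 2^20 - 1 byte cap, Parsers.PosLen gives up. A quick way to hunt for this kind of corruption is to scan for lines with an odd number of quote characters (legitimately quoted multi-line fields would also show up, so treat it as a heuristic):

# print the line numbers whose count of double-quote characters is odd
for (i, line) in enumerate(eachline("psam_husb.csv"))
    if isodd(count(==('"'), line))
        println(i, ": ", first(line, 80))    # show just the start of the suspect line
    end
end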