I'm brand new to Julia and am unable to create a DataFrame from a file that I've been reading successfully (albeit slowly) with Pandas. The input is a single 5GB, space-delimited file with 80M lines. There are a few missing values, but those seem to parse fine.
My code to read the CSV is:
df = DataFrame(CSV.File(input_path, header=["project", "url", "count", "bytes"], delim=" ", limit=100000))
If I leave the limit parameter in place, the DataFrame loads fine. But if I remove it, or increase it much beyond 100,000, I get the following error:
ERROR: LoadError: ArgumentError: length argument to Parsers.PosLen (16340010) is too large; max length allowed is 1048575
Stacktrace:
[1] lentoolarge(len::Int64)
@ Parsers ~/.julia/packages/Parsers/a3jNK/src/utils.jl:302
[2] PosLen
@ ~/.julia/packages/Parsers/a3jNK/src/utils.jl:306 [inlined]
[3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
@ Parsers ~/.julia/packages/Parsers/a3jNK/src/strings.jl:262
[4] xparse
@ ~/.julia/packages/Parsers/a3jNK/src/strings.jl:3 [inlined]
[5] parsevalue!(#unused#::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
@ CSV ~/.julia/packages/CSV/9LsxT/src/file.jl:817
[6] parserow
@ ~/.julia/packages/CSV/9LsxT/src/file.jl:688 [inlined]
This is my first foray into Julia, so I'm not sure how to tell what is going wrong here. I have some theories, but I'm not sure how to chase them down.
Is there a limit on the number of rows that can be read at one time? Is the parser getting tripped up on a particular row that doesn't parse correctly in Julia? The Pandas DataFrame fits in memory, but is it possible Julia has tighter memory requirements? Or is this simply the wrong Julia API for loading large datasets?
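One thing I noticed: the error says the parser saw a single value 16,340,010 bytes long, against a cap of 1,048,575 (2^20 - 1), which makes me suspect a single bad row (or something like a stray quote character causing many rows to be swallowed into one field). Here is a rough diagnostic sketch I put together to test that theory by scanning the raw file for any space-delimited field longer than the cap; the function name and the approach are just my own idea, not anything from the CSV.jl docs:

```julia
# Sketch: scan a delimited file for fields longer than the parser's cap,
# to check whether one oversized field is what trips up CSV.File.
const MAXLEN = 1048575  # the max length reported in the error message

function overlong_fields(io::IO; delim::Char=' ', maxlen::Int=MAXLEN)
    hits = Tuple{Int,Int}[]              # (line number, field byte length)
    for (i, line) in enumerate(eachline(io))
        for field in split(line, delim)
            n = ncodeunits(field)        # byte length, as the parser sees it
            n > maxlen && push!(hits, (i, n))
        end
    end
    return hits
end

# overlong_fields(open(input_path))     # run against the real 5GB file
```

Would something like this be a reasonable way to narrow it down, or is there a better built-in way to find the offending row?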
Any suggestions on how to proceed would be appreciated. Thanks!