CSV Error - Feather Error

I’m very new to Julia Programming.

I have a folder of 14 CSV files that I join into one big data frame (262,673,020 rows x 77 columns), and I'm trying to save it as one big CSV. When I use CSV.write, I get this ERROR: BoundsError: attempt to access 4194304-element Array{UInt8,1} at index [1:4194305].

So I tried to save it as a Feather file instead, but I get this ERROR: InexactError: trunc(Int32, 2147483662). This error looks like it is hitting some 32-bit maximum, but I'm not sure why.

I'm not sure what is going on; I just need some help understanding what to do.

MacBook Pro - Atom/Juno
Packages - Julia 1.5.2 - CSV v0.7.7 - DataFrames v0.21.8 - Glob v1.3.0 - Tables v1.1.0 - Feather v0.5.6

using Glob, CSV, Tables, Feather

fileDirectory = "location/CSV"
files = glob("*.csv", fileDirectory)

list_df = [DataFrame(CSV.read(f)) for f in files]
Join_DF = join(list_df[3], list_df[4], list_df[5], list_df[6], list_df[7], list_df[8], list_df[9], list_df[10], list_df[11], list_df[12], list_df[13], list_df[14], on = :INC_KEY, kind = :outer)

Feather.write("location/join_files.feather", Join_DF) 
# ERROR: InexactError: trunc(Int32, 2147483662)

CSV.write("location/join_files.csv", Join_DF) 
# ERROR: BoundsError: attempt to access 4194304-element Array{UInt8,1} at index [1:4194305].
ERROR: BoundsError: attempt to access 4194304-element Array{UInt8,1} at index [1:4194305]
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:541
 [2] checkbounds at ./abstractarray.jl:506 [inlined]
 [3] view at ./subarray.jl:158 [inlined]
 [4] writecell(::Array{UInt8,1}, ::Int64, ::Int64, ::IOStream, ::Int64, ::CSV.Options{UInt8,UInt8,Nothing,Tuple{}}) at /Users/.julia/packages/CSV/4GOjG/src/write.jl:147
 [5] #64 at /Users/.julia/packages/CSV/4GOjG/src/write.jl:182 [inlined]
 [6] macro expansion at /Users/.julia/packages/Tables/FXXeK/src/utils.jl:54 [inlined]
 [7] eachcolumn at /Users/.julia/packages/Tables/FXXeK/src/utils.jl:48 [inlined]
[8] writerow(::Array{UInt8,1}, ::Base.RefValue{Int64}, ::Int64, ::IOStream, ::Tables.Schema{(
[9] #55 at /Users/.julia/packages/CSV/4GOjG/src/write.jl:80 [inlined]
[10] (::CSV.var"#62#63"{CSV.var"#55#56"{Bool,Tables.Schema{(
[12] open(::Function, ::String, ::String) at ./io.jl:323
 [13] with at /Users/.julia/packages/CSV/4GOjG/src/write.jl:139 [inlined]
 [14] #write#54 at /Users/.julia/packages/CSV/4GOjG/src/write.jl:73 [inlined]
 [15] write(::Tables.Schema{(
 [16] write(::String, ::DataFrame; delim::Char, quotechar::Char, openquotechar::Nothing, closequotechar::Nothing, escapechar::Char, newline::Char, decimal::Char, dateformat::Nothing, quotestrings::Bool, missingstring::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/.julia/packages/CSV/4GOjG/src/write.jl:60
 [17] write(::String, ::DataFrame) at /Users/.julia/packages/CSV/4GOjG/src/write.jl:53
 [18] top-level scope at none:1
 [1] throw_inexacterror(::Symbol, ::Type{Int32}, ::Int64) at ./boot.jl:558
 [2] checked_trunc_sint at ./boot.jl:580 [inlined]
 [3] toInt32 at ./boot.jl:617 [inlined]
 [4] Int32 at ./boot.jl:707 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] setindex! at ./array.jl:847 [inlined]
 [7] offsets(::Type{Int32}, ::Type{UInt8}, ::PooledArrays.PooledArray{Union{Missing, String},UInt32,1,Array{UInt32,1}}) at /Users/.julia/packages/Arrow/q3tEJ/src/lists.jl:300
 [8] Arrow.NullableList{String,Int32,P} where P<:Arrow.AbstractPrimitive(::Type{UInt8}, ::PooledArrays.PooledArray{Union{Missing, String},UInt32,1,Array{UInt32,1}}) at /Users/.julia/packages/Arrow/q3tEJ/src/lists.jl:243
 [9] NullableList at /Users/.julia/packages/Arrow/q3tEJ/src/lists.jl:251 [inlined]
 [10] arrowformat at /Users/.julia/packages/Arrow/q3tEJ/src/arrowvectors.jl:242 [inlined]
 [11] getarrow(::PooledArrays.PooledArray{Union{Missing, String},UInt32,1,Array{UInt32,1}}) at /Users/.julia/packages/Feather/y64Pt/src/sink.jl:40
 [12] write(::IOStream, ::DataFrame; description::String, metadata::String) at /Users/.julia/packages/Feather/y64Pt/src/sink.jl:18
 [13] #20 at /Users/.julia/packages/Feather/y64Pt/src/sink.jl:32 [inlined]
 [14] open(::Feather.var"#20#21"{String,String,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:325
 [15] open at ./io.jl:323 [inlined]
 [16] #write#19 at /Users/.julia/packages/Feather/y64Pt/src/sink.jl:31 [inlined]
 [17] write(::String, ::DataFrame) at /Users/.julia/packages/Feather/y64Pt/src/sink.jl:31
 [18] top-level scope at none:1

cc @quinnj

Update - For the community, thanks to @bkamins:

save_df = DataFrames._check_consistency(Join_DF) # returns nothing, so the data frame itself is internally consistent

@bkamins @quinnj - Hello, can you suggest some other file format that works with DataFrames? I was thinking of trying JLD2 next.

CSV.jl should work. Let us just wait until @quinnj can have a look at your problem.

@bkamins @quinnj Update - out of pure curiosity, I decided to try this in a Jupyter notebook, and it looks like it works: it produced a 203 GB file. So it looks like the problem is with Atom/Juno. I also installed JuliaPro, but I get some other errors with that, so in my case it's the Atom setup that does not work for me. Any ideas why this is happening?

There have been issues before with Juno/Atom, where the Julia extensions/plugin itself loads packages at specific versions, which then messes up loading those packages later; it’s really unfortunate because there’s no indication that the package was already loaded, and even worse, that oftentimes a different version of the package was already loaded. This leads to issues, in theory, where you’re running into “old bugs” because an older CSV.jl version is being loaded; bugs which have been fixed in newer versions.

Off the top of my head, that error doesn’t ring a bell for bugs that have been fixed, but this issue comes to mind, where a single row exceeds the 4MB default buffer size when writing. And on that note, I noticed that we actually weren’t allowing passing a bigger bufsize to CSV.write (we only allowed for CSV.RowWriter); so I opened this PR to allow passing CSV.write(file, tbl; bufsize=2^24), for example, to increase the buffer size for writing a single row. The reason for this is that we write to an internal buffer and “flush” the buffer to the output IO only after an entire row is written, so if a single row is too big, it will “overflow”.
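The row-buffer overflow described above can be sketched like this (the cell size and file name are illustrative, and the `bufsize` keyword assumes a CSV.jl version where the PR mentioned above has landed):

```julia
using CSV, DataFrames

# One cell larger than the default 4 MB (2^22-byte) write buffer is
# enough to overflow it, since the buffer is only flushed to the output
# IO after an entire row has been written.
big_cell = repeat("x", 5_000_000)        # ~5 MB string in a single cell
df = DataFrame(a = [big_cell], b = [1])

# With a CSV.jl version that accepts `bufsize` (per the PR mentioned
# above), a larger buffer lets the row fit:
# CSV.write("big_row.csv", df; bufsize = 2^24)   # 16 MiB buffer
```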

On the Feather issue, it indeed has a limitation of using UInt32 for some internal length variables, which inherently limits the total size of files; this isn’t a Julia limitation, but mandated by the feather format itself. The more recent “upgrade” to feather is the apache/arrow project, including the julia implementation in Arrow.jl. It allows writing large tables in “batches”, which is currently supported in Arrow.write by inputs that are “partitioned” by overloading Tables.partitions. If you have a bunch of csv files, the Tables.jl package actually provides a Tables.partitioner, so you could do something like:

using Arrow, CSV, Tables, Glob

parts = Tables.partitioner(CSV.File, files)
Arrow.write("join_files.arrow", parts)

This will write each csv file as a separate “record batch” in the output arrow table, which can then be read back in, in Julia, like:

tbl = Arrow.read("join_files.arrow")

which will read all the record batches together as a single, “long” table.

The Arrow.jl package is fairly new, so let me know if you run into any issues.
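To work with the result as a DataFrame again, here is a minimal self-contained sketch of the round trip (the two tiny in-memory tables and the file name are illustrative stand-ins for the partitioned CSV files above):

```julia
using Arrow, DataFrames, Tables

# Write two small NamedTuple column tables as separate record batches,
# mirroring the partitioned-CSV pattern above, then read them back as
# one long table and wrap it in a DataFrame.
parts = Tables.partitioner([(x = [1, 2],), (x = [3, 4],)])
Arrow.write("demo.arrow", parts)
df = DataFrame(Arrow.read("demo.arrow"))
```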


@quinnj @bkamins Once again thank you both for your time. @quinnj this is very interesting I will try it.