Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

Just wanted to say that I’ve been able to stream a large Arrow file (54 GB, 4.5 million rows, 2,200 columns) using Arrow.jl + TableOperations.jl to generate complex statistical aggregations with DataFrames.jl while keeping memory usage around 1 GB. The resulting solution ended up running faster than some C++ code that used a proprietary binary format, and now a much wider audience can make improvements to that process.

I got a lot of valuable information on how to do that from this thread: "How well Apache Arrow’s zero copy methodology is supported?" (Specific Domains / Data, julialang.org).
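For anyone curious what this kind of streaming looks like, here is a rough sketch of the general pattern, not my actual code. It assumes the file was written as multiple record batches, since `Arrow.Stream` iterates one batch at a time and that is what keeps memory bounded. The toy data, the column names (`group`, `value`, `unused`) and the choice of a per-group sum and count are all made up for illustration.

```julia
using Arrow, DataFrames, Tables, TableOperations

# Hypothetical toy data standing in for the large file.
# Each element of the vector becomes one record batch in the Arrow file.
path = joinpath(mktempdir(), "demo.arrow")
parts = Tables.partitioner([
    (group = ["a", "b", "a"], value = [1.0, 2.0, 3.0], unused = [0, 0, 0]),
    (group = ["b", "b", "a"], value = [4.0, 5.0, 6.0], unused = [0, 0, 0]),
])
Arrow.write(path, parts)

# Stream the file one record batch at a time and keep only small
# per-batch partial aggregates around.
partials = DataFrame[]
for batch in Arrow.Stream(path)
    # Select just the columns needed so the others are never materialized.
    cols = TableOperations.select(batch, :group, :value)
    df = DataFrame(cols)
    push!(partials,
          combine(groupby(df, :group), :value => sum => :total, nrow => :n))
end

# Merge the partial aggregates into the final per-group result.
result = combine(groupby(reduce(vcat, partials), :group),
                 :total => sum => :total, :n => sum => :n)
result.mean = result.total ./ result.n
```

This only works directly for aggregations that can be merged from partial results (sums, counts, min/max, and anything derived from them such as means). Statistics like exact medians would need a different strategy.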
