Just wanted to say that I’ve been able to stream a large Arrow file (54 GB, 4.5 million rows, 2,200 columns) using Arrow.jl + TableOperations.jl to generate complex statistical aggregations with DataFrames.jl while keeping memory usage around 1 GB. The resulting solution ended up running faster than some C++ code that used a proprietary binary format, and now a much wider audience can make improvements to that process.
Got a lot of valuable information on how to do that in this thread: "How well Apache Arrow’s zero copy methodology is supported?" (julialang.org).
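
For anyone wondering what such a pipeline can look like, here is a minimal sketch of the general pattern, not the actual code described above. It assumes the file was written as several record batches, and the file path and the column names (`group`, `value`, `unused`) are made up for illustration. Each batch is read with `Arrow.Stream`, narrowed to the needed columns with `TableOperations.select`, aggregated with DataFrames.jl, and only the small per-batch results are kept.

```julia
using Arrow, DataFrames, TableOperations, Tables

# Toy stand-in for the large file: three record batches.
# The path and column names are hypothetical.
path = joinpath(mktempdir(), "toy.arrow")
parts = [DataFrame(group = repeat(["a", "b"], 50),
                   value = fill(Float64(i), 100),
                   unused = zeros(100)) for i in 1:3]
Arrow.write(path, Tables.partitioner(parts))  # one record batch per partition

# Stream the file batch by batch, keeping only small partial aggregates.
partials = DataFrame[]
for batch in Arrow.Stream(path)
    # Select only the columns the aggregation needs before materializing.
    df = batch |> TableOperations.select(:group, :value) |> DataFrame
    push!(partials, combine(groupby(df, :group),
                            :value => sum => :total,
                            nrow => :n))
end

# Merge the per-batch partial sums and counts into the final result.
merged = combine(groupby(reduce(vcat, partials), :group),
                 :total => sum => :total,
                 :n => sum => :n)
merged.mean = merged.total ./ merged.n
```

This only works directly for aggregations that can be merged from partial results (sums, counts, means via sum and count). Memory stays bounded by the batch size only if the file really contains multiple record batches, so how the file was written matters as much as how it is read.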