Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

Marcelo_Simas · April 18, 2024, 11:38pm

Just wanted to say that I’ve been able to stream a large Arrow file (54 GB, 4.5 million rows, 2,200 columns) using Arrow.jl + TableOperations.jl to generate complex statistical aggregations with DataFrames.jl while keeping memory usage around 1 GB. The resulting solution ended up running faster than some C++ code that was using a proprietary binary format, and now a much wider audience can make improvements to that process.
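For anyone who wants to try the same pattern, here is a minimal sketch of batch-wise streaming, not the actual project code. It assumes the file was written as multiple record batches, and the column names (`grp`, `x`, `unused`), the tiny demo file, and the per-group mean are all made up for illustration. Each record batch is read with `Arrow.Stream`, narrowed to the needed columns with `TableOperations.select`, aggregated with DataFrames.jl, and folded into running totals so that only one batch is materialized at a time.

```julia
using Arrow, DataFrames, Tables, TableOperations

# Write a small demo file with two record batches (hypothetical data).
path = joinpath(mktempdir(), "demo.arrow")
parts = Tables.partitioner([
    (grp = ["a", "b", "a"], x = [1.0, 2.0, 3.0], unused = [0, 0, 0]),
    (grp = ["b", "a"],      x = [4.0, 5.0],      unused = [0, 0]),
])
Arrow.write(path, parts)

# Running totals per group, updated one record batch at a time.
sums   = Dict{String,Float64}()
counts = Dict{String,Int}()

for batch in Arrow.Stream(path)
    # Keep only the columns the aggregation needs.
    cols = TableOperations.select(batch, :grp, :x)
    df = DataFrame(cols)

    # Aggregate within this batch only.
    partial = combine(groupby(df, :grp), :x => sum => :s, nrow => :n)

    # Merge the per-batch result into the running totals.
    for row in eachrow(partial)
        sums[row.grp]   = get(sums, row.grp, 0.0) + row.s
        counts[row.grp] = get(counts, row.grp, 0) + row.n
    end
end

# Final per-group means computed from the accumulated totals.
means = Dict(g => sums[g] / counts[g] for g in keys(sums))
```

This only works directly for aggregations that can be combined across batches (sums, counts, min/max, and anything derived from them); how much memory it uses depends on the record batch size chosen when the file was written.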

Got a lot of valuable information on how to do that from this thread: "How well Apache Arrow’s zero copy methodology is supported?" (Specific Domains / Data, julialang.org).
