Writing Arrow files by column


It appears (and I have observed) that Arrow.jl requires the entire table to be held in memory before flushing to disk. That is prohibitively large for my use case (500 GB+), and the ChainedVectors produced by batch writing make reading quite slow.

I am not comfortable enough with Julia IO to work out how to write in column batches, but I assume it should be possible. I'll attempt it myself if someone can provide hints on where to start (or confirm that it's impossible).
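For what it's worth, the Arrow IPC file format is organised as row-chunked record batches rather than column-by-column, so true per-column streaming isn't possible; but you can stream record batches so that only one batch is in memory at a time. A minimal sketch using `Tables.partitioner` (the file names and tiny stand-in CSVs here are just for illustration):

```julia
using Arrow, CSV, Tables

# Tiny stand-in CSVs (assumption: your real files all share the same columns).
write("data_001.csv", "id,value\n1,2.5\n")
write("data_002.csv", "id,value\n2,3.5\n")
files = ["data_001.csv", "data_002.csv"]

# Tables.partitioner lazily maps CSV.File over the file list; Arrow.write
# then streams each partition to disk as its own record batch, so only one
# file's worth of data needs to be in memory at a time.
Arrow.write("combined.arrow", Tables.partitioner(CSV.File, files))
```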



Does Arrow.append do what you want?

I have tried using Arrow.append but, so far, without success. I'm reading data from a couple of hundred separate CSV files, all with the same columns, and trying to consolidate them into a single Arrow file. However, the data held in each column differs between files (in some files, some columns contain entirely missing data, for example), so I always get an error about inconsistent Arrow schemas. I haven't yet had time to work out how to address this.
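One way to sidestep the schema mismatch is to pin the column types at parse time, so every file produces the same missing-allowing schema regardless of its contents. A sketch, assuming you know each column's intended element type up front (the file and column names here are invented for illustration):

```julia
using Arrow, CSV

# Tiny stand-in CSVs; the second has an entirely-missing `value` column,
# which is the kind of thing that triggers a schema mismatch on append.
write("part_a.csv", "id,value\n1,2.5\n")
write("part_b.csv", "id,value\n2,\n")
files = ["part_a.csv", "part_b.csv"]

# Forcing CSV.jl to parse every file with the same Union{T,Missing} types
# keeps the Arrow schema identical across appends.
coltypes = Dict(:id => Union{Int,Missing}, :value => Union{Float64,Missing})

# Arrow.append requires the IPC *stream* format, which Arrow.write
# produces when handed an IO rather than a file path.
open("combined_stream.arrow", "w") do io
    Arrow.write(io, CSV.File(files[1]; types=coltypes))
end
for f in files[2:end]
    Arrow.append("combined_stream.arrow", CSV.File(f; types=coltypes))
end
```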