Writing Arrow files by column


It appears (and I have observed) that Arrow.jl requires the entire table to be held in memory before flushing to disk. That is prohibitively large for my use case (500 GB+), and the ChainedVectors produced by batch writing make reading quite slow.

I am not comfortable enough with Julia IO to work out how to write in column batches, but I assume it should be possible. I'll attempt it myself if someone can provide hints on where to start (or confirm that it's impossible).
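For what it's worth, the Arrow IPC file format is organised as row-chunked record batches rather than column-by-column, so true per-column streaming isn't possible; but you can stream record batches so that only one batch is in memory at a time. A minimal sketch using `Tables.partitioner` (the file names and tiny stand-in CSVs here are just for illustration):

```julia
using Arrow, CSV, Tables

# Tiny stand-in CSVs (assumption: your real files all share the same columns).
write("data_001.csv", "id,value\n1,2.5\n")
write("data_002.csv", "id,value\n2,3.5\n")
files = ["data_001.csv", "data_002.csv"]

# Tables.partitioner lazily maps CSV.File over the file list; Arrow.write
# then streams each partition to disk as its own record batch, so only one
# file's worth of data needs to be in memory at a time.
Arrow.write("combined.arrow", Tables.partitioner(CSV.File, files))
```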



Does Arrow.append do what you want?

I have tried using Arrow.append but, so far, without success. I'm reading data from a couple of hundred separate CSV files, all with the same columns, and trying to consolidate them into a single Arrow file. However, the data held in each column differs between files (in some files, some columns contain entirely missing data, for example), so I always get an error about inconsistent Arrow schemas. I haven't yet had time to work out how to address this.
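One way to sidestep the schema mismatch is to pin the column types at parse time, so every file produces the same missing-allowing schema regardless of its contents. A sketch, assuming you know each column's intended element type up front (the file and column names here are invented for illustration):

```julia
using Arrow, CSV

# Tiny stand-in CSVs; the second has an entirely-missing `value` column,
# which is the kind of thing that triggers a schema mismatch on append.
write("part_a.csv", "id,value\n1,2.5\n")
write("part_b.csv", "id,value\n2,\n")
files = ["part_a.csv", "part_b.csv"]

# Forcing CSV.jl to parse every file with the same Union{T,Missing} types
# keeps the Arrow schema identical across appends.
coltypes = Dict(:id => Union{Int,Missing}, :value => Union{Float64,Missing})

# Arrow.append requires the IPC *stream* format, which Arrow.write
# produces when handed an IO rather than a file path.
open("combined_stream.arrow", "w") do io
    Arrow.write(io, CSV.File(files[1]; types=coltypes))
end
for f in files[2:end]
    Arrow.append("combined_stream.arrow", CSV.File(f; types=coltypes))
end
```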