I would like to test Julia’s support for Apache Arrow with a big file that doesn’t fit into memory (RAM). I can’t find an example of such a file anywhere; could you help me with this, please?
Thanks for the super fast replies @quinnj and @StatisticalMouse. I did something that combined your hints: I downloaded parquet files with R (by following Working with Arrow Datasets and dplyr • Arrow R Package) and then tried to combine them into one big Arrow file with: using Parquet; Arrow.write("taxi.arrow", Tables.partitioner(read_parquet("."))). These lines seem to crash Julia; I just get a “Killed” message. I have Julia 1.6.0.
Hmmmm, not sure; it may be running out of memory. I don’t think Parquet.jl currently supports partitioned datasets, so I suspect it is materializing the full parquet dataset in memory and then trying to write it out to Arrow.
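One thing that might help (a rough, untested sketch; it assumes all the parquet files sit in a single directory and that each one fits in memory on its own) is to hand Arrow.write a Tables.partitioner over the individual files, so only one parquet file is materialized at a time and each one becomes its own record batch in the output:

using Arrow, Parquet, Tables

# assumed layout: every .parquet file from the R download sits in the current directory
files = filter(endswith(".parquet"), readdir("."; join=true))

# Tables.partitioner applies read_parquet lazily, one file per partition,
# so Arrow.write streams one record batch per parquet file instead of
# materializing the whole dataset at once
Arrow.write("taxi.arrow", Tables.partitioner(read_parquet, files))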
I ended up creating the big Arrow file with pyarrow instead, using the lines below:
import glob
import pyarrow as pa

# all input files share the same schema, so take it from the first one
schema = pa.ipc.open_file(glob.glob("path/to/files/*.arrow")[0]).schema

with pa.output_stream("path/big.arrow") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for arrowfile in glob.glob("path/to/files/*.arrow", recursive=False):
            with pa.input_stream(arrowfile) as source:
                with pa.ipc.open_file(source) as reader:
                    # copy record batches one at a time; the full table is never materialized
                    for i in range(reader.num_record_batches):
                        writer.write_batch(reader.get_batch(i))
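For anyone wanting to test the original question with the resulting file, here is a rough, untested sketch of reading it back in Julia without pulling it all into RAM (the path is the one used above): Arrow.Table memory-maps the file rather than copying it, and Arrow.Stream iterates it one record batch at a time.

using Arrow, Tables

# memory-mapped: opening the file does not read the whole thing into RAM
tbl = Arrow.Table("path/big.arrow")

# iterate record batch by record batch; each batch is itself an Arrow.Table
for batch in Arrow.Stream("path/big.arrow")
    println(Tables.rowcount(batch))  # replace with whatever per-batch processing you need
end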