I’m trying to use Arrow’s Stream https://arrow.juliadata.org/dev/reference/#Arrow.Stream to get an efficient out-of-RAM tabular data iterator.
However, attempts to iterate over the Arrow stream only perform a single iteration, in which the entire table is read.
For example, here’s the test data creation:
```julia
using DataFrames
using Tables
using Arrow
using Random
using BenchmarkTools

Random.seed!(123)
nobs = 4_000_000
nfeats = 100
path = "data/test.arrow"
x = rand(nobs, nfeats)
df = DataFrame(x, :auto)
Arrow.write(path, df)
```
Now, when iterating over the Stream, the full table gets materialized into a DataFrame in a single iteration:
```julia
for x in Arrow.Stream(path)
    df = DataFrame(x)
    sz = size(df)
    println("size: $sz")
end
```

```
size: (4000000, 100)
```
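I suspect this happens because `Arrow.Stream` yields one element per record batch in the file, and `Arrow.write(path, df)` on a plain `DataFrame` writes the whole table as a single record batch, but I’m not sure that’s the full story.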
What I was hoping for from Arrow’s Stream is something behaving like `itr = Iterators.partition(stream, batch_size)`, so that each iteration would return a table/DataFrame with `batch_size` rows.
The docs read:

> This allows iterating over extremely large "arrow tables" in chunks represented as record batches.

which seems to point in that direction. I would appreciate knowing how to achieve such batch processing.
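For reference, here is a rough sketch of the kind of writing step I imagine might be needed (assuming `Arrow.write` emits one record batch per partition when the input is wrapped in `Tables.partitioner`; `batch_size` and `parts` are names I made up for illustration):

```julia
using DataFrames, Tables, Arrow

batch_size = 100_000  # hypothetical chunk size

# Wrap an iterator of row views so that each view is (presumably)
# written out as its own record batch.
parts = Tables.partitioner(
    view(df, rows, :) for rows in Iterators.partition(1:nrow(df), batch_size)
)
Arrow.write(path, parts)

# Each iteration should then yield one record batch of ~batch_size rows.
for batch in Arrow.Stream(path)
    println("size: ", size(DataFrame(batch)))
end
```

If that assumption is right, the chunking has to be decided at write time, since the stream just replays whatever record batches are stored in the file.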