I’m trying to use Arrow’s Stream https://arrow.juliadata.org/dev/reference/#Arrow.Stream to get an efficient out-of-RAM tabular data iterator.
However, attempts to iterate over the Arrow stream appear to perform only a single iteration, in which the entire table is read.
For example, here’s the test data creation:
using DataFrames
using Tables
using Arrow
using Random
using BenchmarkTools
Random.seed!(123)
nobs = 4_000_000
nfeats = 100
path = "data/test.arrow"
x = rand(nobs, nfeats)
df = DataFrame(x, :auto)
Arrow.write(path, df)
Now, when iterating over the Stream, the full table gets materialized as a DataFrame in a single iteration:
for x in Arrow.Stream(path)
    df = DataFrame(x)
    sz = size(df)
    println("size: $sz")
end
size: (4000000, 100)
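My guess is that the file written above contains a single record batch, since Arrow.write was handed one monolithic DataFrame, and that the write side would need to be partitioned for the Stream to yield chunks. Something like the sketch below is what I had in mind (Tables.partitioner and the 500_000-row chunk size are my own assumptions, not something I have confirmed against the docs), but I don’t know whether this is the intended approach:
# guess at a partitioned write, so the file would hold several record batches
# instead of one; chunk size of 500_000 rows is arbitrary
chunk = 500_000
parts = Tables.partitioner(view(df, i:min(i + chunk - 1, nobs), :) for i in 1:chunk:nobs)
Arrow.write(path, parts)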
My wish regarding Arrow’s Stream was to have something behaving like itr = Iterators.partition(stream, batch_size), so that each iteration would return a table/DataFrame with batch_size rows.
The docs read “This allows iterating over extremely large "arrow tables" in chunks represented as record batches”, which seems to point in that direction. I would appreciate any guidance on how to achieve such batch processing.
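To be explicit about the end goal, here is the kind of loop I hope to end up with, assuming the file can somehow be written as several record batches like in the sketch above (the per-batch mean is only placeholder work):
using Statistics

# desired out-of-RAM batch processing: each iteration materializes one chunk
# rather than the full 4_000_000 x 100 table
for batch in Arrow.Stream(path)
    df_batch = DataFrame(batch)
    println("batch size: ", size(df_batch), "  mean(x1) = ", mean(df_batch.x1))
end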