I have a large amount of data that I am loading from an API, and I would like to write the output of each call to a “row” of a file, so that I don't have to keep it all in RAM and don't lose it if the program errors. I have been using the Arrow.jl package for my data reading/writing needs recently and really like it. However, I have not found a way to append rows to an Arrow file. Is that possible?
Thanks!
Here is my current example that does not work:
using Arrow, Tables
dat = [(a=1, b=2), (a=3, b=4)]
open("test.feather", "a+") do io
    for row in dat
        Arrow.write(io, [row])
    end
end
Arrow.Table("test.feather") |> Tables.rowtable
Just in case it is helpful to anyone: as noted above, I think this can be done now if you wrap each named tuple in a vector.
So something like:
dat = [[(a=1, b=2)], [(a=3, b=4)]]
for d in dat
    Arrow.append("newfilepath", d)
end
bkamins has a good example and explanation of Arrow.append here.
Also, if you are appending to a pre-existing Arrow file, that file must have been written in the stream format, i.e. with Arrow.write(filename::String, tbl; file=false), with the file keyword argument set to false, for append to work. See API Reference · Arrow.jl
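To make the file=false point concrete, here is a minimal sketch of writing an initial table in the stream format and then appending one more row to it. The temporary path and the sample rows are made up for illustration; only Arrow.write, Arrow.append and Arrow.Table come from the package.

```julia
using Arrow, Tables

# Hypothetical location; any writable path works.
path = joinpath(mktempdir(), "test.arrow")

# Write the initial table in the IPC stream format (file=false),
# which is the format Arrow.append can extend.
Arrow.write(path, [(a=1, b=2)]; file=false)

# Append a one-row table; the schema must match the existing data.
Arrow.append(path, [(a=3, b=4)])

# Read everything back as a vector of named tuples.
rows = Arrow.Table(path) |> Tables.rowtable
```

If the initial file had been written with the default file=true, I would expect the Arrow.append call to throw rather than extend the file.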
I think this will produce a file with excessive metadata. You should minimize the number of times append is called, because each call writes a new “batch” to the file, and every batch carries its own metadata.
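One way to cut down the number of batches, as a rough sketch: collect rows in memory and only call Arrow.append once a buffer fills up. The helper flush_rows!, the batch size of 1_000 and the loop standing in for the API calls are all assumptions of mine, not anything from Arrow.jl.

```julia
using Arrow, Tables

# Hypothetical helper: append the buffered rows as a single batch,
# then empty the buffer.
function flush_rows!(path, rows)
    isempty(rows) && return nothing
    Arrow.append(path, rows)
    empty!(rows)
    return nothing
end

path = joinpath(mktempdir(), "batched.arrow")  # hypothetical location
batch_size = 1_000                              # assumed value; tune to your data
buffer = NamedTuple{(:a, :b),Tuple{Int,Int}}[]

for i in 1:2_500                                # stand-in for the API calls
    push!(buffer, (a=i, b=2i))
    length(buffer) >= batch_size && flush_rows!(path, buffer)
end
flush_rows!(path, buffer)                       # write whatever is left over

tbl = Arrow.Table(path)
```

With these numbers append is called three times (1000, 1000 and 500 rows) instead of 2500 times. The trade-off is that a crash loses up to batch_size - 1 rows that were still in the buffer, so the batch size is a balance between file overhead and how much work you are willing to redo.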