Write data to Arrow file row by row

I have a large amount of data I am loading from an API, and I would like to write the output of each call to a “row” of a file so I don't have to keep it all in RAM, and so partial results survive if the program errors. I have been using the Arrow.jl package for my data reading/writing needs recently and really like it. However, I have not found a way to append rows to an Arrow file and was wondering if that is possible?

Thanks!

Here is my current example that does not work:

using Arrow, Tables

dat = [(a=1, b=2), (a=3, b=4)]
open("test.feather", "a+") do io
    for row in dat
        Arrow.write(io, [row])  # each call writes a complete Arrow file, so the result is not one readable table
    end
end

Arrow.Table("test.feather") |> Tables.rowtable

Did you find an answer for this?

https://stackoverflow.com/questions/66388141/how-to-append-a-dataframe-to-an-existing-apache-arrow-file-on-disk

I think the answer is no

I ended up using the jsonlines format instead. If the file gets too large, I compress it with GZip.jl.
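For anyone curious, a minimal sketch of the jsonlines approach, assuming JSON3.jl for serialization (the file name is illustrative; GZip.jl can wrap the stream the same way):

```julia
using JSON3

# Append one JSON object per line; opening in "a" mode means each API call
# can write its row immediately and nothing is lost if the program errors later.
open("data.jsonl", "a") do io
    println(io, JSON3.write((a=1, b=2)))
    println(io, JSON3.write((a=3, b=4)))
end
```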

I don’t see it in the docs, but … added some support for this.


Just in case it is helpful to anyone: as noted above, I think this can now be done if you wrap the named tuples in a vector.

so something like

dat = [[(a=1, b=2)], [(a=3, b=4)]]
for d in dat
    Arrow.append("newfilepath", d)
end

bkamins has a good example and explanation of Arrow.append here.

Also, if you are appending to a pre-existing Arrow file, it must have been written like Arrow.write(filename::String, tbl; file=false), i.e. with the file keyword argument set to false, for append to work. See API Reference · Arrow.jl
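Putting the two steps together, a minimal end-to-end sketch (the path is illustrative):

```julia
using Arrow, Tables

path = tempname() * ".arrow"   # illustrative path

# The first batch must be written in the *stream* format (file=false)
# so that the file can be extended later.
Arrow.write(path, [(a=1, b=2)]; file=false)

# Subsequent batches can then be appended as the data comes in.
Arrow.append(path, [(a=3, b=4)])

Arrow.Table(path) |> Tables.rowtable  # both rows are visible when reading back
```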

I think this will produce a file with excessive metadata. You should minimize how often append is called, because each call writes a new “record batch” to the file, which means more metadata.
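One way to limit the number of record batches is to buffer rows and append in chunks. A sketch, assuming an empty, typed row table is accepted for seeding the stream-format file (the path, batch size, and loop standing in for API calls are all illustrative):

```julia
using Arrow

path = tempname() * ".arrow"                      # illustrative path
rowtype = NamedTuple{(:a, :b), Tuple{Int, Int}}

# Seed an empty stream-format file so Arrow.append can extend it.
Arrow.write(path, rowtype[]; file=false)

buffer = rowtype[]
batchsize = 100                                   # illustrative chunk size
for i in 1:250                                    # stand-in for repeated API calls
    push!(buffer, (a = i, b = 2i))
    if length(buffer) >= batchsize                # flush one record batch per chunk
        Arrow.append(path, buffer)
        empty!(buffer)
    end
end
isempty(buffer) || Arrow.append(path, buffer)     # flush any remainder
```

This writes three record batches instead of 250, at the cost of holding up to `batchsize` rows in RAM between flushes.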


Thanks so much for pointing this out! Sorry been out with a stomach bug so just saw this.