Write data to Arrow file row by row

I have a large amount of data I am loading from an API, and I would like to write the output of each call to a “row” of a file so I don't have to keep it all in RAM, and so partial results survive if the program errors. I have been using the Arrow.jl package for my data reading/writing needs recently and really like it. However, I have not found a way to append rows to an Arrow file and was wondering if that is possible?

Thanks!

Here is my current example that does not work:

using Arrow, Tables

dat = [(a=1, b=2), (a=3, b=4)]
open("test.feather", "a+") do io
    for row in dat
        Arrow.write(io, [row])  # each call writes a complete Arrow file, so the result is not one readable table
    end
end

Arrow.Table("test.feather") |> Tables.rowtable

Did you find an answer for this?

https://stackoverflow.com/questions/66388141/how-to-append-a-dataframe-to-an-existing-apache-arrow-file-on-disk

I think the answer is no

I ended up using the jsonlines format instead. If the file gets too large, I compress it with GZip.jl.
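For anyone curious, a minimal sketch of the jsonlines approach, assuming JSON3.jl for serialization (the file name is illustrative; GZip.jl can wrap the stream the same way):

```julia
using JSON3

# Append one JSON object per line; opening in "a" mode means each API call
# can write its row immediately and nothing is lost if the program errors later.
open("data.jsonl", "a") do io
    println(io, JSON3.write((a=1, b=2)))
    println(io, JSON3.write((a=3, b=4)))
end
```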

I don’t see it in the docs, but … added some support for this.


Just in case it is helpful to anyone: as noted above, I think this can now be done if you wrap the named tuples in a vector.

so something like

dat = [[(a=1, b=2)], [(a=3, b=4)]]
for d in dat
    Arrow.append("newfilepath", d)
end

bkamins has a good example and explanation of Arrow.append here.

Also, if you are appending to a pre-existing Arrow file, it must have been written like Arrow.write(filename::String, tbl; file=false), i.e. with the file keyword argument set to false, for append to work. See API Reference · Arrow.jl
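Putting the two steps together, a minimal end-to-end sketch (the path is illustrative):

```julia
using Arrow, Tables

path = tempname() * ".arrow"   # illustrative path

# The first batch must be written in the *stream* format (file=false)
# so that the file can be extended later.
Arrow.write(path, [(a=1, b=2)]; file=false)

# Subsequent batches can then be appended as the data comes in.
Arrow.append(path, [(a=3, b=4)])

Arrow.Table(path) |> Tables.rowtable  # both rows are visible when reading back
```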

I think this will produce a file with excessive metadata. You should minimize how often append is called, because each call writes a new “record batch” to the file, which means more metadata.
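One way to limit the number of record batches is to buffer rows and append in chunks. A sketch, assuming an empty, typed row table is accepted for seeding the stream-format file (the path, batch size, and loop standing in for API calls are all illustrative):

```julia
using Arrow

path = tempname() * ".arrow"                      # illustrative path
rowtype = NamedTuple{(:a, :b), Tuple{Int, Int}}

# Seed an empty stream-format file so Arrow.append can extend it.
Arrow.write(path, rowtype[]; file=false)

buffer = rowtype[]
batchsize = 100                                   # illustrative chunk size
for i in 1:250                                    # stand-in for repeated API calls
    push!(buffer, (a = i, b = 2i))
    if length(buffer) >= batchsize                # flush one record batch per chunk
        Arrow.append(path, buffer)
        empty!(buffer)
    end
end
isempty(buffer) || Arrow.append(path, buffer)     # flush any remainder
```

This writes three record batches instead of 250, at the cost of holding up to `batchsize` rows in RAM between flushes.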


Thanks so much for pointing this out! Sorry been out with a stomach bug so just saw this.