Serialization format that allows incremental writes to a file

Is there any data format that supports incremental writes and allows flushing to the file after each write? I found that arrow-julia supports incremental writes here: refactor Arrow.write to support incremental writes by baumgold · Pull Request #277 · apache/arrow-julia · GitHub

but it does not allow me to flush the IO to the file after a write without closing it. Any idea if this is supported? Or is there a different format that allows me to do this?

ASDF - the Advanced Scientific Data Format AFAIK has streaming writes, at least according to Low-level file layout — ASDF Standard 1.6.0 documentation / Introduction — ASDF Standard 1.6.0 documentation
Don’t know about the flushing, though.

There’s a Julia package by @schnetter at GitHub - eschnett/ASDF.jl: A Julia implementation of the Advanced Scientific Data Format (ASDF), but I don’t know its status.

Why can’t you flush after an incremental write with Arrow.jl? You can pass your own IO to Arrow.append and then call flush(io) yourself.

I tried that, but somehow it didn’t work? E.g.:

using Arrow
using Tables

row_A = (field=[1.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0])
row_B = (field=[3.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0])

io = open("test.arrow", "w")
Arrow.append(io, row_A)
flush(io)
tbl = Arrow.Table("test.arrow") # this has two rows
Arrow.append(io, row_B)
flush(io)
tbl = Arrow.Table("test.arrow") # this still has two rows; row_B is not there
close(io)

ASDF.jl should be working, but I am not using it any more. I switched to ADIOS2.jl as the file format, which has many more features.

I am using ADIOS2 when running simulations of PDEs. Every few iterations one writes some variables to the file and flushes them. This use case is very efficient with ADIOS2. In other respects, ADIOS2 is similar to HDF5, in that it is designed to hold multi-dimensional arrays with attributes.
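
Roughly, the write loop looks something like the sketch below. This is only an illustration; the function names (adios_open_serial, adios_put!, adios_perform_puts!, adios_close, mode_write) are from my reading of the ADIOS2.jl high-level API, so check the package documentation for the exact signatures.

using ADIOS2

# Sketch only — names and the exact flushing behavior are assumptions; see the ADIOS2.jl docs.
file = adios_open_serial("simulation.bp", mode_write)

state = zeros(100)
for iter in 1:1000
    # ... advance the PDE solver, updating `state` ...
    if iter % 10 == 0
        adios_put!(file, "state_$iter", state)   # one entry per output iteration
        adios_perform_puts!(file)                # hand the data to the engine without closing the file
    end
end

adios_close(file)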

-erik

I suspect what’s happening there is that the data is flushed (check the file size?), but the metadata isn’t updated until the file is closed.

This is very common; we don’t want to re-locate / re-write the metadata chunk every time we flush, I think?
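
One way to check that theory (a hypothetical diagnostic, reusing row_A and row_B from the example above):

io = open("test.arrow", "w")
Arrow.append(io, row_A)
flush(io)
sz1 = filesize("test.arrow")   # size on disk after the first flushed batch
Arrow.append(io, row_B)
flush(io)
sz2 = filesize("test.arrow")   # if this grew, the bytes for row_B were flushed even though the reader doesn't show them
@show sz1 sz2
close(io)
@show length(Arrow.Table("test.arrow").field)   # row count the reader sees once the stream is closed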

I mean, if that’s the case, how do I read my data back if my program crashes before the metadata is written?

Thanks! This seems to be what I want

I don’t know if Arrow.jl is at fault here (i.e., our implementation is bad) or if it’s a general Arrow design issue; they may not have crash recovery as a design goal.

For the closely related Parquet format, it seems to be a thing: Error Recovery | Apache Parquet


OK, I think I just did this on my own: a custom data format that, given the data I’d like to flush to disk, is quite simple to write. I don’t believe Arrow works out for me in the end, but thanks for everyone’s replies here.
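
For anyone curious, the idea is roughly the sketch below: a minimal illustration of this kind of length-prefixed append-and-flush format using the Serialization stdlib, not my exact code. The file and function names here are just placeholders.

using Serialization

# Append one record as [8-byte payload length][serialized payload], then flush,
# so a crash loses at most the trailing partial record.
function append_record(io::IO, record)
    buf = IOBuffer()
    serialize(buf, record)
    payload = take!(buf)
    write(io, UInt64(length(payload)))
    write(io, payload)
    flush(io)
end

# Read back every complete record; stop silently at a truncated tail.
function read_records(path::AbstractString)
    records = []
    sz = filesize(path)
    open(path, "r") do io
        while position(io) + 8 <= sz
            len = Int(read(io, UInt64))
            position(io) + len <= sz || break   # incomplete payload from a crash
            push!(records, deserialize(IOBuffer(read(io, len))))
        end
    end
    return records
end

io = open("rows.bin", "a")
append_record(io, (field=[1.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0]))
append_record(io, (field=[3.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0]))
close(io)
read_records("rows.bin")   # complete records come back even if the program crashed before close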

@quinnj I think it’s pretty important to support incremental writes?