Serialization format that allows incremental writes to a file

Is there any data format that supports incremental writes and allows flushing to the file after each write? I found that arrow-julia supports incremental writes here: refactor Arrow.write to support incremental writes by baumgold · Pull Request #277 · apache/arrow-julia · GitHub

but it does not allow me to flush the IO to the file after a write without closing it. Any idea if this is supported? Or is there a different format that allows me to do this?

ASDF - the Advanced Scientific Data Format AFAIK has streaming writes, at least according to Low-level file layout — ASDF Standard 1.6.0 documentation / Introduction — ASDF Standard 1.6.0 documentation
Don’t know about the flushing, though.

There’s a Julia package by @schnetter at GitHub - eschnett/ASDF.jl: A Julia implementation of the Advanced Scientific Data Format (ASDF), but I don’t know its status.

Why can’t you flush after an incremental write with Arrow.jl? You can pass your own IO to Arrow.append and then call flush(io) yourself.

I tried that, but somehow it didn’t work? E.g.:

using Arrow
using Tables

row_A = (field=[1.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0])
row_B = (field=[3.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0])

io = open("test.arrow", "w")
Arrow.append(io, row_A)
flush(io)
tbl = Arrow.Table("test.arrow") # this has two rows
Arrow.append(io, row_B)
flush(io)
tbl = Arrow.Table("test.arrow") # this still has two rows; row_B is not there
close(io)

ASDF.jl should be working, but I am not using it any more. I switched to ADIOS2.jl as the file format, which has many more features.

I am using ADIOS2 when running simulations of PDEs. Every few iterations one writes some variables to the file and flushes them. This use case is very efficient with ADIOS2. In other respects, ADIOS2 is similar to HDF5, in that it is designed to hold multi-dimensional arrays with attributes.
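
Roughly, the write loop looks something like the sketch below. This is only an illustration; the function names (adios_open_serial, adios_put!, adios_perform_puts!, adios_close, mode_write) are from my reading of the ADIOS2.jl high-level API, so check the package documentation for the exact signatures.

using ADIOS2

# Sketch only — names and the exact flushing behavior are assumptions; see the ADIOS2.jl docs.
file = adios_open_serial("simulation.bp", mode_write)

state = zeros(100)
for iter in 1:1000
    # ... advance the PDE solver, updating `state` ...
    if iter % 10 == 0
        adios_put!(file, "state_$iter", state)   # one entry per output iteration
        adios_perform_puts!(file)                # hand the data to the engine without closing the file
    end
end

adios_close(file)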

-erik

I suspect what’s happening there is that the data is flushed (check the file size?), but the metadata isn’t updated until the file is closed.

This is very common; we don’t want to re-locate / re-write the metadata chunk every time we flush, I think?
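
One way to check that theory (a hypothetical diagnostic, reusing row_A and row_B from the example above):

io = open("test.arrow", "w")
Arrow.append(io, row_A)
flush(io)
sz1 = filesize("test.arrow")   # size on disk after the first flushed batch
Arrow.append(io, row_B)
flush(io)
sz2 = filesize("test.arrow")   # if this grew, the bytes for row_B were flushed even though the reader doesn't show them
@show sz1 sz2
close(io)
@show length(Arrow.Table("test.arrow").field)   # row count the reader sees once the stream is closed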

I mean, if that’s the case, how do I read my data back if my program crashes before the metadata is written?

Thanks! This seems to be what I want

I don’t know if Arrow.jl is at fault here (i.e., our implementation is bad) or if it’s a general Arrow design issue; they may not have crash recovery as a design goal.

For the closely related Parquet format, it seems to be a thing: Error Recovery | Apache Parquet


OK, I think I just did this on my own: a custom data format that, given the data I’d like to flush to disk, is quite simple to write. I don’t believe Arrow works out for me in the end, but thanks for everyone’s replies here.
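
For anyone curious, the idea is roughly the sketch below: a minimal illustration of this kind of length-prefixed append-and-flush format using the Serialization stdlib, not my exact code. The file and function names here are just placeholders.

using Serialization

# Append one record as [8-byte payload length][serialized payload], then flush,
# so a crash loses at most the trailing partial record.
function append_record(io::IO, record)
    buf = IOBuffer()
    serialize(buf, record)
    payload = take!(buf)
    write(io, UInt64(length(payload)))
    write(io, payload)
    flush(io)
end

# Read back every complete record; stop silently at a truncated tail.
function read_records(path::AbstractString)
    records = []
    sz = filesize(path)
    open(path, "r") do io
        while position(io) + 8 <= sz
            len = Int(read(io, UInt64))
            position(io) + len <= sz || break   # incomplete payload from a crash
            push!(records, deserialize(IOBuffer(read(io, len))))
        end
    end
    return records
end

io = open("rows.bin", "a")
append_record(io, (field=[1.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0]))
append_record(io, (field=[3.0, 2.0], temp=[1.0, 1.0], energy=[-0.0, -0.0]))
close(io)
read_records("rows.bin")   # complete records come back even if the program crashed before close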

@quinnj I think it’s pretty important to support incremental writes?