I want to serially write a sequence of Dict{String,Int64} objects to disk in Avro as they are generated. This means I need to be able to close and then append to the file, and Avro (if I understand it correctly) seems like a good choice for that.
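Roughly what I'm doing, as a sketch (the file name and dict contents are made up):

    using Avro

    # Write one record as it is generated: open in append mode,
    # write, close, then repeat later for the next record.
    d = Dict{String,Int64}("a" => 1)
    io = open("data.avro", "a")
    Avro.write(io, d)
    close(io)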
I am using Avro.jl
I can’t seem to get the regular Avro.read to read more than the first record I write with Avro.write.
In the end I got this working by using Avro.readvalue, which I found by reading the source code.
Thanks. I think I am switching to writing JSON objects line by line and compressing the whole file using a gzip stream. But now I am having issues with running tuples through JSON3.
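For reference, here's a minimal sketch of what I mean, assuming CodecZlib for the gzip stream (file name and data are made up). The tuple issue is that JSON3 serializes tuples as JSON arrays, so reading them back needs an explicit target type:

    using JSON3, CodecZlib

    # Write newline-delimited JSON through a gzip stream.
    io = GzipCompressorStream(open("data.jsonl.gz", "w"))
    for d in (Dict("a" => 1), Dict("b" => 2))
        JSON3.write(io, d)
        write(io, '\n')
    end
    close(io)  # closing the codec stream also closes the underlying file

    # Read it back line by line.
    io = GzipDecompressorStream(open("data.jsonl.gz"))
    for line in eachline(io)
        d = JSON3.read(line, Dict{String,Int64})
    end
    close(io)

    # Tuples round-trip as JSON arrays, so a concrete type is needed on read:
    s = JSON3.write((1, 2))            # "[1,2]"
    t = JSON3.read(s, Tuple{Int,Int})  # (1, 2)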
It sounds like avro can be a good data format for your use case here. Currently, Avro.jl only supports writing multiple blocks like you suggested via the Avro.writetable/Avro.readtable interfaces. I’m not sure how you were writing to the file yourself, but I wouldn’t expect Avro.readvalue to work for reading an arbitrary number of records. The avro format has an explicit “block” construct to allow arbitrary appending of record blocks to files, but writing blocks yourself isn’t really supported.
One thing we could probably do is provide a more transactional way of writing, with explicit “start writing”, “append N blocks of records”, and “close writing” functions. If that would be useful, feel free to open an issue and I can look into it. We already have something similar for Arrow.write (allowing you to append a record batch to an existing file).
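To illustrate, such an API might look something like this (purely hypothetical; none of these function names exist in Avro.jl today):

    # Hypothetical sketch of the proposed transactional API.
    writer = Avro.opentable("data.avro", schema)  # "start writing" (hypothetical)
    for recs in record_batches
        Avro.appendblock!(writer, recs)           # "append N blocks" (hypothetical)
    end
    close(writer)                                 # "close writing"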
Alternatively, Avro.writetable does support the Tables.partitions interface on inputs, so you could restructure your writing so the dicts are generated via Tables.partitioner, something like:
Avro.writetable(
    "data.avro",
    Tables.partitioner(1:N) do i
        # generate the dict for partition i here, but return a valid "table";
        # simplest is a Vector{NamedTuple}, e.g.:
        [(key = "k$i", value = i)]
    end
)
This means that each “partition” (generated dict) will be written out to the file, one at a time.
You can then read the data back in by doing tbl = Avro.readtable("data.avro"), which reads lazily, so even larger-than-RAM files should be fine.
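For example, one way to consume it lazily (a sketch, assuming the returned table supports the standard Tables.jl access patterns):

    using Avro, Tables

    tbl = Avro.readtable("data.avro")   # lazy: blocks are decoded on demand
    for part in Tables.partitions(tbl)  # roughly one partition per written block
        for row in Tables.rows(part)
            # process each record without materializing the whole file
        end
    end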