Avro.jl: Problems appending and reading one record at a time

I want to serially write a sequence of Dict{String,Int64} objects to disk in Avro format as they are generated, so I need to be able to close the file and then append to it later. Avro (if I understand correctly) seems like a good choice for this.
I am using Avro.jl

I can’t seem to get the regular Avro.read to read more than the first record I write with Avro.write.

In the end I got this working by using Avro.readvalue, which I found by reading the source code:

buff = Base.read(io)   # reads the whole file into memory
pos = 1                # byte position of the next record to read
obj, pos = Avro.readvalue(Avro.Binary(), schema, JuliaType, buff,
                          pos, length(buff), false)

The problem is that my files are quite large, around 5-8 GB. I really don't want to read everything into memory into buff.
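One idea I have been toying with (not sure if it is sound) is to memory-map the file instead of reading it all into buff, and then loop the same readvalue call over the byte vector. This is only a sketch: readvalue is an internal function, so I am assuming its signature stays as above and that the second return value is the position of the next unread byte; path is just a placeholder for my file name.

using Avro, Mmap

open(path, "r") do io
    buff = Mmap.mmap(io)   # byte view of the file; pages are loaded lazily
    pos = 1
    while pos <= length(buff)
        # assuming readvalue returns (value, next byte position)
        obj, pos = Avro.readvalue(Avro.Binary(), schema, JuliaType, buff,
                                  pos, length(buff), false)
        # process obj (one Dict{String,Int64}) here
    end
end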

I can’t get Avro.writetable working either without passing through DataFrames, and that seems crazy. Also, I still have problems appending records.

Can I get Avro.read to read just the first Avro record, and then read the next one the next time it is called?

This works great in Python. I am relatively new to Julia.

Thanks in advance.


I think @quinnj may be the relevant expert here.

I wonder if Arrow Stream might also be a good choice.

Thanks. I think I am switching to writing JSON objects line by line and compressing the whole file with a gzip stream. But now I am having issues with running tuples through JSON3.
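In case it helps anyone later, this is roughly what I am doing now (a sketch; the file name and record are made up, and I am assuming each append session writes a new gzip member, which the decompressor then reads back-to-back):

using JSON3, CodecZlib

# Append one JSON object per line; each session adds a new gzip member.
open("data.jsonl.gz", "a") do io
    gz = GzipCompressorStream(io)
    JSON3.write(gz, Dict("a" => 1, "b" => 2))
    write(gz, '\n')
    close(gz)   # finalizes this gzip member
end

# Read back one record at a time without loading the whole file.
open("data.jsonl.gz", "r") do io
    for line in eachline(GzipDecompressorStream(io))
        rec = JSON3.read(line, Dict{String, Int64})
        # process rec here
    end
end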

It sounds like Avro can be a good data format for your use case here. Currently, Avro.jl only supports writing multiple blocks, as you suggested, via the Avro.writetable/Avro.readtable interfaces. I’m not sure how you were writing the file yourself, but I wouldn’t expect Avro.readvalue to then work for reading an arbitrary number of records. The Avro format has an explicit “block” construct to allow arbitrary appending of record blocks in files, but writing blocks yourself isn’t really supported.

One thing we could probably do is provide a more transactional way of writing, where you have explicit “start writing”, “append N blocks of records”, and “close writing” functions. If that would be useful, feel free to open an issue and I can look into it. We already have something similar for Arrow.write (allowing you to append a record batch to an existing file).
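For reference, the Arrow version of that append workflow looks roughly like this (untested sketch; I’m assuming a recent Arrow.jl where writing to an IO produces the appendable stream format and Arrow.append is available):

using Arrow, Tables

# Write the first record batch in the IPC stream format
# (the stream format, not the file format, is the one that supports appending).
open("data.arrows", "w") do io
    Arrow.write(io, [(a = 1, b = 2)])
end

# Later, possibly in another session, append another record batch:
Arrow.append("data.arrows", [(a = 3, b = 4)])

# Read back lazily, one record batch at a time:
for batch in Arrow.Stream("data.arrows")
    for row in Tables.rows(batch)
        # process each row here
    end
end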

Alternatively, Avro.writetable does support the Tables.partitions interface on inputs, so you could restructure your writing so the dicts are generated via Tables.partitioner, something like:

Avro.writetable(
    "data.avro",
    Tables.partitioner(1:N) do i
        # generate dict here
        # but need to return a valid "table"; simplest is a Vector of NamedTuples
    end
)

This means that each “partition” (generated dict) will be written out to the file, one at a time.

You can then read the data back with tbl = Avro.readtable("data.avro"), which uses lazy reading, so even larger-than-RAM files should be fine.
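Putting that together for your Dict{String,Int64} case, a sketch might look like this (make_dict stands in for however you generate each dict, and every dict needs the same set of keys so all partitions share one schema):

using Avro, Tables

make_dict(i) = Dict("a" => i, "b" => 2i)   # stand-in for your generator

Avro.writetable(
    "data.avro",
    Tables.partitioner(1:10) do i
        d = make_dict(i)
        ks = sort!(collect(keys(d)))   # fix a consistent column order
        # a one-row "table": a Vector containing a single NamedTuple
        [NamedTuple{Tuple(Symbol.(ks))}(Tuple(d[k] for k in ks))]
    end
)

tbl = Avro.readtable("data.avro")
for row in Tables.rows(tbl)
    # each row corresponds to one generated dict
end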

Feel free to open a JSON3.jl issue or post here with details; happy to help brainstorm the best approach here too.

For @quinnj and future readers following along, that JSON3 discussion is in