Avro.jl: Problems appending and reading one record at a time

I want to serially write a sequence of Dict{String,Int64} objects to disk in Avro format as they are generated, so I need to be able to close the file and then append to it later. Avro (if I understand correctly) seems like a good choice for this.
I am using Avro.jl

I can’t seem to get the regular Avro.read to read more than the first record I write with Avro.write.

In the end I got this working by using Avro.readvalue, which I found by reading the source code:

buff = Base.read(io)   # reads the whole file into memory
pos = 1                # byte position of the next record to read
obj, pos = Avro.readvalue(Avro.Binary(), schema, JuliaType, buff,
                          pos, length(buff), false)

The problem is that my files are quite large, around 5-8 GB. I really don't want to read everything into memory into buff.
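One idea I have been toying with (not sure if it is sound) is to memory-map the file instead of reading it all into buff, and then loop the same readvalue call over the byte vector. This is only a sketch: readvalue is an internal function, so I am assuming its signature stays as above and that the second return value is the position of the next unread byte; path is just a placeholder for my file name.

using Avro, Mmap

open(path, "r") do io
    buff = Mmap.mmap(io)   # byte view of the file; pages are loaded lazily
    pos = 1
    while pos <= length(buff)
        # assuming readvalue returns (value, next byte position)
        obj, pos = Avro.readvalue(Avro.Binary(), schema, JuliaType, buff,
                                  pos, length(buff), false)
        # process obj (one Dict{String,Int64}) here
    end
end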

I can’t get Avro.writetable working either without passing through DataFrames, and that seems crazy. Also, I still have problems appending records.

Can I get Avro.read to read just the first Avro record, and then read the next one the next time it is called?

This works great in Python. I am relatively new to Julia.

Thanks in advance.


I think @quinnj may be the relevant expert here.

I wonder if Arrow Stream might also be a good choice.

Thanks. I think I am switching to writing JSON objects line by line and compressing the whole file with a gzip stream. But now I am having issues with running tuples through JSON3.
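In case it helps anyone later, this is roughly what I am doing now (a sketch; the file name and record are made up, and I am assuming each append session writes a new gzip member, which the decompressor then reads back-to-back):

using JSON3, CodecZlib

# Append one JSON object per line; each session adds a new gzip member.
open("data.jsonl.gz", "a") do io
    gz = GzipCompressorStream(io)
    JSON3.write(gz, Dict("a" => 1, "b" => 2))
    write(gz, '\n')
    close(gz)   # finalizes this gzip member
end

# Read back one record at a time without loading the whole file.
open("data.jsonl.gz", "r") do io
    for line in eachline(GzipDecompressorStream(io))
        rec = JSON3.read(line, Dict{String, Int64})
        # process rec here
    end
end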

It sounds like Avro can be a good data format for your use case here. Currently, Avro.jl only supports writing multiple blocks, as you suggested, via the Avro.writetable/Avro.readtable interfaces. I’m not sure how you were writing the file yourself, but I wouldn’t expect Avro.readvalue to then work for reading an arbitrary number of records. The Avro format has an explicit “block” construct to allow arbitrary appending of record blocks in files, but writing blocks yourself isn’t really supported.

One thing we could probably do is provide a more transactional way of writing, where you have explicit “start writing”, “append N blocks of records”, and “close writing” functions. If that would be useful, feel free to open an issue and I can look into it. We already have something similar for Arrow.write (allowing you to append a record batch to an existing file).
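For reference, the Arrow version of that append workflow looks roughly like this (untested sketch; I’m assuming a recent Arrow.jl where writing to an IO produces the appendable stream format and Arrow.append is available):

using Arrow, Tables

# Write the first record batch in the IPC stream format
# (the stream format, not the file format, is the one that supports appending).
open("data.arrows", "w") do io
    Arrow.write(io, [(a = 1, b = 2)])
end

# Later, possibly in another session, append another record batch:
Arrow.append("data.arrows", [(a = 3, b = 4)])

# Read back lazily, one record batch at a time:
for batch in Arrow.Stream("data.arrows")
    for row in Tables.rows(batch)
        # process each row here
    end
end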

Alternatively, Avro.writetable does support the Tables.partitions interface on inputs, so you could restructure your writing so the dicts are generated via Tables.partitioner, something like:

Avro.writetable(
    "data.avro",
    Tables.partitioner(1:N) do i
        # generate dict here
        # but need to return a valid "table"; simplest is a Vector of NamedTuples
    end
)

This means that each “partition” (generated dict) will be written out to the file, one at a time.

You can then read the data back with tbl = Avro.readtable("data.avro"), which uses lazy reading, so even larger-than-RAM files should be fine.
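Putting that together for your Dict{String,Int64} case, a sketch might look like this (make_dict stands in for however you generate each dict, and every dict needs the same set of keys so all partitions share one schema):

using Avro, Tables

make_dict(i) = Dict("a" => i, "b" => 2i)   # stand-in for your generator

Avro.writetable(
    "data.avro",
    Tables.partitioner(1:10) do i
        d = make_dict(i)
        ks = sort!(collect(keys(d)))   # fix a consistent column order
        # a one-row "table": a Vector containing a single NamedTuple
        [NamedTuple{Tuple(Symbol.(ks))}(Tuple(d[k] for k in ks))]
    end
)

tbl = Avro.readtable("data.avro")
for row in Tables.rows(tbl)
    # each row corresponds to one generated dict
end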

Feel free to open a JSON3.jl issue or post here with details; happy to help brainstorm the best approach here too.

For @quinnj and future readers following along, that JSON3 discussion is in