I have streaming structured data that I’d like to write to disk either row-by-row or with a small amount of buffering.
Is JuliaDB appropriate for this? I will ultimately want to do some data manipulation on the resulting tables using JuliaDB, so it would be nice if I could keep everything in one format. After a brief search, I couldn't find a way to append to an on-disk table, only to rewrite the entire table, which would be extremely inefficient (incoming data will arrive at around 1-2 MB/s).
I refactored the JuliaDB load code a bit (https://github.com/JuliaData/JuliaDB.jl/pull/365) to

- read a chunk into an `nd::NDSparse`, and then
- `merge!` it with a `dnd::JuliaDB.DNDSparse`:

```julia
merge!(dnd, nd; output=x.output)
```
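For context, here is a minimal sketch of how such a streaming loop might look on top of that PR. The `next_batch`/`has_more_data` helpers, the column names, and the `outdir` directory are all made up for illustration, and the `output=` keyword on `merge!` comes from PR 365 rather than stock JuliaDB:

```julia
using JuliaDB

# Hypothetical stream source: each batch is a NamedTuple of three Int64 column vectors.
outdir = "streamed_chunks"          # assumed on-disk chunk directory

# First batch initializes the on-disk distributed table.
batch = next_batch()
nd = ndsparse((t = batch.t,), (x = batch.x, y = batch.y))
save(distribute(nd, 1), outdir)
dnd = load(outdir)

# Each further batch is merged in, writing new chunk files instead of
# rewriting the whole table (relies on the output= keyword from the PR).
while has_more_data()
    batch = next_batch()
    nd = ndsparse((t = batch.t,), (x = batch.x, y = batch.y))
    merge!(dnd, nd; output = outdir)
end
```

The `save`/`load` round-trip at the top is just one way to get a `DNDSparse` handle pointing at the on-disk chunks; the PR itself may set this up differently.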
Performance is really good on my SSD laptop, writing at least about 1-2 million rows/sec with 3 Int64 columns.
If the chunks are small, many chunk files get created, which I then consolidate with `rechunk!` in a finalizing compression pass.
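A rough sketch of that finalizing step, assuming `rechunk!` takes the distributed table as its argument (the actual interface lives in the PR and may differ):

```julia
# Consolidate the many small on-disk chunks once streaming has finished.
# The exact rechunk! signature is assumed from PR 365; adjust as needed.
dnd = load(outdir)   # outdir as in the sketch above
rechunk!(dnd)
```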
Also, PR https://github.com/JuliaData/JuliaDB.jl/pull/288 discusses a CSV-based approach.
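As an illustration of what a CSV-based path could look like (this is not what that PR implements; the file name and batch below are made up): append each incoming batch to a CSV file with CSV.jl, then load the result into JuliaDB afterwards.

```julia
using CSV, JuliaDB

# Hypothetical batch: a NamedTuple of three Int64 column vectors.
batch = (t = [1, 2, 3], x = [10, 20, 30], y = [7, 8, 9])

file = "stream.csv"                            # made-up path
CSV.write(file, batch; append = isfile(file))  # write the header only on the first batch

# Once streaming is done, load the accumulated CSV into a JuliaDB table.
t = loadtable(file)
```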