Larger-than-memory table format with lazy reads?

Let’s say I have some storage holding table-like data. For example, a single large binary file containing a serialized array of StoredRow structures:

struct StoredRow

I also have a function that reads a chunk of data from storage, given a range of row indices, and converts it into an array of StoredRow structures:
function readfromstorage(storage, range::UnitRange{Int64})::Array{StoredRow}
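For concreteness, here is a minimal sketch of what such a reader might look like. It assumes the storage is a seekable IO stream of densely packed records with no header, and that StoredRow is an isbits struct with two Float64 fields and one Int64 field (consistent with the constructor call further down). The field names x, y and id are hypothetical placeholders, not taken from the real format.

```julia
# Hypothetical row layout: the field names are assumptions for illustration only.
struct StoredRow
    x::Float64
    y::Float64
    id::Int64
end

# Read the rows in `range` from a seekable IO that holds densely packed
# StoredRow records (assumed: no header, native byte order).
function readfromstorage(storage::IO, range::UnitRange{Int64})::Vector{StoredRow}
    rows = Vector{StoredRow}(undef, length(range))
    # Jump to the byte offset of the first requested row.
    seek(storage, (first(range) - 1) * sizeof(StoredRow))
    # Raw byte copy into the vector; valid because StoredRow is an isbits type.
    read!(storage, rows)
    return rows
end

# Usage: write 100 rows to an in-memory buffer, then read rows 11:20 back.
io = IOBuffer()
write(io, [StoredRow(rand(), rand(), i) for i in 1:100])
chunk = readfromstorage(io, 11:20)
```

The same function should work on a file opened with open(path, "r"), since it only relies on seek and read!.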

If it were a small table, I could convert it into a table (DataFrame, StructArray, or IndexedTable) and work with it directly:

# some small array:
vec = [StoredRow(rand(),rand(),i) for i = 1:100]

using DataFrames
df = DataFrame(vec)

using StructArrays, JuliaDB
s = StructArray(vec)
t = table(fieldarrays(s))

But how can I create a similar table object if the stored data does not fit in memory?
Can I simply attach a data source (or some other abstract interface type) to a table object, so that it lazily reads data from the source in chunks (and maybe caches the most recently read results)?

This is a canonical use-case for JuliaDB.


Could you please provide a minimal example of lazy loading with JuliaDB?
As far as I can see from its docs, there is a loadtable only for CSV files, and there are no abstract data source interfaces. There is also chunking based on distributed workers, but how does that help when a single worker operates on a table that does not fit into its RAM?
Or am I missing something?

I don’t know much about JuliaDB, sorry, only that this is one of its goals. Hopefully someone else can chime in and say whether or not I’m right about that.

Seems like there is a need for this. I am not sure how actively maintained JuliaDB.jl is.

I am building something.

But perhaps DiskArrays.jl with DataFrames.jl will already work.

Thanks for pointing me to DiskArrays.jl! As far as I can see, it provides an abstract interface for chunked arrays, so I can implement it on top of my own storage format. Then, in theory, I can build a DataFrame or IndexedTable from columns of custom array types.
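To illustrate the "column of a custom array type" idea, here is a rough sketch in plain Julia. It deliberately does not use the DiskArrays.jl interface; it only shows the underlying pattern of a read-only AbstractVector that fetches a chunk on demand and caches the last one. The LazyColumn type and its readchunk callback are hypothetical names introduced for this example.

```julia
# A read-only vector that fetches fixed-size chunks on demand and keeps
# only the most recently read chunk in memory.
mutable struct LazyColumn{T,F} <: AbstractVector{T}
    readchunk::F                  # hypothetical callback: UnitRange{Int} -> Vector{T}
    len::Int                      # total number of rows in storage
    chunksize::Int
    cached_range::UnitRange{Int}  # rows currently held in `cache`
    cache::Vector{T}
end

# Start with an empty cache (the range 1:0 contains no index).
LazyColumn{T}(readchunk, len; chunksize=1024) where {T} =
    LazyColumn{T,typeof(readchunk)}(readchunk, len, chunksize, 1:0, T[])

Base.size(c::LazyColumn) = (c.len,)
Base.IndexStyle(::Type{<:LazyColumn}) = IndexLinear()

function Base.getindex(c::LazyColumn, i::Int)
    @boundscheck checkbounds(c, i)
    if !(i in c.cached_range)
        # Cache miss: load the aligned chunk that contains row i.
        start = ((i - 1) ÷ c.chunksize) * c.chunksize + 1
        c.cached_range = start:min(start + c.chunksize - 1, c.len)
        c.cache = c.readchunk(c.cached_range)
    end
    return c.cache[i - first(c.cached_range) + 1]
end

# Usage: an in-memory vector stands in for the data on disk.
backing = collect(1.0:10_000.0)
col = LazyColumn{Float64}(r -> backing[r], length(backing); chunksize=1000)
col[1500]    # loads rows 1001:2000 into the cache
sum(col)     # walks the column chunk by chunk
```

With a single cached chunk, sequential scans are cheap but random access re-reads a chunk on almost every lookup, so a real implementation would probably want a larger cache. Note also that a table package may copy such a column into memory when the table is built. DataFrame, for example, copies columns by default unless copycols=false is passed, so it is worth checking that the column actually stays lazy.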