Let’s say I have some storage containing table-like data. For example, it is a single big binary file with a serialized array of StoredRow structures:
```julia
struct StoredRow
    a::Float64
    b::Float64
    c::UInt16
end
```
I also have a function that reads a chunk of data from the storage, given a range of row indices, and converts it into an array of StoredRow structures:

```julia
function readfromstorage(storage, range::UnitRange{Int64})::Vector{StoredRow}
```
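For concreteness, here is a minimal sketch of what such a function could look like, assuming the rows are stored back-to-back in the file with Julia's in-memory layout (so `sizeof(StoredRow)`, padding included, is the record size). The file layout and the `io`-based signature are my assumptions, not part of the question:

```julia
# Hypothetical sketch: rows stored back-to-back with Julia's in-memory
# layout; sizeof(StoredRow) includes the struct's trailing padding.
struct StoredRow
    a::Float64
    b::Float64
    c::UInt16
end

function readfromstorage(io::IO, range::UnitRange{Int64})::Vector{StoredRow}
    seek(io, (first(range) - 1) * sizeof(StoredRow))
    out = Vector{StoredRow}(undef, length(range))
    read!(io, out)      # bulk-read raw bytes into the isbits vector
    return out
end

# demo: write 100 rows to a temp file, then read rows 11:20 back
vec = [StoredRow(rand(), rand(), UInt16(i)) for i in 1:100]
path, io = mktemp()
write(io, vec)          # raw write works because StoredRow is isbits
close(io)
rows = open(f -> readfromstorage(f, 11:20), path)
```

Since `StoredRow` is an isbits type, `read!` and `write` can move whole chunks as raw bytes, so a chunk read is a single `seek` plus one bulk read.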
If the table were small, I could convert it into a table (DataFrame, StructArray, or IndexedTable) and work with it directly:
```julia
# some small array:
vec = [StoredRow(rand(), rand(), i) for i = 1:100]

using DataFrames
df = DataFrame(vec)

using StructArrays, JuliaDB
s = StructArray(vec)
t = table(fieldarrays(s))
```
But how can I create a similar table object if the storage data does not fit in memory?
Can I simply attach a data source (or some other abstract interface type) to a table object, so that it lazily reads data from the source in chunks (and maybe caches the most recently read results)?
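To illustrate the kind of thing I have in mind, here is a rough sketch of a lazy `AbstractVector` that fetches fixed-size chunks through `readfromstorage` on demand and caches the last chunk. The `LazyRows` type, the chunking scheme, and the in-memory stand-in for the storage backend are all hypothetical:

```julia
# Hypothetical sketch of a chunk-caching lazy vector over a storage backend.
struct StoredRow
    a::Float64
    b::Float64
    c::UInt16
end

# toy "storage": an in-memory vector standing in for the real backend
readfromstorage(storage, r::UnitRange{Int64}) = storage[r]

mutable struct LazyRows{S} <: AbstractVector{StoredRow}
    storage::S
    len::Int
    chunksize::Int
    cached::UnitRange{Int64}       # row range currently held in cache
    cache::Vector{StoredRow}
end

LazyRows(storage, len; chunksize = 1024) =
    LazyRows(storage, len, chunksize, 1:0, StoredRow[])

Base.size(v::LazyRows) = (v.len,)

function Base.getindex(v::LazyRows, i::Int)
    @boundscheck checkbounds(v, i)
    if i ∉ v.cached                          # cache miss: fetch the chunk holding row i
        lo = i - (i - 1) % v.chunksize
        hi = min(lo + v.chunksize - 1, v.len)
        v.cached = lo:hi
        v.cache = readfromstorage(v.storage, v.cached)
    end
    return v.cache[i - first(v.cached) + 1]
end

# demo
data = [StoredRow(rand(), rand(), UInt16(i)) for i in 1:100]
lazy = LazyRows(data, length(data); chunksize = 16)
```

My question is essentially whether the existing table packages already support plugging in something like this, or whether converting to a `DataFrame`/`StructArray`/`table` would still materialize all rows in memory anyway.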