Using JuliaDB to create larger-than-memory datasets and work with them?

The JuliaDB docs explain how to perform some basic operations out-of-core, on data larger than memory:
http://juliadb.org/latest/manual/out-of-core.html
But it seems they only cover loading large inputs and producing small results.

How can I produce large results as well, not just consume large inputs?
I’m using Julia 1.0.2 on Windows 10.

Imagine I want to do something like this:

using DataFrames
N = 3
myDT = DataFrame(group = repeat('A':'C', outer = N), x = 1:(3*N))  # create a DataFrame
myDT.y = myDT.x .* rand(3*N)        # add a new column y
myDT[myDT.group .== 'A', :y] .= 0   # zero out y where group == 'A'

but with a much larger N, too large to fit in memory. (Or, for example, create two large matrices, multiply them, and save the result.)
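For the matrix variant, one workaround I'm aware of (outside JuliaDB) is to memory-map the result with the Mmap standard library, so the product never has to fit in RAM at once. A rough sketch with made-up sizes and a hypothetical file name:

using Mmap, LinearAlgebra

m, k, n = 100_000, 100, 100_000              # made-up sizes; C is ~80 GB
io = open("C.bin", "w+")
C = Mmap.mmap(io, Matrix{Float64}, (m, n))   # disk-backed result matrix
A = rand(m, k); B = rand(k, n)               # inputs assumed to fit in RAM
mul!(C, A, B)                                # writes the product into the mapped file
Mmap.sync!(C); close(io)

But I'd prefer a JuliaDB-level answer for the tabular case.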

How can I do this with JuliaDB for N on the order of 10^9, saving the result to disk?

I’ve tried

using JuliaDB
N=10^9
table((group = repeat('A':'C', outer = N), x = 1:(3*N)))

but it consumes all my RAM and produces the error
ERROR: OutOfMemoryError()

Folks, I know it's late, but any news on this post? :upside_down_face:

AFAICT, this is not at all a JuliaDB issue. Try just running repeat('A':'C', outer = 10^9) on a system that has only 4 GB of RAM; it will likely throw the exact same error, because that call alone allocates ~12 GB (3×10^9 Chars at 4 bytes each). The data must be fully materialized before it even gets to a JuliaDB function.
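For what it's worth, that particular repeat has a lazy equivalent that allocates nothing up front (the names here are just for illustration):

N = 10^9
groups = Iterators.take(Iterators.cycle('A':'C'), 3 * N)   # A, B, C, A, B, C, … lazily
collect(Iterators.take(groups, 6))                         # materialize only what you ask for

The problem is getting a lazy source like that into a table without collecting it first.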

Maybe the way to handle this is to make an iterator and have table support “unrolling” iterables while it writes them to disk; I’m not sure if that’s currently supported, but I doubt it’d be a hard PR.
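Until something like that exists, the unrolling can be done by hand: generate the rows lazily, write them out in pieces that each fit in RAM, and then use the out-of-core ingestion documented on the manual page linked above (loadtable with the output and chunks keywords). A sketch, with the chunk count and paths made up:

using JuliaDB

N = 10^9
nchunks = 100                  # made-up: each piece must fit in RAM
len = 3 * N ÷ nchunks          # rows per piece
mkpath("csv_chunks")
for c in 1:nchunks
    open(joinpath("csv_chunks", "chunk_$c.csv"), "w") do io
        println(io, "group,x")
        for i in ((c - 1) * len + 1):(c * len)
            println(io, 'A' + (i - 1) % 3, ",", i)   # same values as repeat('A':'C', outer = N) and 1:3N
        end
    end
end
files = [joinpath("csv_chunks", "chunk_$c.csv") for c in 1:nchunks]
t = loadtable(files; output = "bin", chunks = nchunks)   # parses into an on-disk binary store

This never holds more than one chunk's worth of data in memory at a time.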


> Maybe the way to handle this is to make an iterator and have table support “unrolling” iterables while it writes them to disk; I’m not sure if that’s currently supported, but I doubt it’d be a hard PR.

It's exactly this point about iterators that I was wondering about: has anyone already done a PR for this, or does someone have a hack to share?
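To make the goal concrete, what I'd hope to run on such an out-of-core table t is roughly the DataFrames example from the top, folded into one pass (map and save are existing JuliaDB functions; whether they stream well at this scale is exactly my question):

t2 = map(r -> (group = r.group, x = r.x,
               y = r.group == 'A' ? 0.0 : r.x * rand()), t)   # add y, zeroed where group == 'A'
save(t2, "result_bin")                                        # keep the equally large result on disk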