Will the new DataFrames be memory mapped?



In Julia, large Arrays can be memory mapped using the function Mmap.mmap(type::Type{Array{T, N}}, dims).
But I have to work with large DataFrames (I have to keep column and row names, and there are often NA values) in geno-transcriptomics. The DataFrame cannot be analyzed in slices and has to be processed as a whole, for example to do quantile normalization (https://en.wikipedia.org/wiki/Quantile_normalization).

Sometimes one DataFrame fits in RAM, but the calculations need to store several tables of intermediate results (the original table + ranked table + table sorted in several ways…) that are then combined.
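
To make the memory problem concrete, here is a minimal sketch of quantile normalization on a plain in-RAM matrix, assuming Julia 1.x, no NA values and no ties. The name `quantile_normalize!` is made up for this illustration. Besides the table itself it holds one sort permutation per column (as large as the table again) and one reference vector.

```julia
using Statistics  # for mean

# Hypothetical helper: quantile-normalize the columns of X in place.
# Assumes no missing values and no ties within a column.
function quantile_normalize!(X::AbstractMatrix{Float64})
    nrows, ncols = size(X)
    # Sort order of each column (the "ranked table" intermediate).
    perms = [sortperm(view(X, :, j)) for j in 1:ncols]
    # ref[k] = mean of the k-th smallest values across all columns.
    ref = [mean(X[perms[j][k], j] for j in 1:ncols) for k in 1:nrows]
    for j in 1:ncols
        # The k-th smallest entry of column j is replaced by ref[k].
        X[perms[j], j] = ref
    end
    return X
end

X = [5.0 4.0 3.0;
     2.0 1.0 4.0;
     3.0 4.5 6.0;
     4.0 2.0 8.0]
quantile_normalize!(X)
```

After the call every column contains the same set of values, only in a different order.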

So I would like to know whether you know a good strategy for processing large DataFrames, or whether it will be possible to use memory mapped DataFrames in the future?

Thanks for your comments!


There are no such plans currently AFAIK. An alternative approach would be to keep the data in a database and process it on the fly, for example using the OnlineStats and DataStreams packages.


I’m not sure using a database will work for this use case as many databases don’t support these kinds of quantile operations, which generally need O(n_rows) memory to perform a computation without introducing probabilistic approximations.


Thank you very much for your comments!


What about distributed arrays backed by some persistent storage like HDF5?


I have no experience with this strategy; if you know of a tutorial, I would be interested in a link. But remember that I have to deal with row names, column names and NA values, which is generally not possible with arrays of Floats… Thanks!


One option is to directly use mmapped arrays as columns. You’ll have to do some work manually. To handle NA’s you could use NaN’s as NA’s with the NaNMath package, or write your own functions that treat NaN’s as NA’s.
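
As a rough illustration of the "write your own functions" route, here is a sketch of a mean that skips NaN's, assuming Julia 1.x where `Mmap` is a standard library. The name `nanmean_skip` is made up for this example and is not part of NaNMath.

```julia
using Mmap

# Hypothetical helper: mean of a column, treating NaN as NA (skipped).
# Loops over the column instead of building a filtered copy.
function nanmean_skip(x::AbstractVector{Float64})
    s = 0.0
    n = 0
    for v in x
        if !isnan(v)
            s += v
            n += 1
        end
    end
    return n == 0 ? NaN : s / n   # NaN if every entry is NA
end

# Anonymous memory-mapped column, filled in place.
col = Mmap.mmap(Vector{Float64}, (5,))
col .= [1.0, NaN, 3.0, NaN, 5.0]
m = nanmean_skip(col)
```

Because the loop never materializes a filtered vector, it does not need a second copy of the column.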

Another option is to use NullableArrays backed by mmapped arrays. This also may work with DataArrays, but I didn’t try it.

In either of these cases, you’ll need to watch for copying and try to do as much in place as possible. Here’s how to set up one of each:

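using DataFrames, NullableArrays

N = 100
# Column 1: a plain anonymous mmapped array.
# Column 2: another mmapped array wrapped in a NullableArray.
d = DataFrame(Any[ Mmap.mmap(Array{Float64,1}, (N,)),
                   NullableArray( Mmap.mmap(Array{Float64,1}, (N,))) ],
              [:array_col, :nullable_col])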
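
On the "do as much in place as possible" point, here is a minimal sketch of a file-backed column that is filled and transformed without allocating a second array. It assumes Julia 1.x; the scratch file from `tempname()` and the log transform are only placeholders for illustration.

```julia
using Mmap

path = tempname()        # placeholder scratch file
N = 1_000
io = open(path, "w+")
# Map N Float64 values onto the file; the file is grown to fit.
col = Mmap.mmap(io, Vector{Float64}, (N,))

col .= N:-1:1            # fill in place (values N, N-1, ..., 1)
col .= log2.(col .+ 1)   # fused broadcast writes straight into the mapping

Mmap.sync!(col)          # flush the changes to disk
close(io)
```

Writing `col = log2.(col .+ 1)` instead would allocate a fresh ordinary array and leave the mapping untouched, which is exactly the kind of copy to watch for. If you wrap such a column in a DataFrame on a recent DataFrames.jl, I believe you also need `copycols=false` so that the constructor does not copy it into ordinary memory.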


Thank you for this interesting solution!


With AbstractTables.jl defining the interface, you should be able to create your own types which implement the same table interface as DataFrames but have different workings in the backend. (I believe that’s how it is given what I’ve read about it as an outsider, @johnmyleswhite could probably expand or correct this.)