Will the new DataFrames be memory mapped?

In Julia, large arrays can be memory mapped with the function Mmap.mmap(type::Type{Array{T, N}}, dims).
But in geno-transcriptomics I have to work with large DataFrames (I need to keep row names, column names, and often NA values). These DataFrames cannot be analyzed slice by slice; they have to be processed as a whole, for example to do quantile normalization (https://en.wikipedia.org/wiki/Quantile_normalization).
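To make the whole-table requirement concrete, here is a minimal quantile-normalization sketch on a plain matrix (hypothetical data, tie handling ignored; a real table would also carry row/column names and NAs):

```julia
# Toy 4x3 matrix standing in for an expression table.
X = [5.0 4.0 3.0;
     2.0 1.0 4.0;
     3.0 4.0 6.0;
     4.0 2.0 8.0]

sorted = sort(X, dims=1)                           # sort each column independently
rowmeans = vec(sum(sorted, dims=2)) ./ size(X, 2)  # mean of each sorted row
ranks = mapslices(sortperm, X, dims=1)             # rank positions per column

Xnorm = similar(X)
for j in 1:size(X, 2)
    Xnorm[ranks[:, j], j] = rowmeans               # k-th smallest entry gets rowmeans[k]
end
```

Note that the sorted copy, the rank table, and the result all have to exist at once, which is exactly why the intermediates add up in RAM.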

Sometimes one DataFrame fits in RAM, but the calculations need to store several tables with intermediate results (the original table + the ranked table + the table sorted in several ways…) to be combined together.

So I am interested to know whether you have a good strategy for processing large DataFrames, or whether memory-mapped DataFrames might become possible in the future?

Thanks for any comments!


There are no such plans currently AFAIK. An alternative approach would be to keep the data in a database and process it on the fly, for example using the OnlineStats and DataStreams packages.


I’m not sure using a database will work for this use case as many databases don’t support these kinds of quantile operations, which generally need O(n_rows) memory to perform a computation without introducing probabilistic approximations.

Thank you very much for your comments!

What about distributed arrays backed by some persistent storage like HDF5?

I have no experience with this strategy; if you know of a link to a tutorial, I am interested. But remember that I have to deal with row names, column names, and NAs, which is generally not possible with plain arrays of Floats… Thanks!

One option is to use mmapped arrays directly as columns. You’ll have to do some work manually. To handle NAs you could use NaNs as NAs with the NaNMath package, or write your own functions that treat NaNs as NAs.
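As a minimal sketch of the NaN-as-NA approach, using only the standard library (the `nanmean` helper here is a hypothetical stand-in for what NaNMath provides):

```julia
# Treat NaN as the NA marker in a plain Float64 column.
col = [1.0, NaN, 3.0]

# Hypothetical helper that skips NaN entries before averaging.
nanmean(v) = (w = filter(!isnan, v); sum(w) / length(w))

m = nanmean(col)   # mean over the non-NA entries only
```

Ordinary reductions like `sum(col)` would propagate the NaN, so every statistic you need has to get a NaN-aware variant.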

Another option is to use NullableArrays backed by mmapped arrays. This may also work with DataArrays, but I haven’t tried it.

In either of these cases, you’ll need to watch for copying and try to do as much in place as possible. Here’s how to set up one of each:

```julia
using Mmap, DataFrames, NullableArrays

N = 100
d = DataFrame(Any[ Mmap.mmap(Array{Float64,1}, (N,)),
                   NullableArray(Mmap.mmap(Array{Float64,1}, (N,))) ],
              [:array_col, :nullable_col])
```
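For intermediate results that should live on disk rather than in anonymous memory, a column can also be backed by a scratch file. A minimal sketch using only the standard library (the temporary file path is created just for the example):

```julia
using Mmap

path = tempname()                       # scratch file for the example
io = open(path, "w+")
N = 100
a = Mmap.mmap(io, Vector{Float64}, N)   # file-backed column; the file is grown to fit
a .= 1.0:N                              # fill in place; the data lives in the file
Mmap.sync!(a)                           # flush dirty pages to disk
```

Since the pages are backed by the file, the OS can evict them under memory pressure instead of keeping the whole intermediate table resident.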

Thank you for this interesting solution!

With AbstractTables.jl defining the interface, you should be able to create your own types which implement the same table interface as DataFrames but work differently in the backend. (I believe that’s how it is given what I’ve read about it as an outsider, @johnmyleswhite could probably expand or correct this.)

This will be a pretty useful feature when applying Julia to different fields, and when dealing with large data, of course.
For now we mostly use CSV.read |> DataFrame to load big data. This costs unnecessary memory and makes it hard to deploy online (e.g. web apps, online service APIs, or processing video data streams).
I’m using swap/Optane to avoid running out of memory, but I’m still hoping there will be a memory-mapped db/df better than going through SQL.

I don’t have anything memory mapped, but JDF.jl might work for your use case. You can save a DataFrame as JDF, and afterwards you can load it column by column.

You can also use it like a DataFrame directly from disk, e.g.

```julia
using JDF
df = JDFFile("path/to/your/jdf_file.jdf")
df[!, :col1]
```

A JDFFile is Tables.jl-compatible and column-accessible in the latest JDF.jl release, which is 0.2.8.


If you do a lot of out-of-memory work, Spark can be useful. I’ve been using it from Python, and I noticed a Julia package is available, but I’ve not used it, so I’m not sure about its maturity.

The PySpark bindings are good, as far as I know.