Will the new DataFrames be memory mapped?

In Julia, large arrays can be memory mapped with the function Mmap.mmap(type::Type{Array{T, N}}, dims).
But in geno-transcriptomics I have to work with large DataFrames (I need to keep row names, column names, and often NA values). These DataFrames cannot be analyzed slice by slice; they have to be processed as a whole, for example to do quantile normalization (https://en.wikipedia.org/wiki/Quantile_normalization).
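To make the whole-table requirement concrete, here is a minimal quantile-normalization sketch on a plain matrix (hypothetical data, tie handling ignored; a real table would also carry row/column names and NAs):

```julia
# Toy 4x3 matrix standing in for an expression table.
X = [5.0 4.0 3.0;
     2.0 1.0 4.0;
     3.0 4.0 6.0;
     4.0 2.0 8.0]

sorted = sort(X, dims=1)                           # sort each column independently
rowmeans = vec(sum(sorted, dims=2)) ./ size(X, 2)  # mean of each sorted row
ranks = mapslices(sortperm, X, dims=1)             # rank positions per column

Xnorm = similar(X)
for j in 1:size(X, 2)
    Xnorm[ranks[:, j], j] = rowmeans               # k-th smallest entry gets rowmeans[k]
end
```

Note that the sorted copy, the rank table, and the result all have to exist at once, which is exactly why the intermediates add up in RAM.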

Sometimes one DataFrame fits in RAM, but the calculations need to store several tables with intermediate results (the original table + the ranked table + the table sorted in several ways…) to be combined together.

So I am interested to know whether you have a good strategy for processing large DataFrames, or whether memory-mapped DataFrames might become possible in the future?

Thanks for any comments!


There are no such plans currently AFAIK. An alternative approach would be to keep the data in a database and process it on the fly, for example using the OnlineStats and DataStreams packages.


I’m not sure using a database will work for this use case as many databases don’t support these kinds of quantile operations, which generally need O(n_rows) memory to perform a computation without introducing probabilistic approximations.

Thank you very much for your comments!

What about distributed arrays backed by some persistent storage like HDF5?

I have no experience with this strategy; if you know of a link to a tutorial, I am interested. But remember that I have to deal with row names, column names, and NAs, which is generally not possible with plain arrays of Floats… Thanks!

One option is to use mmapped arrays directly as columns. You’ll have to do some work manually. To handle NAs you could use NaNs as NAs with the NaNMath package, or write your own functions that treat NaNs as NAs.
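As a minimal sketch of the NaN-as-NA approach, using only the standard library (the `nanmean` helper here is a hypothetical stand-in for what NaNMath provides):

```julia
# Treat NaN as the NA marker in a plain Float64 column.
col = [1.0, NaN, 3.0]

# Hypothetical helper that skips NaN entries before averaging.
nanmean(v) = (w = filter(!isnan, v); sum(w) / length(w))

m = nanmean(col)   # mean over the non-NA entries only
```

Ordinary reductions like `sum(col)` would propagate the NaN, so every statistic you need has to get a NaN-aware variant.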

Another option is to use NullableArrays backed by mmapped arrays. This may also work with DataArrays, but I haven’t tried it.

In either of these cases, you’ll need to watch for copying and try to do as much in place as possible. Here’s how to set up one of each:

```julia
using Mmap, DataFrames, NullableArrays

N = 100
d = DataFrame(Any[ Mmap.mmap(Array{Float64,1}, (N,)),
                   NullableArray(Mmap.mmap(Array{Float64,1}, (N,))) ],
              [:array_col, :nullable_col])
```
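For intermediate results that should live on disk rather than in anonymous memory, a column can also be backed by a scratch file. A minimal sketch using only the standard library (the temporary file path is created just for the example):

```julia
using Mmap

path = tempname()                       # scratch file for the example
io = open(path, "w+")
N = 100
a = Mmap.mmap(io, Vector{Float64}, N)   # file-backed column; the file is grown to fit
a .= 1.0:N                              # fill in place; the data lives in the file
Mmap.sync!(a)                           # flush dirty pages to disk
```

Since the pages are backed by the file, the OS can evict them under memory pressure instead of keeping the whole intermediate table resident.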

Thank you for this interesting solution!

With AbstractTables.jl defining the interface, you should be able to create your own types which implement the same table interface as DataFrames but work differently in the backend. (I believe that’s how it is given what I’ve read about it as an outsider, @johnmyleswhite could probably expand or correct this.)

This will be a pretty useful feature when applying Julia to different fields, and when dealing with large data, of course.
For now we mostly use CSV.read |> DataFrame to load big data. This costs unnecessary memory and makes it hard to deploy online (e.g. web apps, online service APIs, or processing video data streams).
I’m using swap/Optane to avoid running out of memory, but I’m still hoping there will be a memory-mapped db/df better than going through SQL.

I don’t have anything memory mapped, but JDF.jl might work for your use case. You can save a DataFrame as JDF, and afterwards you can load it column by column.

You can also use it like a DataFrame directly from disk, e.g.

```julia
using JDF
df = JDFFile("path/to/your/jdf_file.jdf")
df[!, :col1]
```

A JDFFile is Tables.jl-compatible and column-accessible in the latest JDF.jl release, which is 0.2.8.


If you do a lot of out-of-memory work, Spark can be useful. I’ve been using it from Python, and I noticed a Julia package is available, but I’ve not used it, so I’m not sure about its maturity.

The PySpark bindings are good, as far as I know.