I’m starting to work with a large dataset that does not fit into memory: approximately 5,000 samples, 1 million features, and two time points. The features are continuous values ranging from 0 to 1. I’m trying to decide how best to store and access this data, and I’d appreciate any recommendations. My goal is to compare dimension-reduction techniques (PCA, NMF, autoencoders, random projections, etc.) and eventually build some predictive models. I’m considering:
- HDF5.jl
- Memory mapping to one large .bin file
- DuckDB.jl
I’m not sure, for example, whether DuckDB is overkill for my use case or whether HDF5 will be slower or faster than plain memory mapping.
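For the memory-mapping option, I’m picturing something like the sketch below with the standard-library Mmap module (Float32, samples × features layout, one file per time point, placeholder file name), though I haven’t settled on a layout:

```julia
using Mmap

# Assumptions for illustration: one Float32 matrix per time point,
# laid out samples × features; file name and sizes are placeholders.
nsamples, nfeatures = 5_000, 1_000_000

# Write once: mmap a writable file and fill it in chunks.
open("timepoint1.bin", "w+") do io
    X = Mmap.mmap(io, Matrix{Float32}, (nsamples, nfeatures))
    # ... fill X column-by-column or in blocks here ...
    Mmap.sync!(X)   # flush written pages to disk
end

# Read later: map the file back; nothing is loaded until it is touched.
io = open("timepoint1.bin", "r")
X = Mmap.mmap(io, Matrix{Float32}, (nsamples, nfeatures))
feature_means = sum(X; dims = 1) ./ nsamples   # pages fault in as accessed
```

At 5,000 × 1,000,000 Float32 values that is about 20 GB per time point on disk, and only the regions I actually touch end up in RAM.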
That seems small enough that you could consider using an EC2 instance with plenty of memory on AWS: 5,000 samples × 1 million features × 2 time points is on the order of 10^10 values, so roughly 40 GB in Float32 or 80 GB in Float64, which fits in RAM on a high-memory instance.
It might be challenging to find an implementation of PCA, etc., that works on larger-than-memory data.
I do have access to computing resources through my university, so that or the EC2 instance you mention are both good options.
I think there are PCA methods that use randomized SVD to handle large datasets(?). Still something I have to look into, though.
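For my own reference, the basic randomized SVD recipe (Halko, Martinsson & Tropp, 2011) only needs a couple of passes of matrix products over the data, so it should work with a memory-mapped matrix. A rough sketch using just LinearAlgebra (the rank `k` and oversampling `p` are illustrative choices, not recommendations):

```julia
using LinearAlgebra

# Sketch of randomized SVD: approximate the top-k singular triplets of X
# with two passes of matrix products, so X can be a memory-mapped matrix.
# k (target rank) and p (oversampling) are illustrative choices.
function rsvd_sketch(X::AbstractMatrix, k::Int; p::Int = 10)
    n = size(X, 2)
    Ω = randn(eltype(X), n, k + p)   # random test matrix
    Y = X * Ω                        # sample the range of X (first pass)
    Q = Matrix(qr(Y).Q)              # orthonormal basis for that range
    B = Q' * X                       # small (k+p) × n projection (second pass)
    F = svd(B)
    return (U = Q * F.U[:, 1:k], S = F.S[1:k], V = F.V[:, 1:k])
end

# For PCA one would center the features (subtract column means) first.
```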
Did you consider using Arrow.jl? From the docs:
> When “reading” the arrow data, Arrow.Table first “mmapped” the data.arrow file, which is an important technique for dealing with data larger than available RAM on a system. By “mmapping” a file, the OS doesn’t actually load the entire file contents into RAM at the same time, but file contents are “swapped” into RAM as different regions of a file are requested.
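A minimal round trip looks something like this (the tiny NamedTuple table and column names are just placeholders):

```julia
using Arrow

# Placeholder table: a NamedTuple of vectors is a valid Tables.jl source.
tbl = (; sample_id = 1:5,
         feature_1 = rand(Float32, 5),
         feature_2 = rand(Float32, 5))
Arrow.write("data.arrow", tbl)

# Arrow.Table memory-maps the file, so columns are read lazily and the
# OS pages data in only as it is accessed.
t = Arrow.Table("data.arrow")
mean_f1 = sum(t.feature_1) / length(t.feature_1)
```

With a million features you probably would not want one Arrow column per feature; grouping features into blocks (or separate files) is something to benchmark.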
Ah, this sounds similar to Mmap in Julia, but it works across multiple languages. I will have to try it out, thank you!