I’m starting to work with a large dataset that does not fit into memory: approximately 5,000 samples, 1 million features, and two time points. The features are continuous values ranging from 0 to 1. I’m trying to decide how best to store and access this data, and I’d appreciate any recommendations. My goal is to compare dimension-reduction techniques (PCA, NMF, autoencoders, random projections, etc.) and eventually build some predictive models. I’m considering:
- HDF5.jl
- Memory mapping to one large .bin file
- DuckDB.jl
I’m not sure, for example, whether DuckDB is overkill for my use case or whether HDF5 will be slower or faster than plain memory mapping.
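For the memory-mapping option, I’m picturing something like the sketch below with the standard-library Mmap module (Float32, samples × features layout, one file per time point, placeholder file name), though I haven’t settled on a layout:

```julia
using Mmap

# Assumptions for illustration: one Float32 matrix per time point,
# laid out samples × features; file name and sizes are placeholders.
nsamples, nfeatures = 5_000, 1_000_000

# Write once: mmap a writable file and fill it in chunks.
open("timepoint1.bin", "w+") do io
    X = Mmap.mmap(io, Matrix{Float32}, (nsamples, nfeatures))
    # ... fill X column-by-column or in blocks here ...
    Mmap.sync!(X)   # flush written pages to disk
end

# Read later: map the file back; nothing is loaded until it is touched.
io = open("timepoint1.bin", "r")
X = Mmap.mmap(io, Matrix{Float32}, (nsamples, nfeatures))
feature_means = sum(X; dims = 1) ./ nsamples   # pages fault in as accessed
```

At 5,000 × 1,000,000 Float32 values that is about 20 GB per time point on disk, and only the regions I actually touch end up in RAM.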
That seems small enough that you could consider using an EC2 instance with plenty of memory on AWS: 5,000 samples × 1 million features × 2 time points is on the order of 10^10 values, so roughly 40 GB in Float32 or 80 GB in Float64, which fits in RAM on a high-memory instance.
It might be challenging to find an implementation of PCA, etc., that works on larger-than-memory data.
I do have access to computing resources through my university, so that or the EC2 instance you mention are both good options.
I think there are PCA methods that use randomized SVD to handle large datasets(?). Still something I have to look into, though.
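For my own reference, the basic randomized SVD recipe (Halko, Martinsson & Tropp, 2011) only needs a couple of passes of matrix products over the data, so it should work with a memory-mapped matrix. A rough sketch using just LinearAlgebra (the rank `k` and oversampling `p` are illustrative choices, not recommendations):

```julia
using LinearAlgebra

# Sketch of randomized SVD: approximate the top-k singular triplets of X
# with two passes of matrix products, so X can be a memory-mapped matrix.
# k (target rank) and p (oversampling) are illustrative choices.
function rsvd_sketch(X::AbstractMatrix, k::Int; p::Int = 10)
    n = size(X, 2)
    Ω = randn(eltype(X), n, k + p)   # random test matrix
    Y = X * Ω                        # sample the range of X (first pass)
    Q = Matrix(qr(Y).Q)              # orthonormal basis for that range
    B = Q' * X                       # small (k+p) × n projection (second pass)
    F = svd(B)
    return (U = Q * F.U[:, 1:k], S = F.S[1:k], V = F.V[:, 1:k])
end

# For PCA one would center the features (subtract column means) first.
```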
Did you consider using Arrow.jl? From the docs:
> When “reading” the arrow data, Arrow.Table first “mmapped” the data.arrow file, which is an important technique for dealing with data larger than available RAM on a system. By “mmapping” a file, the OS doesn’t actually load the entire file contents into RAM at the same time, but file contents are “swapped” into RAM as different regions of a file are requested.
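A minimal round trip looks something like this (the tiny NamedTuple table and column names are just placeholders):

```julia
using Arrow

# Placeholder table: a NamedTuple of vectors is a valid Tables.jl source.
tbl = (; sample_id = 1:5,
         feature_1 = rand(Float32, 5),
         feature_2 = rand(Float32, 5))
Arrow.write("data.arrow", tbl)

# Arrow.Table memory-maps the file, so columns are read lazily and the
# OS pages data in only as it is accessed.
t = Arrow.Table("data.arrow")
mean_f1 = sum(t.feature_1) / length(t.feature_1)
```

With a million features you probably would not want one Arrow column per feature; grouping features into blocks (or separate files) is something to benchmark.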
Ah, this sounds similar to Mmap in Julia, but it works across multiple languages. I will have to try it out, thank you!