I need to work with medical datasets of 10 GB to 100 GB (15,000 columns).
I need to reshape all the data, converting most of the columns from wide to long format.
Later I will cleanse the data (it is full of errors and missing values), transform some columns, add new columns, remove rows, filter the data, and then run multiple imputation and several statistical analyses such as mixed-effects models and survival analysis.
How would you suggest doing this?
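To make the end goal concrete, the analyses I will eventually need are along these lines (a minimal sketch; the formulas, variable names, and the tiny simulated data frames are made up, not from my actual data):

```r
library(lme4)      # mixed-effects models
library(survival)  # survival analysis

# tiny simulated stand-ins for the real data, just so the calls run
long_data <- data.frame(
  patient_id = rep(1:50, each = 4),
  time       = rep(0:3, times = 50),
  treatment  = rep(c("A", "B"), each = 100),
  value      = rnorm(200)
)
baseline_data <- data.frame(
  time_to_event = rexp(50),
  event         = rbinom(50, 1, 0.7),
  treatment     = rep(c("A", "B"), 25),
  age           = rnorm(50, 60, 10)
)

# mixed-effects model on the long-format data (placeholder formula:
# repeated measurements nested within patients)
m_mixed <- lmer(value ~ treatment * time + (1 | patient_id), data = long_data)

# Cox proportional-hazards model for the survival part (placeholder formula)
m_cox <- coxph(Surv(time_to_event, event) ~ treatment + age, data = baseline_data)
```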
Last time, with a smaller (5 GB) dataset, I used R's data.table for the cleansing and reshaping, but I had to write my own script to process the data in chunks and append the results. It was very slow and error-prone.
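For reference, the chunked approach I used looked roughly like this (a simplified sketch; the file paths, id columns, and chunk size are placeholders):

```r
library(data.table)

infile  <- "wide.csv"                 # placeholder path to the wide file
outfile <- "long.csv"
id_cols <- c("patient_id", "visit")   # placeholder id columns kept during melt
chunk   <- 100000L                    # rows per chunk, chosen so one chunk fits in RAM

# read only the header once to get the column names
col_names <- names(fread(infile, nrows = 0L))

start <- 1L
repeat {
  # read the next block of rows, skipping the header and earlier chunks
  dt <- fread(infile, skip = start, nrows = chunk,
              header = FALSE, col.names = col_names)

  # reshape this chunk from wide to long and append it to the output file
  long <- melt(dt, id.vars = id_cols,
               variable.name = "measure", value.name = "value")
  fwrite(long, outfile, append = (start > 1L))

  # stop after the last partial chunk
  # (assumes the row count is not an exact multiple of `chunk`)
  if (nrow(dt) < chunk) break
  start <- start + chunk
}
```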
I can't buy more memory, and even if I could, sooner or later I would hit a limit anyway.
I think there are several options:
- Using a traditional or NoSQL database: they can handle large datasets, but they offer limited functionality for this kind of work, and when you feed them columns full of errors and missing values they misbehave. I do like that they can work with compressed data.
Which database would you suggest? Maybe SciDB or MonetDB? Or maybe working directly with Feather files? (A rough DBI sketch follows this list.)
- Using tools such as Julia, R, or Stata: they don't work well with datasets larger than memory. I don't think JuliaDB can reshape data out of core or fit a mixed-effects model.
- Using mmap or some other swapping strategy (such as ff). I'm on Windows and I don't know of a good, fast way to do this (see the ff sketch after this list).
- Using Spark, Flink, or Dask: I have almost no experience with them, and my impression is that they are hard to use and only cover fairly basic operations (see the sparklyr sketch after this list).
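To make the first option more concrete, this is roughly what I imagine the database route would look like from R through DBI, using MonetDBLite as an embedded engine (untested; the driver call is only how I understand MonetDBLite is used, and the table, path, and column names are placeholders):

```r
library(DBI)

# embedded MonetDB stored in a local directory (path is a placeholder)
con <- dbConnect(MonetDBLite::MonetDBLite(), "C:/meddb")

# a tiny stand-in for one already-read chunk of the real data
chunk_df <- data.frame(patient_id = 1:3, visit = 1L,
                       measure = "weight", value = c(70, NA, 82))

# append the chunk into a table inside the database
dbWriteTable(con, "measurements", chunk_df, append = TRUE)

# push the filtering to the database so only the needed rows come back to R
res <- dbGetQuery(con, "
  SELECT patient_id, visit, measure, value
  FROM   measurements
  WHERE  value IS NOT NULL
")

dbDisconnect(con)
```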
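For the mmap/swapping option, the ff route I was imagining is something like this (also untested on Windows; the column name in the filter is a placeholder):

```r
library(ff)
library(ffbase)   # adds basic data-manipulation verbs for ffdf objects

# parse the CSV in row blocks into an on-disk ffdf; only blocks are held in RAM
wide <- read.csv.ffdf(file = "wide.csv",
                      first.rows = 10000, next.rows = 100000)

# filtering is done chunk-wise without loading the whole table into memory
adults <- subset(wide, age >= 18)
```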
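And for Spark, from what I have read, a local single-machine session through sparklyr would look roughly like the sketch below; whether it can handle reshaping 15,000 columns from wide to long is exactly what I cannot tell (paths and column names are placeholders):

```r
library(sparklyr)
library(dplyr)

# local Spark instance on this machine
sc <- spark_connect(master = "local")

# register the CSV with Spark without caching it all in memory
wide_tbl <- spark_read_csv(sc, name = "wide", path = "wide.csv", memory = FALSE)

# basic verbs (filter / select / summarise) are translated to Spark SQL
wide_tbl %>%
  filter(!is.na(value_1)) %>%
  count(patient_id) %>%
  collect()

spark_disconnect(sc)
```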
What would you suggest?
My university computer has 16 GB of RAM and a 1 TB SSD.