What software/packages/databases should I use to work with 100 GB datasets?

I need to work with medical datasets of 10 GB to 100 GB (15,000 columns).
I need to reshape all the data, converting most of the columns from wide to long format.
Later I will cleanse the data (it is full of errors and missing values), transform some columns, add new columns, remove rows, filter the data, and then do multiple imputation and several statistical analyses such as mixed-effects models and survival analysis.
How do you suggest I do this?

Last time, with a smaller (5 GB) dataset, I used R's data.table for the cleansing and reshaping, but I had to write my own script to process it in chunks and append the results. It was very slow and error-prone.

I can't buy more memory, and in any case I would hit a limit sooner or later.

I think there are several options:

  • Use a traditional SQL or NoSQL database: they can deal with large datasets, but they offer limited functionality, and when you feed them columns full of errors and missing values they go crazy. I do like that they can work with compressed data.
    What database would you suggest? Maybe SciDB or MonetDB? Or work directly with Feather?
  • Use tools such as Julia, R, Stata…: they don't work well with large datasets. I think JuliaDB is not able to reshape data out-of-core or run a mixed-effects model.
  • Use mmap or another swapping strategy (such as ff). I'm on Windows and I don't know of a good, fast way to do it.
  • Use Spark, Flink, or Dask: I have almost no experience with them; I think they are difficult to use and can only do some basic operations.

What would you suggest?
My university computer has 16 GB of RAM and a 1 TB SSD.

I did something like that in R once and I feel for you. It truly was a pain. If you are looking to do some local disk-mapped work plus some database-level operations, Arrow may be a good bet. I have yet to test it (I'm waiting for the right project), but it looks really promising and may offer an easier way of grabbing chunks of data than HDF5. There's also OnlineStats.jl for the analysis side of things.
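To give an idea of the OnlineStats.jl side, here is a minimal sketch of accumulating statistics one chunk at a time; the chunk generator below is just a stand-in for however you end up streaming the real data:

using OnlineStats

# Statistics that are updated incrementally, chunk by chunk,
# so the full column never has to fit in memory.
stats = Series(Mean(), Variance())

# Stand-in for the real data stream: 100 chunks of 10,000 values each.
column_chunks = (randn(10_000) for _ in 1:100)

for chunk in column_chunks
    fit!(stats, chunk)
end

value(stats)  # current value of each statistic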

What is the format of your data (is it a CSV)?

In pandas, for example, you can do something like this:

import pandas as pd

chunk_size = 100_000  # rows per chunk; tune it so one chunk fits comfortably in memory
for chunk in pd.read_csv(filepath, chunksize=chunk_size):
    do_whatever(chunk)  # process each chunk independently

I'm not sure what the current dataframe ecosystem in Julia looks like, but there are probably options for doing something along those lines.
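Just as a sketch of what the Julia equivalent could look like, CSV.jl has a lazy row-by-row reader that keeps memory use low; the file name and the processing function here are made up:

using CSV

# CSV.Rows streams the file one row at a time instead of
# materializing the whole table in memory.
for row in CSV.Rows("big_medical_file.csv")
    do_whatever(row)  # hypothetical per-row processing
end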

In Python there was a proprietary tool, SFrame, which allowed you to do what you want. It was quite unique, but then Apple bought the company and the project was abandoned.
In any case, the philosophy behind HDF5 is quite similar, so it might be worth trying.

Try MonetDB. In my personal experience it is the fastest DB, and it should have no problem handling 100 GB of data. I have dealt with data as large as 300 GB.

Actually, you can use the R package foreach in combination with data.table to handle your data. My laptop has 16 GB of RAM and I can handle about 170 GB of data with them.

JuliaDB.jl, but it's not ready yet. In the meantime you can try disk.frame (https://github.com/xiaodaigh/disk.frame) and compare it with JuliaDB.jl.

At the moment, this is your best option, apart from coding a custom solution (e.g. using mmap).
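If you do go the custom route, here is a minimal sketch of memory mapping in Julia (the Mmap standard library also works on Windows), assuming the data has already been dumped to a flat binary file of Float64 values; the file name and dimensions are invented:

using Mmap

nrows, ncols = 1_000_000, 15_000  # assumed dimensions of the binary file

open("data.bin") do io
    # Map the file into memory; the OS pages data in and out as needed,
    # so the matrix can be much larger than RAM.
    A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))
    col_means = sum(A; dims=1) ./ nrows  # example of a pass over the columns
end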

I work with data of this size on a daily basis, and the HDF5 format fits nicely. Both Python and Julia can handle these files perfectly well, but you need to write some low-level routines to get the most out of the whole system. It's also a bit of work to put the data into the right HDF5 structure, but these are the basic problems of big data :wink:
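For illustration, a rough sketch of the kind of low-level routine I mean, using HDF5.jl; the file name, dataset name and block size are assumptions:

using HDF5

h5open("measurements.h5", "r") do file
    dset = file["matrix"]            # assumed 2-D dataset inside the file
    nrows = size(dset, 1)
    blocksize = 100_000
    for start in 1:blocksize:nrows
        stop = min(start + blocksize - 1, nrows)
        block = dset[start:stop, :]  # reads only this slice from disk
        # ... clean/transform the block and write results elsewhere ...
    end
end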

That was my candidate because you can run the server on both Linux and Windows, and it's supposed to be fast.

How does it compare with HDF5 in terms of ease of use and functionality?

I guess it's not possible to reshape data (pivoting from long to wide or vice versa) with HDF5 or MonetDB out of the box, nor through Julia commands. I would need to reimplement the functions myself to work in chunks with two datasets on disk, the input and the output, something like the sketch below. Is that right?
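For concreteness, this is roughly what I imagine that chunked wide-to-long pass would look like with CSV.jl and DataFrames.jl; the file names, id columns and chunk count are invented, and I have no idea whether it scales to 100 GB:

using CSV, DataFrames

id_cols = [:patient_id, :visit]  # hypothetical identifier columns
outfile = "long_format.csv"

# CSV.Chunks iterates the input file in pieces instead of loading it all at once.
for (i, chunk) in enumerate(CSV.Chunks("wide_format.csv"; ntasks=100))
    df_wide = DataFrame(chunk)
    # stack() pivots the measurement columns from wide to long.
    df_long = stack(df_wide, Not(id_cols); variable_name=:measurement, value_name=:value)
    # Append each reshaped chunk to the output file on disk.
    CSV.write(outfile, df_long; append = i > 1)
end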

Amazon EC2 is pretty cheap:

I’m looking for a solution that doesn’t cost me money.

I haven't tried it myself, but perhaps Vaex?

I'm sure it won't be long until JuliaDB is the best choice, but yeah, it's not there just yet.

Btw, Spark can do a lot more than “basic operations” but it is indeed probably not the best choice here.

In my opinion Spark is much more difficult.

What kind of things will be possible with the “new” JuliaDB and what won’t be possible?
How will it compare to Vaex or Dask?