What software/packages/databases should I use to work with 100 GB datasets?

I need to work with medical datasets of 10 GB to 100 GB (15,000 columns).
I need to reshape all the data, converting most of the columns from wide to long format.
Later I will cleanse the data (it is full of errors and missing values), transform some columns, add new columns, remove rows, filter the data, and then do multiple imputation and several statistical analyses such as mixed-effects models and survival analysis.
How do you suggest I do this?

Last time, with a smaller (5 GB) dataset, I used R's data.table for the cleansing and reshaping, but I had to write my own script to process it in chunks and append the results. It was very slow and error-prone.

I can't buy more memory, and in any case I would hit a limit sooner or later.

I think there are several options:

  • Use a traditional SQL or NoSQL database: they can deal with large datasets, but they offer limited functionality, and when you feed them columns full of errors and missing values they go crazy. I do like that they can work with compressed data.
    What database would you suggest? Maybe SciDB or MonetDB? Or work directly with Feather?
  • Use tools such as Julia, R, Stata…: they don't work well with large datasets. I think JuliaDB is not able to reshape data out-of-core or run a mixed-effects model.
  • Use mmap or another swapping strategy (such as ff). I'm on Windows and I don't know of a good, fast way to do it.
  • Use Spark, Flink, or Dask: I have almost no experience with them; I think they are difficult to use and can only do some basic operations.

What would you suggest?
My university computer has 16 GB of RAM and a 1 TB SSD.

I did something like that in R once and I feel for you. It truly was a pain. If you are looking to do some local disk-mapped work plus some database-level operations, Arrow may be a good bet. I have yet to test it (I'm waiting for the right project), but it looks really promising and may offer an easier way of grabbing chunks of data than HDF5. There's also OnlineStats.jl for the analysis side of things.
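To give an idea of the OnlineStats.jl side, here is a minimal sketch of accumulating statistics one chunk at a time; the chunk generator below is just a stand-in for however you end up streaming the real data:

using OnlineStats

# Statistics that are updated incrementally, chunk by chunk,
# so the full column never has to fit in memory.
stats = Series(Mean(), Variance())

# Stand-in for the real data stream: 100 chunks of 10,000 values each.
column_chunks = (randn(10_000) for _ in 1:100)

for chunk in column_chunks
    fit!(stats, chunk)
end

value(stats)  # current value of each statistic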

What is the format of your data (is it a CSV)?

In pandas, for example, you can do something like this:

import pandas as pd

chunk_size = 100_000  # rows per chunk; tune it so one chunk fits comfortably in memory
for chunk in pd.read_csv(filepath, chunksize=chunk_size):
    do_whatever(chunk)  # process each chunk independently

I'm not sure what the current dataframe ecosystem in Julia looks like, but there are probably options for doing something along those lines.
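Just as a sketch of what the Julia equivalent could look like, CSV.jl has a lazy row-by-row reader that keeps memory use low; the file name and the processing function here are made up:

using CSV

# CSV.Rows streams the file one row at a time instead of
# materializing the whole table in memory.
for row in CSV.Rows("big_medical_file.csv")
    do_whatever(row)  # hypothetical per-row processing
end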

In Python there was a proprietary tool, SFrame, which allowed you to do what you want. It was quite unique, but then Apple bought the company and the project was abandoned.
In any case, the philosophy behind HDF5 is quite similar, so it might be worth trying.

Try MonetDB. In my personal experience it is the fastest DB, and it should have no problem handling 100 GB of data. I have dealt with data as large as 300 GB.

Actually, you can use the R package foreach in combination with data.table to handle your data. My laptop has 16 GB of RAM and I can handle about 170 GB of data with them.

JuliaDB.jl, but it's not ready yet. In the meantime you can try disk.frame (https://github.com/xiaodaigh/disk.frame) and compare it with JuliaDB.jl.

At the moment, this is your best option, apart from coding a custom solution (e.g. using mmap).
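If you do go the custom route, here is a minimal sketch of memory mapping in Julia (the Mmap standard library also works on Windows), assuming the data has already been dumped to a flat binary file of Float64 values; the file name and dimensions are invented:

using Mmap

nrows, ncols = 1_000_000, 15_000  # assumed dimensions of the binary file

open("data.bin") do io
    # Map the file into memory; the OS pages data in and out as needed,
    # so the matrix can be much larger than RAM.
    A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))
    col_means = sum(A; dims=1) ./ nrows  # example of a pass over the columns
end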

I work with data of this size on a daily basis, and the HDF5 format fits nicely. Both Python and Julia can handle these files perfectly well, but you need to write some low-level routines to get the most out of the whole system. It's also a bit of work to put the data into the right HDF5 structure, but these are the basic problems of big data :wink:
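For illustration, a rough sketch of the kind of low-level routine I mean, using HDF5.jl; the file name, dataset name and block size are assumptions:

using HDF5

h5open("measurements.h5", "r") do file
    dset = file["matrix"]            # assumed 2-D dataset inside the file
    nrows = size(dset, 1)
    blocksize = 100_000
    for start in 1:blocksize:nrows
        stop = min(start + blocksize - 1, nrows)
        block = dset[start:stop, :]  # reads only this slice from disk
        # ... clean/transform the block and write results elsewhere ...
    end
end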

That was my candidate because you can run the server on both Linux and Windows, and it's supposed to be fast.

How does it compare with HDF5 in terms of ease of use and functionality?

I guess it's not possible to reshape data (pivoting from long to wide or vice versa) with HDF5 or MonetDB out of the box, nor through Julia commands. I would need to reimplement the functions myself to work in chunks with two datasets on disk, the input and the output, something like the sketch below. Is that right?
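For concreteness, this is roughly what I imagine that chunked wide-to-long pass would look like with CSV.jl and DataFrames.jl; the file names, id columns and chunk count are invented, and I have no idea whether it scales to 100 GB:

using CSV, DataFrames

id_cols = [:patient_id, :visit]  # hypothetical identifier columns
outfile = "long_format.csv"

# CSV.Chunks iterates the input file in pieces instead of loading it all at once.
for (i, chunk) in enumerate(CSV.Chunks("wide_format.csv"; ntasks=100))
    df_wide = DataFrame(chunk)
    # stack() pivots the measurement columns from wide to long.
    df_long = stack(df_wide, Not(id_cols); variable_name=:measurement, value_name=:value)
    # Append each reshaped chunk to the output file on disk.
    CSV.write(outfile, df_long; append = i > 1)
end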

Amazon EC2 is pretty cheap:

I’m looking for a solution that doesn’t cost me money.

I haven't tried it myself, but perhaps Vaex?

I'm sure it won't be long until JuliaDB is the best choice, but yeah, it's not there just yet.

Btw, Spark can do a lot more than “basic operations” but it is indeed probably not the best choice here.

In my opinion Spark is much more difficult.

What kind of things will be possible with the “new” JuliaDB and what won’t be possible?
How will it compare to Vaex or Dask?