Package for reading/writing ~100GB data files

Juan · November 13, 2018, 1:46am

Then, what is nowadays the fastest package/function/options to read and write, say, 100GB of data with Julia?
For example with 10000 columns. Each column with all elements of the same type and a header.
It would be great to be able to read and write directly compressed files.

I would like to find in that table the columns or the rows containing a certain value and read them to calculate something else, for example the median, or to perform a regression.
Up to now I was using R’s data.table function fread for smaller sets and I had to use sqlite for larger datasets.

mbauman · November 13, 2018, 2:00am

I moved this to its own topic (instead of the 2-year old JuliaDB announcement thread)

xiaodai · November 13, 2018, 2:38am

That’s the last time I looked at it for CSV: Benchmarking ways to write/load DataFrames IndexedTables to disk

Yifan_Liu · November 13, 2018, 2:24am

R data.table will be able to deal with out-of-memory data in the near future, so you will not need to use an external database. In Julia, the currently feasible way to read large data sets is JuliaDB.

xiaodai · November 13, 2018, 10:23pm

Where did you get this info? Also, given that there is no easy way to convert data from R to Julia, does this help with reading data into Julia?

Yifan_Liu · November 13, 2018, 10:51pm

I chatted with a dev guy of RDatatable a couple of days ago, and he told me that the out-of-memory functionality will be available in the near future. Since data.table is written in C, I guess it can be transferred to Julia without too much hassle. Now they have a Python version of data.table

Juan · November 14, 2018, 12:27am

Anyway I guess only some operations such as reading rows, columns, filtering will be available, like in a traditional database.
The future of scientific computation will be to be able to perform all kinds of computations directly on disk, for example a regression with random effects or MCMC with a dataset much larger than memory.

xiaodai · November 14, 2018, 3:19am

JuliaDB.jl is meant to have lots of online algorithms to do that.
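
To illustrate what an "online algorithm" means here, below is a minimal sketch in plain Julia. It is not JuliaDB's or OnlineStats' API; the `RunningStats` type and `update!` function are hypothetical names made up for this example. It uses Welford's method to update a mean and variance one value at a time, so only the current chunk has to be in memory.

```julia
# Running mean and variance (Welford's algorithm), updated one value at a time.
# RunningStats and update! are illustrative names, not part of any package.
mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64   # sum of squared deviations from the current mean
end

RunningStats() = RunningStats(0, 0.0, 0.0)

function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Feed a whole chunk, e.g. one block of rows read from disk.
function update!(s::RunningStats, chunk::AbstractVector)
    for x in chunk
        update!(s, x)
    end
    return s
end

# Sample variance; undefined (NaN) for fewer than two observations.
variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Stand-in for chunks read from a file that does not fit in memory.
chunks = ([1.0, 2.0, 3.0], [4.0, 5.0], [6.0])
stats = RunningStats()
for chunk in chunks
    update!(stats, chunk)
end
```

Statistics such as an exact median do not update this way, which is why out-of-core packages usually offer approximations for them.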

Tamas_Papp · November 14, 2018, 6:36am

AFAIK data.table only needs to deal with the types allowed by R (integer, float, categorical, string, date, maybe I forgot something?). Writing an efficient implementation for a restricted number of bits types is considerably easier than handling generic types, which may or may not be bits types.

MA_Laforge · November 17, 2018, 1:39am

My new preferred format for saving large datasets is HDF5. To read/write in this format in Julia, you can use the HDF5.jl package.

It can be installed with the following add statement:

] add HDF5

It appears the HDF format was originally developed at the “National Center for Supercomputing Applications”, presumably for dealing with large amounts of data. You might find the Wiki page interesting:

Note that HDF5 supports the use of compression algorithms, but I have not personally used them.

Good to save matrices of a single data type

This format excels at writing large matrices of data (not typically what people refer to as “tables” of data, though). In other words, it is great if all elements of a “table” are of the same data type (ex: all Float64).
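
As a rough sketch of that use case, assuming a reasonably recent HDF5.jl: the file location and the dataset name `data/A` below are made up for the example. It writes one `Float64` matrix and then reads back only part of it, which is the feature that matters when the full matrix does not fit in memory.

```julia
using HDF5

path = joinpath(mktempdir(), "matrix.h5")   # hypothetical file location

A = reshape(collect(1.0:20.0), 4, 5)        # 4×5 Float64 matrix
h5write(path, "data/A", A)                  # create the file and write the dataset

# Read just rows 2:3 of columns 1:2 without loading the whole matrix.
block = h5read(path, "data/A", (2:3, 1:2))

# Or open the file and index the dataset directly.
col3 = h5open(path, "r") do file
    file["data/A"][:, 3]
end
```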

Saving columns of different data types

I believe it is possible for HDF5 to natively define a table format with different types for different columns… but I personally have not used this feature much. That being said, I believe that what you call a “table” is called a “Compound Dataset” in HDF5-speak.

Saving columns of different data types: A hack

Note that HDF5 is a hierarchical data storage format… so, if you wish to save columns that each store data of a different type, you could, in theory, manually break down your “table” into columns that are each stored as a separate dataset under a group (a “subfolder”) named after your table. For example:

mytable1/key: stores vector (int)
mytable1/name: stores vector (string)
mytable1/weight: stores vector (float64)
mytable1/height: stores vector (float64)
...
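
A minimal sketch of this layout, assuming a recent HDF5.jl (where `create_group` is available); the file location and the sample values are invented for illustration, and HDF5 paths use forward slashes. Each column becomes one dataset in the group `mytable1`, so a single column can be read back without touching the others.

```julia
using HDF5
using Statistics

path = joinpath(mktempdir(), "table.h5")    # hypothetical file location

# Write each column as its own dataset inside the group "mytable1".
h5open(path, "w") do file
    g = create_group(file, "mytable1")
    write(g, "key",    [1, 2, 3])
    write(g, "name",   ["ann", "bob", "cho"])
    write(g, "weight", [61.5, 82.0, 70.25])
    write(g, "height", [1.62, 1.80, 1.75])
end

# Read back only the columns needed for a calculation.
weight, people = h5open(path, "r") do file
    read(file["mytable1/weight"]), read(file["mytable1/name"])
end

median_weight = median(weight)
```

The trade-off is that nothing enforces that the columns have equal length or stay in the same row order; that bookkeeping is left to the code that writes the file.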

Useful tools

The HDF group provides a viewer tool that lets you browse the contents of an HDF5 file. It basically makes the file format feel somewhat like a text file… because you can always look at the “raw” binary data with this tool:

You can even start seeding an HDF5 file using this GUI-based tool - or simply modify the contents of a pre-existing dataset.

Juan · November 17, 2018, 1:10pm

What if I want to fit a regression model with random effects (or a survival analysis) with data that doesn’t fit in memory?
Can JuliaDB.jl do it, or maybe OnlineStats.jl or combining MixedModels with any of them?
Some day I will open a new thread with an example to see different solutions.
