Package for reading/writing ~100GB data files


#1

What is currently the fastest package, function, or set of options for reading and writing, say, 100 GB of data with Julia?
For example, a table with 10,000 columns, where each column has a header and all of its elements are of the same type.
It would be great to be able to read and write directly compressed files.

I would like to find the columns or rows of that table that contain a certain value and read them in order to calculate something else, for example the median, or to perform a regression.
Until now I have been using fread from R’s data.table for smaller datasets, and I had to use SQLite for larger ones.


ANN: JuliaDB.jl
#2

I moved this to its own topic (instead of the 2-year-old JuliaDB announcement thread).


#3

That’s the last time I looked at it for CSV: Benchmarking ways to write/load DataFrames IndexedTables to disk


#4

R’s data.table will be able to handle data that does not fit in memory in the near future, so you will not need an external database. In Julia, the currently feasible way to read large datasets is JuliaDB.


ANN: JuliaDB.jl
#5

Where did you get this info? Also, given that there is no easy way to convert data from R to Julia, does this help with reading data into Julia?


#6

I chatted with one of the data.table developers a couple of days ago, and he told me that the out-of-memory functionality will be available in the near future. Since data.table is written in C, I guess it can be ported to Julia without too much hassle. They now have a Python version of data.table as well.


#7

Anyway, I guess only some operations, such as reading rows or columns and filtering, will be available, like in a traditional database.
The future of scientific computing will be the ability to perform all kinds of computations directly on disk, for example a regression with random effects or MCMC on a dataset much larger than memory.


#8

JuliaDB.jl is meant to have lots of online algorithms to do that.
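
To make "online algorithm" concrete, here is a minimal sketch of an ordinary least-squares fit that only ever sees one chunk of rows at a time. It is not JuliaDB's API; it uses only the LinearAlgebra standard library, and the function name chunked_ols and the toy data are made up for illustration. The idea is that X'X and X'y can be accumulated chunk by chunk, so the full table never has to be in memory (assuming the number of columns p is small enough for a p-by-p matrix).

```julia
using LinearAlgebra

# Hypothetical helper: accumulate the sufficient statistics X'X and X'y
# over chunks of rows, then solve the normal equations once at the end.
function chunked_ols(chunks, p::Integer)
    XtX = zeros(p, p)
    Xty = zeros(p)
    for (X, y) in chunks
        XtX .+= X' * X    # p-by-p, independent of the number of rows
        Xty .+= X' * y
    end
    return XtX \ Xty
end

# Toy data standing in for blocks of rows read from disk (assumption).
X = [ones(6) collect(1.0:6.0)]      # intercept column plus one predictor
y = 2.0 .+ 3.0 .* X[:, 2]           # exact line y = 2 + 3x
chunks = [(X[1:3, :], y[1:3]), (X[4:6, :], y[4:6])]

beta = chunked_ols(chunks, 2)
```

With the toy data the two chunks together recover the intercept and slope used to generate y. Models with random effects do not reduce to a single pass like this, so treat it only as an illustration of the general idea.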


#9

AFAIK data.table only needs to deal with the types allowed by R (integer, float, categorical, string, date, maybe I forgot something?). Writing an efficient implementation for a restricted number of bits types is considerably easier than handling generic types, which may or may not be bits types.


#10

My new preferred format for saving large datasets is HDF5. To read/write in this format in Julia, you can use the HDF5.jl package.

It can be installed from the package manager (press ] at the Julia REPL to enter Pkg mode, then type the add command):

] add HDF5

It appears the HDF format was originally developed at the “National Center for Supercomputing Applications”, presumably for dealing with large amounts of data. You might find its Wiki page interesting.

Note that HDF5 supports the use of compression algorithms, but I have not personally used them.

Good to save matrices of a single data type

This format excels at writing large matrices of data (not typically what people refer to as “tables” of data, though). In other words, it is great if all elements of a “table” are of the same data type (ex: all Float64).
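
As a rough illustration of the single-type case, here is a minimal sketch that assumes HDF5.jl is installed. The file path, the dataset name "A" and the tiny matrix are made up; a real dataset would be far larger. It writes a Float64 matrix and then reads back a single column without loading the rest.

```julia
using HDF5   # third-party package, assumed to be installed

path = joinpath(mktempdir(), "demo.h5")    # hypothetical file location
A = reshape(collect(1.0:12.0), 4, 3)       # small 4x3 Float64 matrix as a stand-in

# Store the whole matrix as one dataset named "A".
h5open(path, "w") do f
    write(f, "A", A)
end

# Index the dataset on disk so that only the second column is read.
col2 = h5open(path, "r") do f
    vec(f["A"][:, 2:2])
end
```

Indexing the dataset object, rather than calling read on the whole thing, is what keeps memory use proportional to the slice you ask for.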

Saving columns of different data types

I believe it is possible for HDF5 to natively define a table format with different types for different columns… but I personally have not used this feature much. That being said, I believe that what you call a “table” is called a “Compound Dataset” in HDF5-speak.

Saving columns of different data types: A hack

Note that HDF5 is a hierarchical data storage format… so, if you wish to save columns that each store data of a different type, you could, in theory, manually break down your “table” into columns that are each stored as a separate dataset inside a “subfolder” (a group, in HDF5 terms) named after your table. For example:

mytable1/key: stores vector (int)
mytable1/name: stores vector (string)
mytable1/weight: stores vector (float64)
mytable1/height: stores vector (float64)
...
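
The layout above could be sketched roughly as follows, assuming a reasonably recent HDF5.jl (older releases named the group-creation function differently). The file path and the example values are invented; only the group and column names come from the list above.

```julia
using HDF5         # third-party package, assumed to be installed
using Statistics   # standard library, for median

path = joinpath(mktempdir(), "table.h5")   # hypothetical file location

# One group per table, one dataset per column; each column has its own type.
h5open(path, "w") do f
    g = create_group(f, "mytable1")
    g["key"]    = [1, 2, 3]
    g["name"]   = ["ann", "bob", "cy"]
    g["weight"] = [61.5, 80.2, 72.0]
    g["height"] = [1.65, 1.80, 1.75]
end

# Read a single column back by its path; the other columns stay on disk.
weight = h5read(path, "mytable1/weight")
med = median(weight)
```

Because each column is its own dataset, pulling one column out of a very wide table costs only the size of that column.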

Useful tools

The HDF group provides a viewer tool that lets you browse the contents of an HDF5 file. It basically makes the file format feel somewhat like a text file… because you can always look at the “raw” binary data with this tool:

https://support.hdfgroup.org/products/java/hdfview/

You can even start seeding an HDF5 file using this GUI-based tool - or simply modify the contents of a pre-existing dataset.


#11

What if I want to fit a regression model with random effects (or a survival analysis) to data that doesn’t fit in memory?
Can JuliaDB.jl do it, or maybe OnlineStats.jl or combining MixedModels with any of them?
Some day I will open a new thread with an example to see different solutions.