What data structure to use to hold large categorical dataset for analytics?

ilanggear · March 13, 2017, 10:34pm

I have a dataset that is approximately 1 million rows by 40 columns of mostly categorical values, and I need to hold it in memory while I build indexes (themselves needing lots of memory) and do some ad-hoc analytics on it. I’m wondering what type of data structure I should use. I can compress it by mapping strings to ints or to bitfields, but I’m not sure what the tradeoff will be in terms of size vs. speed vs. programming complexity. Can anyone advise me on this? Should I use BitArrays, DataTables, or build my own custom structure? Eventually (in a year or so) I will want to scale up to much larger datasets, so I would prefer something that is a step in that direction. Any good advice will be much appreciated!!

Tamas_Papp · March 14, 2017, 7:04am

I did something similar recently (irregular data in a “ragged array”). I encoded the values as Enums to minimize data interpretation errors; that was the tedious step, but after that it was very easy to work with. I put the result in a vector of vectors (of structs).
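A minimal sketch of what such an Enum encoding could look like, using only Base Julia. This is not the poster's actual code: the category names, the `Item` struct, and the `parse_item` helper are all hypothetical, chosen only to illustrate the idea.

```julia
# Hypothetical categories, for illustration only.
@enum Color::UInt8 red green blue
@enum Shirt::UInt8 small medium large

# One record per observation; each field is stored as a single byte.
struct Item
    color::Color
    shirt::Shirt
end

# Lookup tables from raw strings to enum values. A string that is not in the
# table raises a KeyError, so unexpected input is caught at load time.
color_lookup = Dict(string(c) => c for c in instances(Color))
shirt_lookup = Dict(string(s) => s for s in instances(Shirt))

parse_item(c::AbstractString, s::AbstractString) = Item(color_lookup[c], shirt_lookup[s])

# A "ragged array": a vector of vectors of structs, with groups of different lengths.
groups = [
    [parse_item("red", "small"), parse_item("blue", "large")],
    [parse_item("green", "medium")],
]

# Ad-hoc query: how many items are red across all groups?
n_red = count(item -> item.color == red, Iterators.flatten(groups))
```

Because both fields are backed by `UInt8`, each `Item` takes two bytes, and comparisons such as `item.color == red` are integer comparisons rather than string comparisons.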

I may have misunderstood something, but 1e6 x 40 records is not “large” on today’s computers. Should be very fast, and coding time will most likely dominate execution time. So do what is easiest.
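A rough back-of-the-envelope estimate of the size, as a sketch. It assumes every column has at most 255 distinct values (so one `UInt8` code per cell is enough) and ignores the small per-column lookup tables; the variable names are illustrative.

```julia
rows, cols = 1_000_000, 40

# One code per cell, at different code widths.
bytes_uint8  = rows * cols * sizeof(UInt8)    # 1 byte per cell
bytes_uint32 = rows * cols * sizeof(UInt32)   # 4 bytes per cell

mb_uint8  = bytes_uint8  / 10^6   # megabytes with 1-byte codes
mb_uint32 = bytes_uint32 / 10^6   # megabytes with 4-byte codes
```

Under these assumptions the encoded table is on the order of tens to a few hundred megabytes.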

DNF · March 14, 2017, 8:23am

I recently came across PooledDataArrays in the DataArrays.jl package (JuliaStats/DataArrays.jl on GitHub, now marked as deprecated).

I haven’t used it, but the readme.md says:

When working with categorical data sets in which a large number of data points occur, but only take on a limited set of unique values, we provide an analog to DataArray that is optimized for efficient memory usage: PooledDataArray.

nalimilan · March 14, 2017, 8:53am

I would say the standard structure for this kind of data is a DataFrame with PooledDataArray columns, or a DataTable with CategoricalArray columns. Both are very similar (and the memory layout of the underlying integer codes is the same for both types), but DataFrames is a more stable package while DataTables is under more active development (and therefore not completely stabilized).
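To make the idea of "underlying integer codes" concrete, here is a small sketch in Base Julia. It is not the PooledDataArray or CategoricalArray implementation; `PooledColumn`, `value_at`, and `level_counts` are hypothetical names used only to show the general layout: one small integer per row plus a pool that stores each distinct value once.

```julia
# Illustration only: a tiny pooled column, not the package implementation.
struct PooledColumn
    codes::Vector{UInt8}    # one small integer per row
    pool::Vector{String}    # each distinct value stored once
end

function PooledColumn(values::AbstractVector{<:AbstractString})
    pool = String[]
    index = Dict{String,UInt8}()
    codes = Vector{UInt8}(undef, length(values))
    for (i, v) in enumerate(values)
        # Reuse the existing code for v, or assign the next free one.
        codes[i] = get!(index, v) do
            length(pool) < typemax(UInt8) || error("more than 255 distinct values")
            push!(pool, v)
            UInt8(length(pool))
        end
    end
    return PooledColumn(codes, pool)
end

# Decode a single row back to its string value.
value_at(col::PooledColumn, i::Integer) = col.pool[col.codes[i]]

# Count occurrences of each level by scanning only the integer codes.
function level_counts(col::PooledColumn)
    n = zeros(Int, length(col.pool))
    for c in col.codes
        n[c] += 1
    end
    return Dict(zip(col.pool, n))
end

col = PooledColumn(["yes", "no", "yes", "yes", "maybe"])
tally = level_counts(col)
```

Here each row costs one byte regardless of how long the strings are, and tallies run over the integer codes without touching the strings.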
