What data structure should I use to hold a large categorical dataset for analytics?


I have a dataset of approximately 1 million rows by 40 columns of mostly categorical values, and I need to hold it in memory while I build indexes (which themselves need a lot of memory) and do some ad-hoc analytics on it. I’m wondering what type of data structure I should use. I can compress it by mapping strings to ints or to bitfields, but I’m not sure what the tradeoff will be in terms of size versus speed, or versus programming complexity. Can anyone advise me on this? Should I use BitArrays, DataTables, or build my own custom structure? Eventually (in a year or so) I will want to scale up to much larger datasets, so something that is a step in that direction would be preferred. Any good advice will be much appreciated!


I did something similar recently (irregular data in a “ragged array”). I encoded the values as Enums to minimize data interpretation errors. That was the tedious step, but after that the data was very easy to work with. I stored it in a vector of vectors (of structs).
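For illustration, here is a minimal sketch of that kind of layout, assuming a recent Julia (1.x) and using only Base. The categories (`Color`, `Size`), the `Record` struct and the `parse_color` helper are hypothetical stand-ins, since the actual columns are not described in this thread.

```julia
# Hypothetical categories; the real dataset's columns are not given in the thread.
@enum Color::UInt8 red green blue
@enum Size::UInt8 small medium large

# One record per observation; each enum field is stored as a single byte.
struct Record
    color::Color
    size::Size
end

# Ragged layout: each inner vector can hold a different number of records.
groups = Vector{Vector{Record}}()
push!(groups, [Record(red, small), Record(blue, large)])
push!(groups, [Record(green, medium)])

# Converting raw strings through a lookup table fails loudly on unknown labels,
# which is where the protection against interpretation errors comes from.
const COLOR_BY_NAME = Dict(string(c) => c for c in instances(Color))
parse_color(s::AbstractString) = COLOR_BY_NAME[s]   # throws KeyError on bad input

# Ad-hoc query: count records with a given color across all groups.
n_red = sum(count(r -> r.color == red, g) for g in groups)
```

The tedious part is writing out the `@enum` definitions and the parsing step for every column. After that, queries are ordinary comparisons on small integer-backed values.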

I may have misunderstood something, but 1e6 x 40 values is not “large” on today’s computers. Working with it should be very fast, and coding time will most likely dominate execution time. So do whatever is easiest.
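A rough back-of-the-envelope check of that claim, as a sketch. The bytes-per-value figures are assumptions about how the values would be stored, and the estimate ignores the string data itself and any indexes built on top.

```julia
rows, cols = 1_000_000, 40

# Size in MB (10^6 bytes) for a given number of bytes per stored value.
size_mb(bytes_per_value) = rows * cols * bytes_per_value / 1e6

mb_uint8 = size_mb(1)   # one-byte codes (up to 256 levels per column): 40.0 MB
mb_int64 = size_mb(8)   # 8-byte integers or pointers: 320.0 MB
```

Either way the table itself fits comfortably in the memory of an ordinary machine, so the indexes are more likely to be the limiting factor.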


I recently came across PooledDataArrays: https://github.com/JuliaStats/DataArrays.jl#pooleddataarrays

I haven’t used it, but the readme.md says:

When working with categorical data sets in which a large number of data points occur, but only take on a limited set of unique values, we provide an analog to DataArray that is optimized for efficient memory usage: PooledDataArray.
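To make the idea concrete, here is a minimal sketch of pooled storage written with Base Julia only. It is not the DataArrays.jl implementation; `PooledColumn`, `pool_encode` and `decode` are hypothetical names, and the one-byte code type is an assumption that limits each column to 255 levels.

```julia
# Minimal pooled column: a pool of unique labels plus one small integer code per row.
struct PooledColumn
    pool::Vector{String}   # unique labels, in order of first appearance
    codes::Vector{UInt8}   # index into `pool` for each row (at most 255 levels)
end

function pool_encode(labels::Vector{String})
    pool = String[]
    index = Dict{String,UInt8}()
    codes = Vector{UInt8}(undef, length(labels))
    for (i, v) in enumerate(labels)
        # Look up the code for this label, registering a new level if needed.
        codes[i] = get!(index, v) do
            length(pool) < 255 || error("more than 255 levels; use a wider code type")
            push!(pool, v)
            UInt8(length(pool))
        end
    end
    return PooledColumn(pool, codes)
end

# Decode a single row back to its label.
decode(col::PooledColumn, i::Integer) = col.pool[col.codes[i]]

col = pool_encode(["cat", "dog", "cat", "bird", "dog"])

# Comparisons run on the integer codes rather than on the strings.
dog_code = findfirst(==("dog"), col.pool)
n_dogs = count(==(dog_code), col.codes)
```

Each row costs one byte here regardless of how long the label is, and the label text is stored once in the pool.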


I would say the standard structure for this kind of data is a DataFrame with PooledDataArray columns, or a DataTable with CategoricalArray columns. Both are very similar (and the memory layout of the underlying integer codes is the same for both types), but DataFrames is the more stable package, while DataTables is under more active development (and therefore not completely stabilized).