I am excited to announce a new package for data manipulation in pure Julia.
Introduction
InMemoryDatasets.jl is a multi-threaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package's performance is tuned with two goals in mind: a) low overhead for allowing missing values everywhere, and b) the following priorities, in order of importance:

- Low compilation time
- Memory efficiency
- High performance

I tried to keep the overall complexity of the package as low as possible, to simplify:

- maintaining the package
- adding new features to the package
- contributing to the package
Features
InMemoryDatasets.jl has many interesting features; here I highlight some of my favourites (in no particular order):

- Assigning a named function to a column as its format
  - By default, formatted values are used for operations like displaying, sorting, grouping, joining, …
  - Format evaluation is lazy
  - Formats don't change the actual values
- Multi-threading across the whole package
  - Most functions in InMemoryDatasets.jl exploit all cores available to Julia by default
  - Parallel computation can be disabled by passing the `threads = false` keyword argument to functions
- Flexible reshaping
  - Stacking and un-stacking on single/multiple columns
  - Wide-to-long and long-to-wide reshaping
  - Transposing and more
- Fast sorting algorithms
  - Stable and unstable HeapSort and QuickSort algorithms
  - Count sort for integers
- Compiler-friendly grouping algorithms
  - `groupby!`/`groupby` to group observations using sorting algorithms - sorted order
  - `gatherby` to group observations using hybrid hash algorithms - observation order
  - Incremental grouping for `groupby!`/`groupby`, i.e. adding one column at a time
- Efficient joining algorithms
  - Preserving the order of observations in the left data set
  - Support for two methods of joining: sort-merge join and hash join
  - Customised columnar-hybrid-hash algorithms for joins
  - Inequality-kind (non-equi) and range joins for `innerjoin`, `contains`, `semijoin!`/`semijoin`, `antijoin!`/`antijoin`
  - `closejoin!`/`closejoin` for non-exact match joins
  - `update!`/`update` for updating a master data set with values from a transaction data set
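To give a flavour of why count sort helps with grouping integer columns, here is a minimal pure-Julia sketch of counting-sort-based grouping. This is an illustration of the general technique only, not IMD's actual implementation; the function name and layout are my own.

```julia
# Group rows by small integer keys in the range lo:hi using counting sort.
# Returns a permutation that orders rows by group, plus the end position of
# each group in the permuted order. Illustrative sketch, not IMD internals.
function group_by_count(keys::Vector{Int}, lo::Int, hi::Int)
    counts = zeros(Int, hi - lo + 1)
    for k in keys
        counts[k - lo + 1] += 1            # one pass to count each key
    end
    offsets = cumsum(counts)               # group end positions in sorted order
    perm = Vector{Int}(undef, length(keys))
    next = [1; offsets[1:end-1] .+ 1]      # next free slot for each group
    for (i, k) in enumerate(keys)
        g = k - lo + 1
        perm[next[g]] = i                  # place row index into its group
        next[g] += 1
    end
    return perm, offsets
end

perm, offsets = group_by_count([3, 1, 2, 1, 3], 1, 3)
# keys[perm] is now grouped: [1, 1, 2, 3, 3]
```

Two linear passes and no comparisons, which is why this beats comparison-based sorting when keys are integers in a known, narrow range.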
Benchmarks
The following benchmarks present preliminary results for InMemoryDatasets.jl (IMD) compared to some other data manipulation packages. The benchmarks are based on the db-benchmark repository, with a few changes:

- Except for DataFrames.jl (and basic familiarity with pandas), I have very limited knowledge of the other solutions listed in db-benchmark. Thus, I simply picked some of the solutions with easy setup from the top: polars, DataFrames.jl, data.table.
- I report the total time, and use "fail" when a solution can't complete a task.
- I use a Linux RHEL7 machine with 16 cores and 128GB of memory. The machine is not ideal for benchmarking (not dedicated or isolated); however, the situation is the same for all solutions, and I randomise the order of runs to alleviate the problem. In any case, I have submitted a PR to the db-benchmark project.
- The results are based on the latest versions of the solutions + the latest PRs submitted to the db-benchmark project.
- I only select the advanced questions for the groupby task. The main reason is that IMD uses a less rigorous approach for the basic questions - a matter of optimisation priority.
- Only data sets with no missing values are considered. The reason for this is that I initially included pandas, and it couldn't handle columns with missing values in groupby. Nevertheless, this has no effect on IMD, since IMD columns always allow `Missing` in their element type.
- For the join task I increase the machine's memory to 178GB. I add 50GB of memory to make sure that there is enough space for loading the data in the 50GB case.
The groupby task timing in seconds - smaller is better. Numbers in parentheses are the total time for the second run only (mitigating the compilation time for IMD and DF.jl).

| Data | IMD | polars | DF.jl | DT |
|------------|----------|--------|---------|------------|
| 1e7 - 2e0 | 9(3) | 6(3) | 17(6) | 136(67) |
| 1e7 - 1e1 | 8(2) | 4(2) | 13(5) | 32(15) |
| 1e7 - 1e2 | 9(3) | 4(2) | 10(3) | 10(5) |
| 1e8 - 2e0 | 60(28) | 64(33) | 150(73) | 1152(586) |
| 1e8 - 1e1 | 51(23) | 48(24) | 118(56) | 414(198) |
| 1e8 - 1e2 | 50(22) | 41(20) | 81(39) | 106(52) |
| 1e9 - 2e0 | 540(267) | fail | fail | fail |
| 1e9 - 1e1 | 514(258) | fail | fail | 3945(1904) |
| 1e9 - 1e2 | 439(221) | fail | fail | 1143(546) |
The join task timing in seconds - smaller is better.

| Data | IMD | polars | DF.jl | DT |
|------|----------|--------|--------|--------|
| 1e7 | 5(1.5) | 3(1.5) | 8(3) | 8(4) |
| 1e8 | 46(22) | 43(23) | 91(45) | 88(46) |
| 1e9 | 356(178) | fail | fail | fail |
Acknowledgement
I would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since the existence of their work gave the development of InMemoryDatasets.jl a head start.
This package looks on the surface to be almost a reimplementation of DataFrames.jl. Can you elaborate on why your improvements required a separate package? The basic principles should be the same: both packages deal with general column-oriented tables.
- It was a fresh re-write (EDIT: after reading the source code of the package, it seems it took the DataFrames.jl sources that the creator liked and dropped the parts that were baggage), so it does not have the baggage of not breaking things that we have in DataFrames.jl.
- It currently makes more assumptions about what data it can store/process and uses these assumptions in the algorithms (DataFrames.jl is designed to store anything that is valid Julia "as is"). Of course, in the future these restrictions may be lifted.
An example of the second point:
```julia
julia> name = Dataset(ID = vcat.([1, 2, 3]), Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
 Row │ ID        Name
     │ identity  identity
     │ Array…?   String?
─────┼─────────────────────
   1 │ [1]       John Doe
   2 │ [2]       Jane Doe
   3 │ [3]       Joe Blogs

julia> job = Dataset(ID = vcat.([1, 2, 2, 4]), Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
 Row │ ID        Job
     │ identity  identity
     │ Array…?   String?
─────┼────────────────────
   1 │ [1]       Lawyer
   2 │ [2]       Doctor
   3 │ [2]       Florist
   4 │ [4]       Farmer

julia> leftjoin(name, job, on = :ID)
ERROR: MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type Integer

julia> leftjoin(DataFrame(name), DataFrame(job), on = :ID)
4×3 DataFrame
 Row │ ID      Name       Job
     │ Array…  String     String?
─────┼────────────────────────────
   1 │ [1]     John Doe   Lawyer
   2 │ [2]     Jane Doe   Doctor
   3 │ [2]     Jane Doe   Florist
   4 │ [3]     Joe Blogs  missing
```
I don't understand the internals well enough, but assuming that your point here is that leftjoin in InMemoryDatasets squeezes out extra performance by restricting the valid types of index columns to join on, would you consider this a missing optimization in DataFrames which could be filled in with multiple dispatch providing a "fast path" leftjoin for certain column types?
leftjoin by default uses the sort method; for situations where the sort method is not well defined, the user should use the hash method for joining, thus:
```julia
julia> leftjoin(name, job, on = :ID, method = :hash)
4×3 Dataset
 Row │ ID        Name       Job
     │ identity  identity   identity
     │ Array…?   String?    String?
─────┼───────────────────────────────
   1 │ [1]       John Doe   Lawyer
   2 │ [2]       Jane Doe   Doctor
   3 │ [2]       Jane Doe   Florist
   4 │ [3]       Joe Blogs  missing
```
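The distinction is that a hash join only requires the keys to support hashing and equality, not ordering, which is why it succeeds here where the sort-based method does not. A minimal pure-Julia sketch of the idea (illustrative only, not IMD's actual algorithm; the function name is my own):

```julia
# Sketch of a hash left join: build a Dict on the right keys, then probe
# with each left key. Keys only need hashing/equality, not isless.
function hash_leftjoin(left_keys, right_keys)
    lookup = Dict{Any,Vector{Int}}()
    for (j, k) in enumerate(right_keys)
        push!(get!(lookup, k, Int[]), j)   # hash table: key => right row indices
    end
    result = Tuple{Int,Union{Int,Missing}}[]
    for (i, k) in enumerate(left_keys)
        if haskey(lookup, k)
            for j in lookup[k]
                push!(result, (i, j))      # one output row per match
            end
        else
            push!(result, (i, missing))    # keep unmatched left rows
        end
    end
    return result
end

# Vector-valued keys, as in the example above, hash without issue:
hash_leftjoin([[1], [2], [3]], [[1], [2], [2], [4]])
```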
Both packages are for data manipulation, but on the surface and internally they are very different.

Internally, the algorithms in IMD are built from scratch for columnar tables and for the way Julia works. Most of these algorithms are home-made to fit some criteria I had in mind, and you won't find them anywhere else. On the surface, I mentioned some differences in the announcement; however, those are just a few of them. I provide more details of IMD's features in its documentation. I tried to keep the syntax of IMD familiar to DataFrames users, but that doesn't mean IMD uses the same syntax as DataFrames: in some places they use a similar name for a function but the syntax is very different, like `filter`; in other places they use a similar name with similar syntax but different options, like `unique`.
The most significant difference from my perspective is that InMemoryDatasets.jl uses the strategy of skipping missing values by default. In contrast to DataFrames.jl, InMemoryDatasets.jl

- skips missing values in aggregation functions over its Dataset types, and
- skips missing values in aggregation functions over all types, by pirating Base's aggregations.
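For readers unfamiliar with the two semantics being contrasted, here is a small Base-Julia illustration (plain Julia, no packages involved):

```julia
x = [1, missing, 3]

# Base semantics: missing propagates through aggregations.
sum(x)               # returns missing

# Skip-missing semantics: missings are dropped before aggregating.
sum(skipmissing(x))  # returns 4
```

IMD makes the second behaviour the default; the objection below is to extending that default to Base's own functions.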
The only thing I would very strongly recommend is to not do this:
Changing the semantics of functions from Base in such a fundamental way is really considered bad practice. It is super confusing for users, and it can introduce the most unfortunate bugs without them ever being aware of it. If I had my way, I would actually not allow registration of packages that do things like that in the general registry.

I think if you aren't happy with the semantics of missings in Base (and I have quite a bit of sympathy for that), you either need to define new functions that behave the way you want or use a different type for missing values that is under your control.
Congratulations! It is a very, very nice package. I was immediately sold by the first feature on your list. As a data scientist, I was avoiding Julia as a first choice due to the lack of practical data manipulation tools, but I guess your package changes everything for me.