[ANN] A new lightning-fast package for data manipulation in pure Julia

I am excited to announce a new package for data manipulation in pure Julia.

Introduction

InMemoryDatasets.jl is a multi-threaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package's performance is tuned with two goals in mind: a) low overhead of allowing missing values everywhere, and b) the following priorities, in order of importance:

  1. Low compilation time
  2. Memory efficiency
  3. High performance

I tried to keep the overall complexity of the package as low as possible to simplify:

  • the maintenance of the package
  • adding new features to the package
  • contributing to the package

Features

InMemoryDatasets.jl has many interesting features; here I highlight some of my favourites (in no particular order):

  • Assigning a named function to a column as its format (see the first sketch after this list)
    • By default, formatted values are used for operations like displaying, sorting, grouping, joining, …
    • Format evaluation is lazy
    • Formats don't change the actual values
  • Multi-threading across the whole package
    • Most functions in InMemoryDatasets.jl exploit all cores available to Julia by default
    • Parallel computation can be disabled by passing the threads = false keyword argument to functions
  • Powerful row-wise operations (sketched below)
    • Support for many common operations
    • Specialised operations for modifying columns
    • Customised row-wise operations for filtering observations (filter simply wraps byrow)
  • Unique approach for reshaping data (sketched below)
    • Unified syntax for all types of reshaping
    • Covers all reshaping operations:
      • stacking and un-stacking of single/multiple columns
      • wide-to-long and long-to-wide reshaping
      • transposing and more
  • Fast sorting algorithms
    • Stable and unstable HeapSort and QuickSort algorithms
    • Count sort for integers
  • Compiler-friendly grouping algorithms (sketched below)
    • groupby!/groupby to group observations using sorting algorithms - sorted order
    • gatherby to group observations using hybrid hash algorithms - observation order
    • incremental grouping for groupby!/groupby, i.e. adding one column at a time
  • Efficient joining algorithms
    • Preserve the order of observations in the left data set
    • Support for two join methods: sort-merge join and hash join
    • Customised columnar-hybrid-hash algorithms for joins
    • Inequality-kind (non-equi) and range joins for innerjoin, contains, semijoin!/semijoin, antijoin!/antijoin
    • closejoin!/closejoin for non-exact-match joins
    • update!/update for updating a master data set with values from a transaction data set
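
To give a flavour of some of these features, here are a few minimal sketches. First, formats; the setformat! call follows the pattern in the documentation, but treat the details as an approximation rather than a reference:

using InMemoryDatasets

# a named function that will serve as the column's format
gender_code(x) = x == 1 ? "Male" : "Female"

ds = Dataset(id = 1:3, sex = [1, 2, 1])
setformat!(ds, :sex => gender_code)   # attach the format to :sex

ds   # displaying, sorting, grouping, ... now use "Male"/"Female",
     # while ds.sex still stores the original 1/2 values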
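
Next, row-wise operations; the filter signature is my recollection of the documented API, so treat it as an assumption:

using InMemoryDatasets

ds = Dataset(x = [1, 2, 3], y = [4, 5, 6])

byrow(ds, sum, [:x, :y])       # row-wise sums -> [5, 7, 9]
byrow(ds, maximum, [:x, :y])   # row-wise maxima -> [4, 5, 6]

filter(ds, :x, by = isodd)     # filter wraps byrow: keep rows where x is odd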
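
Reshaping is unified under transpose; again a sketch, with the id keyword assumed from the documentation:

using InMemoryDatasets

ds = Dataset(id = ["r1", "r2"], a = [1, 2], b = [3, 4])

# the single reshaping verb: transpose the selected columns,
# using the values of :id as the new column names
transpose(ds, [:a, :b], id = :id)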
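
And grouping; the pairs syntax for combine is assumed from the documentation:

using InMemoryDatasets

ds = Dataset(g = [2, 1, 2, 1], v = [10, 20, 30, 40])

combine(groupby(ds, :g), :v => sum)    # sorted order of g
combine(gatherby(ds, :g), :v => sum)   # order in which groups first appear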

Benchmarks

The following benchmarks present preliminary results on the performance of InMemoryDatasets.jl (IMD) compared to some other data manipulation packages. The benchmarks are based on the db-benchmark repository, with a few changes:

  • Except for DataFrames.jl (plus basic familiarity with pandas), I have very limited knowledge of the other solutions listed in db-benchmark. Thus, I simply picked some of the solutions with an easy setup from the top: polars, DataFrames.jl, data.table

  • I report the total time, and use fail when a solution can’t complete a task

  • I use a Linux RHEL 7 machine with 16 cores and 128GB of memory. The machine is not ideal for benchmarking (it is neither dedicated nor isolated); however, the situation is the same for all solutions, and I randomise the order of runs to alleviate the problem. In any case, I have submitted a PR to the db-benchmark project.

  • The results are based on the latest versions of the solutions plus the latest PRs submitted to the db-benchmark project.

  • I only select the advanced questions for the groupby task. The main reason is that IMD uses a less rigorous approach for the basic questions (a matter of optimisation priority).

  • Only data sets with no missing values are considered. The reason is that I initially included pandas and it couldn't handle columns with missing values in groupby. Nevertheless, this has no effect on IMD, since IMD always uses a Union with Missing for column element types.

  • For the join task I increased the machine's memory to 178GB. The extra 50GB makes sure that there is enough space for loading the data in the 50GB case.

The groupby task timing in seconds - smaller is better

Numbers in parentheses are the total time for the second run only (mitigating the compilation time for IMD and DF.jl)

Data         IMD        polars    DF.jl     DT
1e7 - 2e0    9(3)       6(3)      17(6)     136(67)
1e7 - 1e1    8(2)       4(2)      13(5)     32(15)
1e7 - 1e2    9(3)       4(2)      10(3)     10(5)
1e8 - 2e0    60(28)     64(33)    150(73)   1152(586)
1e8 - 1e1    51(23)     48(24)    118(56)   414(198)
1e8 - 1e2    50(22)     41(20)    81(39)    106(52)
1e9 - 2e0    540(267)   fail      fail      fail
1e9 - 1e1    514(258)   fail      fail      3945(1904)
1e9 - 1e2    439(221)   fail      fail      1143(546)

The join task timing in seconds - smaller is better

Data   IMD        polars    DF.jl     DT
1e7    5(1.5)     3(1.5)    8(3)      8(4)
1e8    46(22)     43(23)    91(45)    88(46)
1e9    356(178)   fail      fail      fail

Acknowledgement

I would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since their work gave the development of InMemoryDatasets.jl a head start.

73 Likes

I just had a quick look at your package and have a simple question: is closejoin! the same as an asof join? asof join in pandas finds non-exact matches too.

1 Like

Yes, it is similar to the asof join in pandas. It does a similar job, with a few more options.
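
A minimal sketch of the idea (column names are made up, and the default matching direction is an assumption based on the documentation):

using InMemoryDatasets

trades = Dataset(time = [1, 5, 10], price = [100.0, 101.5, 99.0])
quotes = Dataset(time = [2, 7])

# for every row of quotes, fetch price from the closest preceding trade
# (asof-style); closejoin! does the same in place
closejoin(quotes, trades, on = :time)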

A very nice package. Congratulations!

10 Likes

Thanks @bkamins, it's influenced by nice packages like DataFrames.jl.

2 Likes

Why is the second run faster also for polars, which I think is written in Rust? Is there some caching going on?

This package looks on the surface to be almost a reimplementation of DataFrames.jl. Can you elaborate on why your improvements required a separate package? The basic principles should be the same – both packages deal with general column-oriented tables.

4 Likes

My understanding is that the package:

  1. was a fresh rewrite (EDIT: after reading the source code of the package, it seems it took the DataFrames.jl sources that the creator liked and dropped the parts that were baggage), so it does not carry the "don't break things" baggage that we have in DataFrames.jl.
  2. currently makes more assumptions about what data it can store/process and uses these assumptions in its algorithms (DataFrames.jl is designed to store anything that is valid Julia "as is"). Of course, these restrictions may be lifted in the future.

An example of the second point:

julia> name = Dataset(ID = vcat.([1, 2, 3]), Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
 Row │ ID        Name
     │ identity  identity
     │ Array…?   String?
─────┼─────────────────────
   1 │ [1]       John Doe
   2 │ [2]       Jane Doe
   3 │ [3]       Joe Blogs

julia> job = Dataset(ID = vcat.([1, 2, 2, 4]), Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
 Row │ ID        Job
     │ identity  identity
     │ Array…?   String?
─────┼────────────────────
   1 │ [1]       Lawyer
   2 │ [2]       Doctor
   3 │ [2]       Florist
   4 │ [4]       Farmer

julia> leftjoin(name, job, on = :ID)
ERROR: MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type Integer

julia> leftjoin(DataFrame(name), DataFrame(job), on = :ID)
4×3 DataFrame
 Row │ ID      Name       Job
     │ Array…  String     String?
─────┼────────────────────────────
   1 │ [1]     John Doe   Lawyer
   2 │ [2]     Jane Doe   Doctor
   3 │ [2]     Jane Doe   Florist
   4 │ [3]     Joe Blogs  missing
5 Likes

I don’t understand the internals well enough, but assuming that your point here is that leftjoin in InMemoryDatasets squeezes out extra performance by restricting the valid types of index columns to join on, would you consider this a missing optimization in DataFrames which could be filled in with multiple dispatch providing a "fast path" leftjoin for certain column types?
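
Purely hypothetically, something like this (group_ids and both of its methods are invented for illustration; this is not actual DataFrames.jl code):

# dispatch on the key column's element type to pick a cheaper algorithm
group_ids(col::AbstractVector{<:Integer}) = begin
    lo, _ = extrema(col)              # integer fast path: offset, no hashing
    Int[x - lo + 1 for x in col]
end
group_ids(col::AbstractVector) = begin    # generic fallback: hash the values
    d = Dict{eltype(col),Int}()
    Int[get!(d, x, length(d) + 1) for x in col]
end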

2 Likes

Yes.

6 Likes

I don't know about polars; however, to be clear, the second run is not always faster for polars.

leftjoin uses the sort method by default; for situations where the sort method is not well defined, the user should use the hash method for joining, thus:

julia> leftjoin(name, job, on = :ID, method = :hash)
4×3 Dataset
 Row │ ID        Name       Job
     │ identity  identity   identity
     │ Array…?   String?    String?
─────┼───────────────────────────────
   1 │ [1]       John Doe   Lawyer
   2 │ [2]       Jane Doe   Doctor
   3 │ [2]       Jane Doe   Florist
   4 │ [3]       Joe Blogs  missing
7 Likes

Oh, I must have read it wrong


1 Like

Both packages are for data manipulation, but on the surface and internally they are very different.

Internally: the algorithms in IMD are built from scratch for columnar tables and for the way Julia works. Most of these algorithms are homemade, designed to fit criteria I had in mind, and you won't find them anywhere else.

On the surface: I mentioned some differences in the announcement; however, those are just a few of them. I provide more details of IMD's features in its documentation. I tried to keep IMD's syntax familiar to DataFrames users, but that doesn't mean IMD uses the same syntax as DataFrames; in some places they just share a function name but the syntax is very different, like filter, and in some places they use a similar name with similar syntax but different options, like unique.
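
For example, a rough sketch of the filter difference (treat the exact IMD signature as an assumption):

using DataFrames, InMemoryDatasets

df = DataFrame(x = [1, 2, 3])
ds = Dataset(x = [1, 2, 3])

filter(:x => >(1), df)      # DataFrames.jl: Pair-based predicate
filter(ds, :x, by = >(1))   # IMD: filter wraps byrow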

5 Likes

The most significant difference from my perspective is that InMemoryDatasets.jl uses the strategy of skipping missing values by default. In contrast to DataFrames.jl, InMemoryDatasets.jl

  • skips missing values in aggregation functions over its own Dataset type, and
  • skips missing values in aggregation functions over all types, by pirating Base's aggregations.

For example in Base Julia,

julia> maximum([1,1,missing])
missing

but after `using InMemoryDatasets`:

julia> maximum([1,1,missing])
1

You may find more about how IMD treats missing values in its documentation.

This is a very cool package!

The only thing I would very strongly recommend is to not do this (pirating Base's aggregation functions):

Changing the semantics of functions from Base in such a fundamental way is really considered bad practice. It is super confusing for users, and it can introduce the most unfortunate bugs for users without them ever being aware of it. If I had my way, I would actually not allow registration of packages that do things like that in the general registry :slight_smile:

I think if you aren't happy with the semantics of missing values in Base (and I have quite a bit of sympathy for that), you either need to define new functions that behave the way you want, or use a different type for missing values that is under your control.
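
For example, Base already offers an explicit, non-piratical spelling of the skipping behaviour:

julia> maximum(skipmissing([1, 1, missing]))
1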

34 Likes

Congratulations! It is a very, very nice package. I was immediately sold by the first feature in your list :slight_smile: . As a data scientist I was avoiding Julia as a first choice due to the lack of a practical data manipulation tool, but I guess your package is changing everything for me :pray:

1 Like

I forgot to mention that I love the way you treat missing values, please please :pray: keep it this way, it simplifies my workflow significantly.

3 Likes

What? Could you please elaborate?