I am excited to announce a new package for data manipulation in pure Julia.
Introduction
InMemoryDatasets.jl is a multi-threaded package for data manipulation, designed for Julia 1.6+ (64-bit OS). The core computation engine of the package is a set of customised algorithms developed specifically for columnar tables. The package's performance is tuned with two goals in mind: a) low overhead for allowing missing values everywhere, and b) the following priorities, in order of importance:

- Low compilation time
- Memory efficiency
- High performance

I tried to keep the overall complexity of the package as low as possible, to simplify:

- maintaining the package
- adding new features to the package
- contributing to the package
Features
InMemoryDatasets.jl has many interesting features; here I highlight some of my favourites (in no particular order):

- Assigning a named function to a column as its format
  - By default, formatted values are used for operations like displaying, sorting, grouping, joining, …
  - Format evaluation is lazy
  - Formats don't change the actual values
- Multi-threading across the whole package
  - Most functions in InMemoryDatasets.jl exploit all cores available to Julia by default
  - Parallel computation can be disabled by passing the `threads = false` keyword argument to functions
- Flexible reshaping
  - Stacking and un-stacking on single/multiple columns
  - Wide-to-long and long-to-wide reshaping
  - Transposing and more
- Fast sorting algorithms
  - Stable and unstable HeapSort and QuickSort algorithms
  - Count sort for integers
- Compiler-friendly grouping algorithms
  - `groupby!`/`groupby` to group observations using sorting algorithms - sorted order
  - `gatherby` to group observations using hybrid hash algorithms - observation order
  - Incremental grouping for `groupby!`/`groupby`, i.e. adding one column at a time
- Efficient joining algorithms
  - Preserving the order of observations in the left data set
  - Support for two methods of joining: sort-merge join and hash join
  - Customised columnar-hybrid-hash algorithms for joins
  - Inequality-kind (non-equi) and range joins for `innerjoin`, `contains`, `semijoin!`/`semijoin`, `antijoin!`/`antijoin`
  - `closejoin!`/`closejoin` for non-exact match joins
  - `update!`/`update` for updating a master data set with values from a transaction data set
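To give a flavour of why count sort helps with grouping integer columns, here is a minimal pure-Julia sketch of counting-sort-based grouping. This is an illustration of the general technique only, not IMD's actual implementation; the function name and layout are my own.

```julia
# Group rows by small integer keys in the range lo:hi using counting sort.
# Returns a permutation that orders rows by group, plus the end position of
# each group in the permuted order. Illustrative sketch, not IMD internals.
function group_by_count(keys::Vector{Int}, lo::Int, hi::Int)
    counts = zeros(Int, hi - lo + 1)
    for k in keys
        counts[k - lo + 1] += 1            # one pass to count each key
    end
    offsets = cumsum(counts)               # group end positions in sorted order
    perm = Vector{Int}(undef, length(keys))
    next = [1; offsets[1:end-1] .+ 1]      # next free slot for each group
    for (i, k) in enumerate(keys)
        g = k - lo + 1
        perm[next[g]] = i                  # place row index into its group
        next[g] += 1
    end
    return perm, offsets
end

perm, offsets = group_by_count([3, 1, 2, 1, 3], 1, 3)
# keys[perm] is now grouped: [1, 1, 2, 3, 3]
```

Two linear passes and no comparisons, which is why this beats comparison-based sorting when keys are integers in a known, narrow range.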
Benchmarks
The following benchmarks present preliminary results for InMemoryDatasets.jl (IMD) compared to some other data manipulation packages. The benchmarks are based on the db-benchmark repository, with a few changes:

- Except for DataFrames.jl (and basic familiarity with pandas), I have very limited knowledge of the other solutions listed in db-benchmark. Thus, I simply picked some of the solutions with easy setup from the top: polars, DataFrames.jl, data.table.
- I report the total time, and use "fail" when a solution can't complete a task.
- I use a Linux RHEL7 machine with 16 cores and 128GB of memory. The machine is not ideal for benchmarking (not dedicated or isolated); however, the situation is the same for all solutions, and I randomise the order of runs to alleviate the problem. In any case, I have submitted a PR to the db-benchmark project.
- The results are based on the latest versions of the solutions + the latest PRs submitted to the db-benchmark project.
- I only select the advanced questions for the groupby task. The main reason is that IMD uses a less rigorous approach for the basic questions - a matter of optimisation priority.
- Only data sets with no missing values are considered. The reason for this is that I initially included pandas, and it couldn't handle columns with missing values in groupby. Nevertheless, this has no effect on IMD, since IMD columns always allow `Missing` in their element type.
- For the join task I increase the machine's memory to 178GB. I add 50GB of memory to make sure that there is enough space for loading the data in the 50GB case.
The groupby task timing in seconds - smaller is better. Numbers in parentheses are the total time for the second run only (mitigating the compilation time for IMD and DF.jl).

| Data | IMD | polars | DF.jl | DT |
|------------|----------|--------|---------|------------|
| 1e7 - 2e0 | 9(3) | 6(3) | 17(6) | 136(67) |
| 1e7 - 1e1 | 8(2) | 4(2) | 13(5) | 32(15) |
| 1e7 - 1e2 | 9(3) | 4(2) | 10(3) | 10(5) |
| 1e8 - 2e0 | 60(28) | 64(33) | 150(73) | 1152(586) |
| 1e8 - 1e1 | 51(23) | 48(24) | 118(56) | 414(198) |
| 1e8 - 1e2 | 50(22) | 41(20) | 81(39) | 106(52) |
| 1e9 - 2e0 | 540(267) | fail | fail | fail |
| 1e9 - 1e1 | 514(258) | fail | fail | 3945(1904) |
| 1e9 - 1e2 | 439(221) | fail | fail | 1143(546) |
The join task timing in seconds - smaller is better.

| Data | IMD | polars | DF.jl | DT |
|------|----------|--------|--------|--------|
| 1e7 | 5(1.5) | 3(1.5) | 8(3) | 8(4) |
| 1e8 | 46(22) | 43(23) | 91(45) | 88(46) |
| 1e9 | 356(178) | fail | fail | fail |
Acknowledgement
I would like to acknowledge the contributors to Julia's data ecosystem, especially DataFrames.jl, since the existence of their work gave the development of InMemoryDatasets.jl a head start.
This package looks on the surface to be almost a reimplementation of DataFrames.jl. Can you elaborate on why your improvements required a separate package? The basic principles should be the same: both packages deal with general column-oriented tables.
- It was a fresh re-write (EDIT: after reading the source code of the package, it seems it took the DataFrames.jl sources that the creator liked and dropped the parts that were baggage), so it does not have the baggage of not breaking things that we have in DataFrames.jl.
- It currently makes more assumptions about what data it can store/process and uses these assumptions in the algorithms (DataFrames.jl is designed to store anything that is valid Julia "as is"). Of course, in the future these restrictions may be lifted.
An example of the second point:
```julia
julia> name = Dataset(ID = vcat.([1, 2, 3]), Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
 Row │ ID        Name
     │ identity  identity
     │ Array…?   String?
─────┼─────────────────────
   1 │ [1]       John Doe
   2 │ [2]       Jane Doe
   3 │ [3]       Joe Blogs

julia> job = Dataset(ID = vcat.([1, 2, 2, 4]), Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
 Row │ ID        Job
     │ identity  identity
     │ Array…?   String?
─────┼────────────────────
   1 │ [1]       Lawyer
   2 │ [2]       Doctor
   3 │ [2]       Florist
   4 │ [4]       Farmer

julia> leftjoin(name, job, on = :ID)
ERROR: MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type Integer

julia> leftjoin(DataFrame(name), DataFrame(job), on = :ID)
4×3 DataFrame
 Row │ ID      Name       Job
     │ Array…  String     String?
─────┼────────────────────────────
   1 │ [1]     John Doe   Lawyer
   2 │ [2]     Jane Doe   Doctor
   3 │ [2]     Jane Doe   Florist
   4 │ [3]     Joe Blogs  missing
```
I don't understand the internals well enough, but assuming that your point here is that leftjoin in InMemoryDatasets squeezes out extra performance by restricting the valid types of index columns to join on, would you consider this a missing optimization in DataFrames which could be filled in with multiple dispatch providing a "fast path" leftjoin for certain column types?
leftjoin by default uses the sort method; for situations where the sort method is not well defined, the user should use the hash method for joining, thus:
```julia
julia> leftjoin(name, job, on = :ID, method = :hash)
4×3 Dataset
 Row │ ID        Name       Job
     │ identity  identity   identity
     │ Array…?   String?    String?
─────┼───────────────────────────────
   1 │ [1]       John Doe   Lawyer
   2 │ [2]       Jane Doe   Doctor
   3 │ [2]       Jane Doe   Florist
   4 │ [3]       Joe Blogs  missing
```
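The distinction is that a hash join only requires the keys to support hashing and equality, not ordering, which is why it succeeds here where the sort-based method does not. A minimal pure-Julia sketch of the idea (illustrative only, not IMD's actual algorithm; the function name is my own):

```julia
# Sketch of a hash left join: build a Dict on the right keys, then probe
# with each left key. Keys only need hashing/equality, not isless.
function hash_leftjoin(left_keys, right_keys)
    lookup = Dict{Any,Vector{Int}}()
    for (j, k) in enumerate(right_keys)
        push!(get!(lookup, k, Int[]), j)   # hash table: key => right row indices
    end
    result = Tuple{Int,Union{Int,Missing}}[]
    for (i, k) in enumerate(left_keys)
        if haskey(lookup, k)
            for j in lookup[k]
                push!(result, (i, j))      # one output row per match
            end
        else
            push!(result, (i, missing))    # keep unmatched left rows
        end
    end
    return result
end

# Vector-valued keys, as in the example above, hash without issue:
hash_leftjoin([[1], [2], [3]], [[1], [2], [2], [4]])
```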
Both packages are for data manipulation, but on the surface and internally they are very different.

Internally, the algorithms in IMD are built from scratch for columnar tables and for the way Julia works. Most of these algorithms are home-made to fit some criteria I had in mind, and you won't find them anywhere else. On the surface, I mentioned some differences in the announcement; however, those are just a few of them. I provide more details of IMD's features in its documentation. I tried to keep the syntax of IMD familiar to DataFrames users, but that doesn't mean IMD uses the same syntax as DataFrames: in some places they use a similar name for a function but the syntax is very different, like `filter`; in other places they use a similar name with similar syntax but different options, like `unique`.
The most significant difference from my perspective is that InMemoryDatasets.jl uses the strategy of skipping missing values by default. In contrast to DataFrames.jl, InMemoryDatasets.jl

- skips missing values in aggregation functions over its Dataset types, and
- skips missing values in aggregation functions over all types, by pirating Base's aggregations.
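For readers unfamiliar with the two semantics being contrasted, here is a small Base-Julia illustration (plain Julia, no packages involved):

```julia
x = [1, missing, 3]

# Base semantics: missing propagates through aggregations.
sum(x)               # returns missing

# Skip-missing semantics: missings are dropped before aggregating.
sum(skipmissing(x))  # returns 4
```

IMD makes the second behaviour the default; the objection below is to extending that default to Base's own functions.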
The only thing I would very strongly recommend is to not do this:
Changing the semantics of functions from Base in such a fundamental way is really considered bad practice. It is super confusing for users, and it can introduce the most unfortunate bugs without them ever being aware of it. If I had my way, I would actually not allow registration of packages that do things like that in the general registry.

I think if you aren't happy with the semantics of missings in Base (and I have quite a bit of sympathy for that), you either need to define new functions that behave the way you want or use a different type for missing values that is under your control.
Congratulations! It is a very, very nice package. I was immediately sold by the first feature on your list. As a data scientist, I was avoiding Julia as a first choice due to the lack of practical data manipulation tools, but I guess your package changes everything for me.