[ANN] A new lightning fast package for data manipulation in pure Julia

I don’t understand the internals well enough, but assuming that your point here is that leftjoin in InMemoryDatasets squeezes out extra performance by restricting the valid types of index columns to join on, would you consider this a missing optimization in DataFrames which could be filled in with multiple dispatch providing a “fast path” leftjoin for certain column types?
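
To make the "fast path via multiple dispatch" idea concrete, here is a minimal sketch in plain Julia. It is not how DataFrames or InMemoryDatasets implement joins; `first_match` is a hypothetical helper, and the 10^6 range cutoff is an arbitrary assumption. The generic method hashes any key type, while a more specific method for integer keys uses a direct-address table.

```julia
# Generic path: any hashable key type goes through a Dict.
function first_match(left::AbstractVector, right::AbstractVector)
    lookup = Dict{eltype(right),Int}()
    for (i, k) in enumerate(right)
        get!(lookup, k, i)                      # keep the first row for each key
    end
    return [get(lookup, k, 0) for k in left]    # 0 means "no match"
end

# Fast path, chosen by dispatch when both key columns hold integers:
# a table indexed by (key - minimum + 1) replaces hashing.
function first_match(left::AbstractVector{<:Integer}, right::AbstractVector{<:Integer})
    isempty(right) && return zeros(Int, length(left))
    lo, hi = extrema(right)
    span = big(hi) - big(lo)                    # BigInt avoids overflow for extreme keys
    if span >= 10^6                             # arbitrary cutoff: fall back to hashing
        return invoke(first_match, Tuple{AbstractVector,AbstractVector}, left, right)
    end
    table = zeros(Int, Int(span) + 1)
    for (i, k) in enumerate(right)
        table[k - lo + 1] == 0 && (table[k - lo + 1] = i)
    end
    return [lo <= k <= hi ? table[k - lo + 1] : 0 for k in left]
end

int_result = first_match([1, 2, 3], [1, 2, 2])        # integer fast path
str_result = first_match(["a", "b"], ["b", "c"])      # generic Dict path
println(int_result)
println(str_result)
```

Callers use the same function name in both cases; the compiler picks the method from the column element types, which is the sense in which a fast path can be added without changing the public API.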

Yes.

I don’t know about polars. However, to be clear, the second run is not always faster for polars.

By default, leftjoin uses the sort method. For situations where the sort method is not well defined, the user should use the hash method for joining:

julia> leftjoin(name, job, on = :ID, method = :hash)
4×3 Dataset
 Row │ ID        Name       Job      
     │ identity  identity   identity 
     │ Array…?   String?    String?  
─────┼───────────────────────────────
   1 │ [1]       John Doe   Lawyer
   2 │ [2]       Jane Doe   Doctor
   3 │ [2]       Jane Doe   Florist
   4 │ [3]       Joe Blogs  missing  
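
For readers unfamiliar with the idea behind a hash-based join, the following is a rough sketch in plain Julia. It is not the InMemoryDatasets implementation; `hash_leftjoin` is a hypothetical helper, and the two toy tables are reconstructed from the output above. It shows why array-valued keys are unproblematic for hashing: a `Dict` only needs `hash` and `isequal`, not an ordering.

```julia
# Toy tables as vectors of named tuples; keys are one-element arrays as in the output above
name = [(ID = [1], Name = "John Doe"),
        (ID = [2], Name = "Jane Doe"),
        (ID = [3], Name = "Joe Blogs")]
job  = [(ID = [1], Job = "Lawyer"),
        (ID = [2], Job = "Doctor"),
        (ID = [2], Job = "Florist")]

function hash_leftjoin(left, right)
    # Build phase: map each right-hand key to the rows that carry it
    index = Dict{Any,Vector{Int}}()
    for (i, r) in enumerate(right)
        push!(get!(() -> Int[], index, r.ID), i)
    end
    # Probe phase: look up each left-hand key; unmatched rows get `missing`
    out = NamedTuple[]
    for l in left
        rows = get(index, l.ID, nothing)
        if rows === nothing
            push!(out, (ID = l.ID, Name = l.Name, Job = missing))
        else
            for i in rows
                push!(out, (ID = l.ID, Name = l.Name, Job = right[i].Job))
            end
        end
    end
    return out
end

result = hash_leftjoin(name, job)
foreach(println, result)
```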

Oh, I must have read it wrong

Both packages are for data manipulation, but on the surface and internally they are very different.

Internally: The algorithms in IMD are built from scratch for columnar tables and for the way Julia works. Most of these algorithms are home-made to fit criteria that I had in mind, and you won’t find them anywhere else.
On the surface: I mentioned some differences in the announcement; however, those are just a few of them. I provide more details on IMD’s features in its documentation. I tried to keep the syntax of IMD familiar to DataFrames users, but that doesn’t mean IMD uses the same syntax as DataFrames. In some places they share a function name but the syntax is very different, like filter; in other places they share a name and a similar syntax but have different options, like unique.

The most significant difference from my perspective is that InMemoryDatasets.jl uses the strategy of skipping missing values by default. In contrast to DataFrames.jl, InMemoryDatasets.jl

  • skips missing values in aggregation functions over its Dataset types, and
  • skips missing values in aggregation functions over all types, by pirating Base’s aggregations.

For example in Base Julia,

julia> maximum([1,1,missing])
missing

but after loading the package with using InMemoryDatasets,

julia> maximum([1,1,missing])
1

You may find more about how IMD treats missing values in its documentation.

This is a very cool package!

The only thing I would very strongly recommend is to not do this:

Changing the semantics of functions from Base in such a fundamental way is really considered bad practice. It is super confusing for users, and it can introduce the most unfortunate bugs for users without them ever being aware of it. If I had my way, I would actually not allow registration of packages that do things like that in the general registry.

I think if you aren’t happy with the semantics of Missings in base (and I have quite a bit of sympathy for that), you either need to define new functions that behave the way you want or use a different type for missing values that is under your control.
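
As a small sketch of the first alternative, the snippet below gets the skip-missing result without touching Base's methods. `skipmissing` is part of Base; `imd_maximum` is a hypothetical name for a package-owned function, not an actual InMemoryDatasets API.

```julia
data = [1, 1, missing]

# Base semantics: missing propagates through the aggregation
propagated = maximum(data)

# Option 1: opt in explicitly at the call site with Base's skipmissing
skipped = maximum(skipmissing(data))

# Option 2: a new function owned by the package (hypothetical name)
imd_maximum(x) = maximum(skipmissing(x))

println(propagated)
println(skipped)
println(imd_maximum(data))
```

Either option leaves `maximum` itself unchanged for every other package in the session.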

Congratulations! It is a very, very nice package. I was immediately sold on the first feature in your list. As a data scientist, I was avoiding Julia as a first choice due to the lack of a practical data manipulation tool, but I guess your package is changing everything for me.

I forgot to mention that I love the way you treat missing values. Please, please keep it this way; it simplifies my workflow significantly.

What? Could you please elaborate?

Congratulations @sl-solution! It seems to be a very good package.

Btw, I saw you are using PrettyTables.jl to print the data! Please feel free to ping me if you need some feature or, especially, if I break something. PrettyTables.jl is going through a huge rewrite that will greatly increase its performance (time to print the first table is down by almost 50%). I am trying as hard as I can to avoid breaking changes, but they can happen. I will remember to check the interoperability with your package before I release v2.0.

I think the new release will fix this problem in your comments:

    # Print the table with the selected options.
    # currently pretty_table is very slow for large tables, the workaround is to use only few rows

Of course, printing an entire very large table will always be slow. But printing any table should now be very fast when cropping is enabled:

julia> A = rand(1_000_000, 1_000);

julia> @time pretty_table(A)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬───────────┬───────────┬──────────┬───────────┬──────
│    Col. 1 │   Col. 2 │     Col. 3 │     Col. 4 │    Col. 5 │   Col. 6 │    Col. 7 │    Col. 8 │   Col. 9 │   Col. 10 │   C ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼───────────┼───────────┼──────────┼───────────┼──────
│  0.675934 │ 0.221392 │   0.859381 │ 0.00201495 │  0.656669 │ 0.419674 │  0.116045 │  0.555897 │ 0.189247 │  0.552384 │  0. ⋯
│   0.61972 │ 0.964157 │   0.543965 │  0.0924698 │  0.408849 │  0.22149 │  0.801567 │  0.273067 │ 0.185251 │  0.670841 │  0. ⋯
│ 0.0341166 │ 0.550614 │    0.62682 │  0.0991155 │  0.435398 │ 0.676617 │  0.109501 │  0.620581 │  0.92127 │  0.560164 │  0. ⋯
│   0.86105 │ 0.587744 │    0.25295 │   0.342427 │  0.602571 │ 0.524927 │  0.893778 │  0.925155 │ 0.571104 │  0.736807 │ 0.0 ⋯
│  0.435085 │ 0.178483 │   0.596313 │   0.488782 │  0.104792 │ 0.994904 │   0.08668 │  0.302552 │ 0.099019 │  0.448827 │   0 ⋯
│  0.658401 │ 0.106824 │ 0.00276253 │   0.447873 │ 0.0350634 │ 0.800669 │  0.215574 │  0.375465 │  0.11485 │  0.661147 │  0. ⋯
│  0.815405 │  0.22639 │   0.585754 │   0.129567 │ 0.0261965 │  0.58881 │  0.575382 │  0.811007 │ 0.380854 │  0.890361 │     ⋯
│ 0.0420148 │ 0.917764 │   0.621537 │   0.605215 │ 0.0492217 │ 0.182624 │  0.370627 │  0.226672 │ 0.597551 │  0.387021 │  0. ⋯
│     ⋮     │    ⋮     │     ⋮      │     ⋮      │     ⋮     │    ⋮     │     ⋮     │     ⋮     │    ⋮     │     ⋮     │     ⋱
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴───────────┴───────────┴──────────┴───────────┴──────
                                                                                           990 columns and 999992 rows omitted
  0.001633 seconds (11.27 k allocations: 568.234 KiB)

julia> @time pretty_table(A, vcrop_mode = :middle)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────────
│    Col. 1 │   Col. 2 │     Col. 3 │     Col. 4 │    Col. 5 │   Col. 6 │   Col. 7 │   Col. 8 │   Col. 9 │  Col. 10 │   Col. ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────
│  0.675934 │ 0.221392 │   0.859381 │ 0.00201495 │  0.656669 │ 0.419674 │ 0.116045 │ 0.555897 │ 0.189247 │ 0.552384 │  0.478 ⋯
│   0.61972 │ 0.964157 │   0.543965 │  0.0924698 │  0.408849 │  0.22149 │ 0.801567 │ 0.273067 │ 0.185251 │ 0.670841 │  0.171 ⋯
│ 0.0341166 │ 0.550614 │    0.62682 │  0.0991155 │  0.435398 │ 0.676617 │ 0.109501 │ 0.620581 │  0.92127 │ 0.560164 │  0.854 ⋯
│   0.86105 │ 0.587744 │    0.25295 │   0.342427 │  0.602571 │ 0.524927 │ 0.893778 │ 0.925155 │ 0.571104 │ 0.736807 │ 0.0752 ⋯
│     ⋮     │    ⋮     │     ⋮      │     ⋮      │     ⋮     │    ⋮     │    ⋮     │    ⋮     │    ⋮     │    ⋮     │     ⋮  ⋱
│  0.874067 │ 0.689295 │   0.969623 │   0.940648 │  0.932225 │ 0.769949 │ 0.394852 │ 0.600234 │ 0.740254 │  0.36743 │  0.870 ⋯
│  0.489394 │  0.19652 │   0.881438 │    0.29382 │  0.890437 │ 0.330823 │ 0.139547 │ 0.814829 │ 0.769702 │ 0.584777 │  0.882 ⋯
│   0.93291 │ 0.204729 │   0.236622 │  0.0458418 │  0.251297 │ 0.815881 │ 0.404949 │ 0.303269 │ 0.749317 │ 0.827221 │  0.893 ⋯
│  0.587404 │ 0.911563 │   0.193175 │   0.153903 │  0.638026 │ 0.426905 │ 0.358063 │ 0.860344 │ 0.108626 │ 0.651241 │  0.444 ⋯
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────
                                                                                           990 columns and 999992 rows omitted
  0.001556 seconds (11.50 k allocations: 571.672 KiB)

(Note that these timings are from the second run of each command.)

I have been monitoring Julia’s DataFrames package for quite a while, but whenever I wanted to use it for a project I hit a lack-of-features wall. For example, I frequently need to pivot_long_to_wide or vice versa, but nothing was available in DataFrames. Another functionality I often need is a non-equi join of data frames, which wasn’t there either. I’m happy to see both of them in this announcement. To be frank, for me a practical solution with more features is more important than an obsession with speed or abstractness.

Many thanks for the “home made” algorithms and implementation. IMHO it is good for the community.

For now there is no need to worry about the changes to several Base functions when missing is involved: given its long history of testing (and discussion, etc.), DataFrames.jl will still be the #1 choice for most (potential) users.

I’ll set aside some free time to learn/test this new package. So thanks again.

The main problem with piracy isn’t really that it is surprising to the immediate users, since they can always read the docs (assuming it is mentioned there); the problem is that it is surprising to users who don’t know they are using the package when it is a dependency of a dependency of a dependency etc. Currently, if any package in a user’s full dependency tree uses InMemoryDatasets, then the behavior of Base functions changes everywhere – and they very well might not know they are depending on InMemoryDatasets! (Consider the user who hasn’t seen this Discourse thread, for example). This could cause all sorts of bugs, since other code (and other packages) will expect those functions to not have been pirated. In other words, it’s non-composable.
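
The effect can be reproduced in a few lines. This is only an illustration under stated assumptions: `has_gap` stands in for unrelated downstream code, and `PirateDemo` is a hypothetical module whose pirated method is written for this demonstration, not copied from InMemoryDatasets.

```julia
# Hypothetical downstream code that relies on Base semantics:
# it treats a `missing` maximum as a sign of incomplete data.
has_gap(v) = ismissing(maximum(v))

v = Union{Missing,Int}[1, 1, missing]
before = has_gap(v)      # Base semantics: maximum propagates missing

# Hypothetical package somewhere in the dependency tree
module PirateDemo
    # Piracy: neither the function nor the argument type belongs to this module
    Base.maximum(x::Vector{Union{Missing,Int}}) = maximum(skipmissing(x))
end

after = has_gap(v)       # same call, same data, different answer
println((before, after))
```

Nothing in `has_gap` changed between the two calls; merely loading the module altered its result, which is the non-composability described above.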

I think this makes sense given the wide-ranging consequences of type piracy, especially if we had a reliable and robust check. File an issue in RegistryCI?

A very nice package indeed, congrats!

The DataFrames.jl package is one of the few packages in Julia that I frequently use, and honestly is the reason that I started using Julia in the first place. I am very glad to see now there are two packages in Julia that fit my usage.

I occasionally answer questions related to DataFrames in this forum, and I will probably use your package in future answers when it is appropriate. Actually, right now, I’m reading your rather comprehensive package documentation and simply enjoying it.

I admire your courage to start working on such a general type of package with so many competitors out there.

PS, the benchmarks are outstanding and I want to thank you personally for beating polars and data.table in one shot!

I am also reading the documentation now.

I wonder whether there is a benchmark on datasets with around a few thousand rows and without multithreading, since we do not always work with big data. Will there be significant performance deterioration?

Also, is there a way to skip the warning below?

┌ Warning: Julia started with single thread, to enable multithreaded functionalities in InMemoryDatasets.jl start Julia with multiple threads.

You can start Julia with multiple threads by setting the environmental variable JULIA_NUM_THREADS, more on this is available here and here.

I mean, I normally start Julia with a single thread, so I do not want to see this warning.
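
One generic possibility, sketched here with the Logging standard library, is to run the noisy code under a logger whose minimum level is above `Warn`. The `noisy_init` function is a hypothetical stand-in for a package that warns on load. Whether this hides the InMemoryDatasets message depends on the package emitting it through the standard `@warn` macro, which the `┌ Warning:` prefix suggests but which is an assumption here.

```julia
using Logging

# Stand-in for code that warns when it starts (hypothetical, not the package's code)
function noisy_init()
    @warn "Julia started with single thread"
    return :loaded
end

# Only messages at Error level or above reach this logger
quiet_logger = ConsoleLogger(stderr, Logging.Error)

result = with_logger(quiet_logger) do
    noisy_init()
end
println(result)
```

`Threads.nthreads()` reports how many threads the current session was started with, which is a quick way to confirm the situation the warning describes.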