[ANN] A new lightning fast package for data manipulation in pure Julia

Congratulations @sl-solution ! It seems a very good package :slight_smile:

Btw, I saw you are using PrettyTables.jl to print the data! Please feel free to ping me if you need some feature or, especially, if I break something :smiley: PrettyTables.jl is undergoing a major rewrite that will greatly improve its performance (the time to print the first table is down by almost 50%). I am trying as hard as I can to avoid breaking changes, but they can happen. I will remember to check the interoperability with your package before I release v2.0.

I think the new release will fix this problem in your comments:

    # Print the table with the selected options.
    # currently pretty_table is very slow for large tables, the workaround is to use only few rows

Of course, printing a very big table in its entirety will always be slow. But printing any table should now be very fast when cropping is enabled:

julia> A = rand(1_000_000, 1_000);

julia> @time pretty_table(A)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬───────────┬───────────┬──────────┬───────────┬──────
│    Col. 1 │   Col. 2 │     Col. 3 │     Col. 4 │    Col. 5 │   Col. 6 │    Col. 7 │    Col. 8 │   Col. 9 │   Col. 10 │   C ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼───────────┼───────────┼──────────┼───────────┼──────
│  0.675934 │ 0.221392 │   0.859381 │ 0.00201495 │  0.656669 │ 0.419674 │  0.116045 │  0.555897 │ 0.189247 │  0.552384 │  0. ⋯
│   0.61972 │ 0.964157 │   0.543965 │  0.0924698 │  0.408849 │  0.22149 │  0.801567 │  0.273067 │ 0.185251 │  0.670841 │  0. ⋯
│ 0.0341166 │ 0.550614 │    0.62682 │  0.0991155 │  0.435398 │ 0.676617 │  0.109501 │  0.620581 │  0.92127 │  0.560164 │  0. ⋯
│   0.86105 │ 0.587744 │    0.25295 │   0.342427 │  0.602571 │ 0.524927 │  0.893778 │  0.925155 │ 0.571104 │  0.736807 │ 0.0 ⋯
│  0.435085 │ 0.178483 │   0.596313 │   0.488782 │  0.104792 │ 0.994904 │   0.08668 │  0.302552 │ 0.099019 │  0.448827 │   0 ⋯
│  0.658401 │ 0.106824 │ 0.00276253 │   0.447873 │ 0.0350634 │ 0.800669 │  0.215574 │  0.375465 │  0.11485 │  0.661147 │  0. ⋯
│  0.815405 │  0.22639 │   0.585754 │   0.129567 │ 0.0261965 │  0.58881 │  0.575382 │  0.811007 │ 0.380854 │  0.890361 │     ⋯
│ 0.0420148 │ 0.917764 │   0.621537 │   0.605215 │ 0.0492217 │ 0.182624 │  0.370627 │  0.226672 │ 0.597551 │  0.387021 │  0. ⋯
│     ⋮     │    ⋮     │     ⋮      │     ⋮      │     ⋮     │    ⋮     │     ⋮     │     ⋮     │    ⋮     │     ⋮     │     ⋱
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴───────────┴───────────┴──────────┴───────────┴──────
                                                                                           990 columns and 999992 rows omitted
  0.001633 seconds (11.27 k allocations: 568.234 KiB)

julia> @time pretty_table(A, vcrop_mode = :middle)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────────
│    Col. 1 │   Col. 2 │     Col. 3 │     Col. 4 │    Col. 5 │   Col. 6 │   Col. 7 │   Col. 8 │   Col. 9 │  Col. 10 │   Col. ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────
│  0.675934 │ 0.221392 │   0.859381 │ 0.00201495 │  0.656669 │ 0.419674 │ 0.116045 │ 0.555897 │ 0.189247 │ 0.552384 │  0.478 ⋯
│   0.61972 │ 0.964157 │   0.543965 │  0.0924698 │  0.408849 │  0.22149 │ 0.801567 │ 0.273067 │ 0.185251 │ 0.670841 │  0.171 ⋯
│ 0.0341166 │ 0.550614 │    0.62682 │  0.0991155 │  0.435398 │ 0.676617 │ 0.109501 │ 0.620581 │  0.92127 │ 0.560164 │  0.854 ⋯
│   0.86105 │ 0.587744 │    0.25295 │   0.342427 │  0.602571 │ 0.524927 │ 0.893778 │ 0.925155 │ 0.571104 │ 0.736807 │ 0.0752 ⋯
│     ⋮     │    ⋮     │     ⋮      │     ⋮      │     ⋮     │    ⋮     │    ⋮     │    ⋮     │    ⋮     │    ⋮     │     ⋮  ⋱
│  0.874067 │ 0.689295 │   0.969623 │   0.940648 │  0.932225 │ 0.769949 │ 0.394852 │ 0.600234 │ 0.740254 │  0.36743 │  0.870 ⋯
│  0.489394 │  0.19652 │   0.881438 │    0.29382 │  0.890437 │ 0.330823 │ 0.139547 │ 0.814829 │ 0.769702 │ 0.584777 │  0.882 ⋯
│   0.93291 │ 0.204729 │   0.236622 │  0.0458418 │  0.251297 │ 0.815881 │ 0.404949 │ 0.303269 │ 0.749317 │ 0.827221 │  0.893 ⋯
│  0.587404 │ 0.911563 │   0.193175 │   0.153903 │  0.638026 │ 0.426905 │ 0.358063 │ 0.860344 │ 0.108626 │ 0.651241 │  0.444 ⋯
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────
                                                                                           990 columns and 999992 rows omitted
  0.001556 seconds (11.50 k allocations: 571.672 KiB)

(Note that these timings are from the second run of each command, so compilation time is not included.)

13 Likes

I have been following Julia’s DataFrames package for quite a while, but every time I wanted to use it for a project I hit a lack-of-features wall. For example, I frequently need to pivot_long_to_wide or vice versa, but nothing was available in DataFrames. Another feature I often need is non-equi joins of data frames, which wasn’t there either. I’m happy to see both of them in this announcement. To be frank, for me a practical solution with more features is more important than an obsession with speed or abstractness.

5 Likes

Many thanks for the “home made” algorithms and implementation. IMHO this is good for the community.

For now there is no need to worry about the package changing several Base functions when missing is involved: given its long history of testing (and discussion, etc.), DataFrames.jl will still be the #1 choice for most (potential) users.

I’ll set aside some free time to learn and test this new package. So thanks again.

1 Like

The main problem with piracy isn’t really that it is surprising to the immediate users, since they can always read the docs (assuming it is mentioned there); the problem is that it is surprising to users who don’t know they are using the package when it is a dependency of a dependency of a dependency etc. Currently, if any package in a user’s full dependency tree uses InMemoryDatasets, then the behavior of Base functions changes everywhere – and they very well might not know they are depending on InMemoryDatasets! (Consider the user who hasn’t seen this Discourse thread, for example.) This could cause all sorts of bugs, since other code (and other packages) will expect those functions to not have been pirated. In other words, it’s non-composable.
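
To make the composability point concrete, here is a minimal sketch of type piracy. The module name `HypotheticalPirate` and the pirated method are made up for illustration and are not taken from InMemoryDatasets; the only point is that a method added to a Base function for Base types becomes visible to all code in the session:

```julia
# A vector that contains a missing value.
v = Union{Missing,Int}[1, missing, 2]

# Stock Base behavior: missing propagates through sum.
before = sum(v)

# Hypothetical package (illustration only): it owns neither `sum` nor
# `Vector{Union{Missing,Int}}`, so this method definition is type piracy.
module HypotheticalPirate
Base.sum(x::Vector{Union{Missing,Int}}) = sum(skipmissing(x))
end

# The same call now skips the missing value for every caller in the
# session, including code that never loaded HypotheticalPirate on purpose.
after = sum(v)

println("before: ", before, ", after: ", after)
```

Here `before` is `missing` while `after` is `3`, even though the calling code did not change. Any other package that relies on `sum` propagating `missing` would be affected in the same way.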

I think this makes sense given the wide-ranging consequences of type piracy, especially if we had a reliable and robust check. File an issue in RegistryCI?

23 Likes

A very nice package indeed, congrats!

The DataFrames.jl package is one of the few packages in Julia that I frequently use, and honestly is the reason that I started using Julia in the first place. I am very glad to see now there are two packages in Julia that fit my usage.

I occasionally answer questions related to DataFrames in this forum, and I will probably use your package in future answers when it is appropriate. Actually, right now, I’m reading your rather comprehensive package documentation and simply enjoying it.

I admire your courage in starting work on such a general type of package with so many competitors out there.

PS, the benchmarks are outstanding and I want to thank you personally for beating polars and data.table in one shot!

7 Likes

I am also reading the documentation now.

I wonder whether there is a benchmark on datasets with only a few thousand rows and without multithreading, since we are not always working with big data. Would there be significant performance deterioration?

Also, is there a way to skip the warning below?

┌ Warning: Julia started with single thread, to enable multithreaded functionalities in InMemoryDatasets.jl start Julia with multiple threads.

You can start Julia with multiple threads by setting the environmental variable JULIA_NUM_THREADS, more on this is available here and here.

I normally start Julia with a single thread, so I do not want to see this warning.

I just tried the example from the documentation, but it seems that byrow does not use multithreading and there is no efficiency gain.

julia> Threads.nthreads()
8

julia> using InMemoryDatasets, BenchmarkTools

julia> ds = Dataset(rand(10^5, 100), :auto);

julia> m = Matrix(ds);

julia> @btime byrow(ds, sum, 1:100);
22.097 ms (171 allocations: 889.34 KiB)

julia> @btime sum(m, dims = 2);
15.696 ms (6 allocations: 879.06 KiB)

julia> @btime byrow(ds, sum, 1:100, threads = true);
22.181 ms (168 allocations: 889.17 KiB)

Thanks for your support. PrettyTables.jl is great! I am looking forward to the next big release. I have just one issue that you may be able to help with: printing a view of a large data set. In the current release it is very slow, and I ended up with a silly workaround.

4 Likes

I pushed a commit to the master branch. Now you can set the environment variable IMD_WARN_THREADS to 0 to suppress the warning.
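
A minimal sketch of setting it from within Julia, assuming the variable is checked when the package is loaded (so it has to be set before `using InMemoryDatasets`); exporting it in the shell before starting Julia should work just as well:

```julia
# Set the variable before loading the package.
# Assumption: the package checks it at load time.
ENV["IMD_WARN_THREADS"] = "0"

# Load the package afterwards (left commented so the snippet runs
# even when the package is not installed):
# using InMemoryDatasets
```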

3 Likes

This is strange, which OS?

On Win10 and WSL2.

This is strange because I would expect many more allocations for this example. I don’t have access to a Windows machine, though. Would you mind continuing this issue on GitHub?

The performance benchmarks of InMemoryDatasets look great! It would also be useful to extend the size range down to small tables, to see which packages are feasible for use in tight loops.

Also, nice to see different API syntaxes tried out. I personally still like Base Julia map/filter/... more, but choice is good anyway.

I’m curious about the design decision to create both a new data structure and functions that work on it (and only on it). Why not define the functions for one of the already existing table types instead, such as StructArray/TypedTable/...? Many of them are column-based, and the same performance could be achieved.
The same question could potentially be asked about DataFrames.jl, but the landscape was quite different when it was created, and this may be a piece of historical baggage (or maybe not).

Done.

1 Like

IMD is designed for data scientists and contains a set of functions that are very useful for data manipulation, wrangling, etc. In general, I like the syntax of the Base functions, but when I am working with tabular data some of that syntax is not intuitive, and IMD has taken the bold approach of modifying it to fit a data manipulation workflow; see the transpose function, for example.

You can track or contribute to this issue on GitHub.

6 Likes

The fix is uploaded to the master branch.

12 Likes

Congratulations!
I have posted a question on Stack Overflow. Can your package’s row-wise operations help to solve it? Thanks.

stackoverflow - question