Yes, it is similar to the asof join in pandas. It does a similar job with a few more options.
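For readers unfamiliar with the term, here is a minimal plain-Julia sketch of what a "backward" as-of match does: each left key is paired with the last right key that is less than or equal to it. The helper name asof_lookup is hypothetical and is not part of InMemoryDatasets.jl or pandas; it only illustrates the idea and assumes the right keys are already sorted.

```julia
# Hypothetical helper, for illustration only.
# For each left key, return the right value whose key is the last one
# less than or equal to it, or `missing` when no such key exists.
function asof_lookup(left_keys, right_keys, right_vals)
    # Assumption: right_keys is sorted in ascending order.
    map(left_keys) do k
        i = searchsortedlast(right_keys, k)  # 0 when every right key is > k
        i == 0 ? missing : right_vals[i]
    end
end

matches = asof_lookup([1, 5, 10], [2, 4, 9], ["a", "b", "c"])
```

Here the key 1 has no earlier right key, 5 matches the right key 4, and 10 matches the right key 9.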
A very nice package. Congratulations!
Why is the second run faster also for polars, which I think is written in Rust? Is there some caching going on?
This package looks on the surface to be almost a reimplementation of DataFrames.jl. Can you elaborate on why your improvements required a separate package? The basic principles should be the same – both packages deal with general column-oriented tables.
My understanding is that the package:
- was a fresh rewrite (EDIT: after reading the package’s source code, it seems it took the parts of the DataFrames.jl sources that the creator liked and dropped the parts that were baggage), so it does not carry the baggage of having to avoid breaking things that we have in DataFrames.jl.
- currently makes more assumptions about what data it can store/process and uses these assumptions in its algorithms (DataFrames.jl is designed to store anything that is valid Julia “as is”). Of course, these restrictions may be lifted in the future.
An example of the second point:
julia> name = Dataset(ID = vcat.([1, 2, 3]), Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 Dataset
Row │ ID Name
│ identity identity
│ Array…? String?
─────┼─────────────────────
1 │ [1] John Doe
2 │ [2] Jane Doe
3 │ [3] Joe Blogs
julia> job = Dataset(ID = vcat.([1, 2, 2, 4]), Job = ["Lawyer", "Doctor", "Florist", "Farmer"])
4×2 Dataset
Row │ ID Job
│ identity identity
│ Array…? String?
─────┼────────────────────
1 │ [1] Lawyer
2 │ [2] Doctor
3 │ [2] Florist
4 │ [4] Farmer
julia> leftjoin(name, job, on = :ID)
ERROR: MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type Integer
julia> leftjoin(DataFrame(name), DataFrame(job), on = :ID)
4×3 DataFrame
Row │ ID Name Job
│ Array… String String?
─────┼────────────────────────────
1 │ [1] John Doe Lawyer
2 │ [2] Jane Doe Doctor
3 │ [2] Jane Doe Florist
4 │ [3] Joe Blogs missing
I don’t understand the internals well enough, but assuming that your point here is that leftjoin in InMemoryDatasets squeezes out extra performance by restricting the valid types of index columns to join on, would you consider this a missing optimization in DataFrames which could be filled in with multiple dispatch providing a “fast path” leftjoin for certain column types?
Yes.
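As a rough illustration of the "fast path" idea from the question above, multiple dispatch can select an algorithm from the element type of the key column. This is only a sketch: join_method is a hypothetical name, and nothing here reflects how DataFrames.jl or InMemoryDatasets.jl are actually implemented.

```julia
# Hypothetical names throughout; not DataFrames.jl or InMemoryDatasets.jl code.

# Fast path: key columns of integers (optionally with missing values) are
# assumed here to suit a specialised sort-based join.
join_method(::AbstractVector{<:Union{Missing,Integer}}) = :sort

# Fallback: any other element type (for example vectors of vectors)
# goes through a generic hash-based join.
join_method(::AbstractVector) = :hash

int_keys = join_method([1, 2, 3])            # dispatches to the fast path
vec_keys = join_method(vcat.([1, 2, 3]))     # dispatches to the fallback
```

The more specific method wins automatically, so callers never have to choose the algorithm themselves.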
I don’t know about polars; however, to be clear, the second run is not always faster for polars.
leftjoin uses the sort method by default. In situations where the sort method is not well defined, the user should use the hash method for joining, thus:
julia> leftjoin(name, job, on = :ID, method = :hash)
4×3 Dataset
Row │ ID Name Job
│ identity identity identity
│ Array…? String? String?
─────┼───────────────────────────────
1 │ [1] John Doe Lawyer
2 │ [2] Jane Doe Doctor
3 │ [2] Jane Doe Florist
4 │ [3] Joe Blogs missing
Oh, I must have read it wrong
Both packages are for data manipulation, but on the surface and internally they are very different.
Internally: the algorithms in IMD are built from scratch for columnar tables and the way Julia works. Most of these algorithms are home-made to fit criteria that I had in mind, and you won’t find them anywhere else.
On the surface: I mentioned some differences in the announcement, but those are just a few of them. I provide more details of IMD’s features in its documentation. I tried to keep the syntax of IMD familiar to DataFrames users, but that doesn’t mean IMD uses the same syntax as DataFrames; in some places they use a similar function name with very different syntax, like filter, and in others they use a similar name and similar syntax but different options, like unique.
The most significant difference from my perspective is that InMemoryDatasets.jl uses the strategy of skipping missing values by default. In contrast to DataFrames.jl, InMemoryDatasets.jl
- skips missing values in aggregation functions over its Dataset types, and
- skips missing values in aggregation functions over all types, by pirating Base’s aggregations.
For example in Base Julia,
julia> maximum([1,1,missing])
missing
but after using InMemoryDatasets,
julia> maximum([1,1,missing])
1
This is a very cool package!
The only thing I would very strongly recommend is to not do this:
Changing the semantics of functions from Base in such a fundamental way is really considered bad practice. It is super confusing for users, and it can introduce the most unfortunate bugs for users without them ever being aware of it. If I had my way, I would actually not allow registration of packages that do things like that in the general registry.
I think if you aren’t happy with the semantics of Missings in Base (and I have quite a bit of sympathy for that), you either need to define new functions that behave the way you want or use a different type for missing values that is under your control.
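To make the first alternative concrete, here is a minimal sketch of a separately named function that skips missing values while leaving Base.maximum untouched. The name max_skipmissing is hypothetical; it is not something InMemoryDatasets.jl provides.

```julia
# Hypothetical helper; Base.maximum keeps its usual semantics.
function max_skipmissing(xs)
    vals = skipmissing(xs)
    # When every element is missing there is nothing to compare,
    # so return `missing` instead of throwing on an empty collection.
    isempty(vals) ? missing : maximum(vals)
end

base_result = maximum([1, 1, missing])          # Base propagates missing
skip_result = max_skipmissing([1, 1, missing])  # the new function skips it
```

Because the skipping behaviour lives under its own name, code in other packages that relies on Base's propagation of missing is unaffected.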
Congratulations! It is a very, very nice package. I was immediately sold on the first feature in your list. As a data scientist, I was avoiding Julia as a first choice due to the lack of a practical data manipulation tool, but I guess your package is changing everything for me.
I forgot to mention that I love the way you treat missing values. Please, please keep it this way; it simplifies my workflow significantly.
What? Could you please elaborate?
Congratulations @sl-solution! It seems to be a very good package.
Btw, I saw you are using PrettyTables.jl to print the data! Please feel free to ping me if you need some feature or, especially, if I break something. PrettyTables.jl is going through a huge rewrite that will greatly increase its performance (time to print the first table is down by almost 50%). I am trying as hard as I can to avoid breaking changes, but it can happen. I will remember to check the interoperability with your package before I release v2.0.
I think the new release will fix this problem in your comments:
# Print the table with the selected options.
# currently pretty_table is very slow for large tables, the workaround is to use only few rows
Of course it will always be slow when printing the entire very big table. But it should now be very fast printing any table when cropping is enabled:
julia> A = rand(1_000_000, 1_000);
julia> @time pretty_table(A)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬───────────┬───────────┬──────────┬───────────┬──────
│ Col. 1 │ Col. 2 │ Col. 3 │ Col. 4 │ Col. 5 │ Col. 6 │ Col. 7 │ Col. 8 │ Col. 9 │ Col. 10 │ C ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼───────────┼───────────┼──────────┼───────────┼──────
│ 0.675934 │ 0.221392 │ 0.859381 │ 0.00201495 │ 0.656669 │ 0.419674 │ 0.116045 │ 0.555897 │ 0.189247 │ 0.552384 │ 0. ⋯
│ 0.61972 │ 0.964157 │ 0.543965 │ 0.0924698 │ 0.408849 │ 0.22149 │ 0.801567 │ 0.273067 │ 0.185251 │ 0.670841 │ 0. ⋯
│ 0.0341166 │ 0.550614 │ 0.62682 │ 0.0991155 │ 0.435398 │ 0.676617 │ 0.109501 │ 0.620581 │ 0.92127 │ 0.560164 │ 0. ⋯
│ 0.86105 │ 0.587744 │ 0.25295 │ 0.342427 │ 0.602571 │ 0.524927 │ 0.893778 │ 0.925155 │ 0.571104 │ 0.736807 │ 0.0 ⋯
│ 0.435085 │ 0.178483 │ 0.596313 │ 0.488782 │ 0.104792 │ 0.994904 │ 0.08668 │ 0.302552 │ 0.099019 │ 0.448827 │ 0 ⋯
│ 0.658401 │ 0.106824 │ 0.00276253 │ 0.447873 │ 0.0350634 │ 0.800669 │ 0.215574 │ 0.375465 │ 0.11485 │ 0.661147 │ 0. ⋯
│ 0.815405 │ 0.22639 │ 0.585754 │ 0.129567 │ 0.0261965 │ 0.58881 │ 0.575382 │ 0.811007 │ 0.380854 │ 0.890361 │ ⋯
│ 0.0420148 │ 0.917764 │ 0.621537 │ 0.605215 │ 0.0492217 │ 0.182624 │ 0.370627 │ 0.226672 │ 0.597551 │ 0.387021 │ 0. ⋯
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋱
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴───────────┴───────────┴──────────┴───────────┴──────
990 columns and 999992 rows omitted
0.001633 seconds (11.27 k allocations: 568.234 KiB)
julia> @time pretty_table(A, vcrop_mode = :middle)
┌───────────┬──────────┬────────────┬────────────┬───────────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────────
│ Col. 1 │ Col. 2 │ Col. 3 │ Col. 4 │ Col. 5 │ Col. 6 │ Col. 7 │ Col. 8 │ Col. 9 │ Col. 10 │ Col. ⋯
├───────────┼──────────┼────────────┼────────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────
│ 0.675934 │ 0.221392 │ 0.859381 │ 0.00201495 │ 0.656669 │ 0.419674 │ 0.116045 │ 0.555897 │ 0.189247 │ 0.552384 │ 0.478 ⋯
│ 0.61972 │ 0.964157 │ 0.543965 │ 0.0924698 │ 0.408849 │ 0.22149 │ 0.801567 │ 0.273067 │ 0.185251 │ 0.670841 │ 0.171 ⋯
│ 0.0341166 │ 0.550614 │ 0.62682 │ 0.0991155 │ 0.435398 │ 0.676617 │ 0.109501 │ 0.620581 │ 0.92127 │ 0.560164 │ 0.854 ⋯
│ 0.86105 │ 0.587744 │ 0.25295 │ 0.342427 │ 0.602571 │ 0.524927 │ 0.893778 │ 0.925155 │ 0.571104 │ 0.736807 │ 0.0752 ⋯
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ ⋱
│ 0.874067 │ 0.689295 │ 0.969623 │ 0.940648 │ 0.932225 │ 0.769949 │ 0.394852 │ 0.600234 │ 0.740254 │ 0.36743 │ 0.870 ⋯
│ 0.489394 │ 0.19652 │ 0.881438 │ 0.29382 │ 0.890437 │ 0.330823 │ 0.139547 │ 0.814829 │ 0.769702 │ 0.584777 │ 0.882 ⋯
│ 0.93291 │ 0.204729 │ 0.236622 │ 0.0458418 │ 0.251297 │ 0.815881 │ 0.404949 │ 0.303269 │ 0.749317 │ 0.827221 │ 0.893 ⋯
│ 0.587404 │ 0.911563 │ 0.193175 │ 0.153903 │ 0.638026 │ 0.426905 │ 0.358063 │ 0.860344 │ 0.108626 │ 0.651241 │ 0.444 ⋯
└───────────┴──────────┴────────────┴────────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────
990 columns and 999992 rows omitted
0.001556 seconds (11.50 k allocations: 571.672 KiB)
(Note that these timings are from the second run of each command.)
I have been monitoring Julia’s dataframes package for quite a while, but any time I wanted to use it for a project I hit a lack-of-features wall. For example, I frequently need to pivot from long to wide or vice versa, but nothing was available in dataframes. Another functionality I often need is a non-equi join of dataframes, which wasn’t there either. I’m happy to see both of them in this announcement. To be frank, for me a practical solution with more features is more important than an obsession with speed or abstractness.