Julia's in.() seems slow compared to R's %in%

I thought one of the key ideas of the new broadcasting semantics was to allow specialized methods for broadcasted functions? I’d think caching something (such as a hash table) would be a classic example of such a specialization.

It certainly can be done, but the question is whether it should be. If you think it should, open an issue on GitHub so we can discuss it with the people there who might have opinions.

2 Likes

We construct a Set internally in indexin and findall(in(y), x). We just don’t happen to have a function that returns a boolean vector for this. One of the issues with the “vectorized” approach is that it’s hard to have a function for every combination of input and output formats you might need.
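
For concreteness, a quick sketch of what those existing functions return, and how a boolean mask can be assembled by hand (example data made up for illustration):

x = [2, 5, 7, 2]; y = [1, 2, 3]

indexin(x, y)          # => [2, nothing, nothing, 2]    position in y of each element of x
findall(in(y), x)      # => [1, 4]                      indices of x whose values are in y
in.(x, Ref(Set(y)))    # => [true, false, false, true]  boolean mask, building the Set explicitly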

4 Likes

The table gets thrown away after each call to %in%, which is why Martin Maechler and I put in that special case a few years back. I think Gabe Becker has some plans to keep the table around, while it is valid, as part of the ALTREP framework. FWIW, I think leaving it to the user to explicitly make the table, or not, sounds good to me.

6 Likes

in.(x, Ref(y)) does appear pretty canonical, though?

1 Like

I still don’t like the idea of overriding slow user choices of algorithm (say: the user wants to broadcast in over TypeVars or FLINT integers, and needs to work around the known quirk that equality does not imply equality of hashes). But making the canonical spelling fast is probably a one-liner:

julia> using BenchmarkTools
julia> N=10_000; lhs = rand(1:N, N); rhs = rand(1:N, N);
julia> res = in.(lhs, Ref(rhs));

julia> @btime in.(lhs, Ref(rhs));
  39.737 ms (6 allocations: 5.61 KiB)

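# Specialize Base.Broadcast.broadcasted so that in.(lhs, Ref(rhs)) swaps the Ref'd
# array for a Ref'd Set, turning each membership test into an O(1) average hash lookup: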
julia> Base.Broadcast.broadcasted(::typeof(in), lhs::AbstractArray, rhs::Base.RefValue{<:AbstractArray}) = Base.broadcasted(in, lhs, Ref(Set(rhs[])))

julia> res2 = in.(lhs, Ref(rhs));
julia> res==res2
true
julia> @btime in.(lhs, Ref(rhs));
  497.682 μs (15 allocations: 150.30 KiB)

If you like the idea, feel free to make a PR with that single line, so that we can discuss the merits and demerits, and @mbauman can bring out all the subtleties I missed. (Also, the fact that this single line works is a testament to how well designed the broadcasting system is. Big kudos to mbauman and all the other people involved in designing it!)

9 Likes

I do like the idea, will do!

I completely agree with this, but I dislike the idea of making this the default.

It is not obvious to me that one always wants to build a hash table, and doing it on demand is very easy, as you and others have pointed out.

I think the right way to design APIs in Julia is not to hide the magic from the user, but to make it easily available in a composable and safe manner, so that everyone can just use it transparently. That is exactly what is happening with Ref(Set(...)).

This also has the added pedagogical benefit of introducing hash tables to users who may not have heard of them. Paradoxically, a typical R user uses hash tables all the time without knowing about the concept, and at the same time misses a lot of great opportunities for using them because they are so well hidden. I think that exposing them effortlessly is the proper way for Julia.

3 Likes

I disagree with this. Whether or not to construct a hash table in the middle of a broadcasted in call is an implementation detail. There is no difference in terms of the result or the type being returned to the user. Implementations are always hidden from the user.
Requiring broadcasted in on Vectors to be slow because Vectors don’t have hash tables and the user should know this seems to me an unnecessarily purist approach, and making code slow for purely pedagogical reasons seems needless.

Where I do agree completely is that Vectors should not secretly lug around hash tables, nor should there be hidden tricks that distort the user’s ability to reason about code. But the specific implementation of in does not affect this.

6 Likes

Man, I’m really torn on this—I can see it both ways. I would strongly suggest opening a PR or an issue to discuss.

6 Likes

One could perhaps imagine a package which, when loaded, overloads functions like the one in this discussion and prints a warning, e.g.

in.(x,Ref(y))  # takes a long time
using PerformancePirates
in.(x,Ref(y)) 
┌ Warning: in.(x,Ref(y)) can be performed faster by in.(x,Ref(Set(y)))
└ @ Main REPL[1]:1

One could also imagine adding this case to Traceur.jl
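
For concreteness, one hypothetical way such a pirate overload could be written (the package name is made up, as above; this variant warns once and then applies the Set trick from earlier in the thread):

module PerformancePirates

# Deliberate type piracy: intercept in.(x, Ref(y)) when y is a plain array,
# warn once, and forward to the faster Set-based broadcast.
function Base.Broadcast.broadcasted(::typeof(in), x::AbstractArray,
                                    y::Base.RefValue{<:AbstractArray})
    @warn "in.(x, Ref(y)) can be performed faster by in.(x, Ref(Set(y)))" maxlog=1
    return Base.Broadcast.broadcasted(in, x, Ref(Set(y[])))
end

end # module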

7 Likes

Possibly, but what I would emphasize here is that it is about choosing an algorithm. Unfortunately, this is somewhat difficult to automate, which is why many Julia APIs expose these choices (even when offering a sensible default).

Calling it “making code slow” possibly misrepresents my position. It is just not applying an optimization which itself has a cost, and the break-even point is unclear and difficult to tune (Set also allocates, and not everyone wants that to happen in all contexts).

The pedagogical benefit is just an additional extra. If you think that I would want to punish users so that they learn something new, it is possible that you misunderstood what I said.

5 Likes

I can see both sides of this argument as well. It certainly seems like a good example of how the broadcasting machinery can be leveraged to improve performance.

I’ll do so tomorrow if nobody beats me to it. The code is already written by @foobar_lv2

I use in quite frequently but never considered using Ref and Set. I don’t want to go off-topic, but can someone quickly explain how Ref works, and how a Set makes it faster (i.e. how does a hash table speed up the lookup)?

2 Likes

I might have been stretching the point slightly to make the case 🙂

I do see the attractiveness of solving this at the broadcasted level, particularly because it shows off its flexibility so well. But I really do think it’s misguided to want to make this choice for the user.

For example, have a look at these timings:

julia> a = [1,2,3];

julia> b = rand(1:10, 1000);

julia> @btime in.($b, Ref($a));
  2.979 μs (3 allocations: 4.42 KiB)

julia> @btime in.($b, Ref(Set($a)));
  6.300 μs (8 allocations: 4.91 KiB)

It’s hard to avoid the conclusion that, if we construct a hash table behind the scenes, we’d have to give the user a way to opt out. In other words, whatever default we choose, there need to be two related patterns that allow for these two options. I honestly can’t think of any way to make this distinction clearer for our users than just what it is: the difference between a and Set(a).
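
Concretely, the two related spellings would simply be:

in.(b, Ref(a))        # linear scan of a for each element of b; cheapest when a is tiny
in.(b, Ref(Set(a)))   # pay once to build the hash table, then O(1) average lookups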

(Incidentally, it’s nothing more than a lucky coincidence that in.(b, Ref(a)) can be transformed in this way at compile time, whereas map(in(a), b) can’t.)

9 Likes

One interesting thing to read is The Law of Leaky Abstractions – Joel on Software. A leaky abstraction is exactly what we see here: performance is a very typical observable effect that betrays the implementation. As such, it is never truly hidden from the user.

6 Likes

julia> Base.map(f::Base.Fix2{typeof(in), <:AbstractArray}, b::AbstractArray) = map(in(Set(f.x)), b)

4 Likes

TIL! That’s quite cool. The rest of my point still stands 🙂