Julia's in.() seems slow compared to R's %in%


Hi everyone, I want to check which elements of vector x occur in vector y.

In R, I can do this as follows:

> x <- c("a", "b", "c")
> y <- c("b", "a")
> x %in% y

In Julia, I can do this as follows:

julia> x = ["a", "b", "c"];
julia> y = ["b", "a"];
julia> in.(x, Ref(y))
3-element BitArray{1}:
  true
  true
 false

Now let’s make the vector big:
In R,

> set.seed(123)
> x <- rnorm(100000, 1000, 2)
> y <- rnorm(100000, 1000, 2)
> system.time(x %in% y)
   user  system elapsed 
  0.009   0.005   0.015 

In Julia,

julia> using Distributions
julia> using Random
julia> Random.seed!(123);
julia> x = rand(Normal(1000, 2), 100_000);
julia> y = rand(Normal(1000, 2), 100_000);
julia> @time in.(x, Ref(y));
  5.612967 seconds (10 allocations: 16.813 KiB)

I’m not sure if I missed something. Here’s the version info of my Julia and R

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA = /Applications/Julia-1.0.app/Contents/Resources/julia/bin/julia

and R

> version
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
major          3                           
minor          5.3                         
year           2019                        
month          03                          
day            11                          
svn rev        76217                       
language       R                           
version.string R version 3.5.3 (2019-03-11)
nickname       Great Truth                 



Use something with quicker lookup, e.g. a Set. You pay the cost of building a hash table once, and then each lookup becomes O(1) on average.

Also, I am not sure that looking up independent draws of normals is a reasonable benchmark, since theoretically the probability of a match is 0, and numerically it is very small. Try something like

x = unique(rand(1:1000, 500))
y = unique(rand(1:1000, 500))
x .∈ Ref(Set(y))

Good point on the unique values, but I don’t think it’s fair to require a Set to compare against R’s plain vector.
However, timing a single call with @time is misleading, since it includes compilation time. Here’s a more comparable example, with deterministic numbers so we compare the same vectors in R and Julia:

x <- 1:1000000
y <- (1:1000)^2
system.time(x %in% y)
#    user  system elapsed
#   0.071   0.008   0.083

using BenchmarkTools
x = collect(1:1_000_000)
y = (1:1000).^2
@btime in.($x, Ref($y))
#  427.805 ms (7 allocations: 126.48 KiB)

R is still like 5 times faster.


Why? R uses a hash table too. See the relevant C source.

In general, I find the demand for this kind of “fairness” misguided. R’s heavily special cased and optimized C implementation (remember, R has about 5 atomic types) uses hash tables, but since they are hidden from the user, we are not supposed to compare to Julia’s hash tables, which are written in Julia and available in Base?


Hmm, that’s a fair point. My intention was that ignoring the cost of building the hash table would be unfair, since what the user has is a Vector. However, that cost seems negligible in this example (possibly because y is quite small):

f(x, y) = in.(x, Ref(Set(y)))
@btime f($x, $y)
#  10.440 ms (16 allocations: 172.38 KiB)

Which is still like 8 times faster than R.


Does anyone understand how R decides that a table should be used? It seems like there are cases where this would be better and cases where it would be worse. Does every vector in R have a hash table associated with it? That seems… costly.


The condition is in the code @Tamas_Papp linked to. That exact condition determines when hashing is not used: in short, if we disregard the incomparables parameter of match, it is when you look up a single value.


As @bkamins pointed out, they special-case singleton vectors. I guess this could be fine-tuned, but if you are using R, speed is not a primary concern anyway :wink:

An elegant Julia solution would be to provide an immutable hash table with Dict-like behavior, whose constructor special-cases short collections; tune that, and then use

in.(x, Ref(ImmutableSet(y)))

which would do the right thing automatically.
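A minimal sketch of what such a wrapper could look like, assuming a simple length heuristic (ImmutableLookup and the cutoff of 32 are illustrative names and values; nothing like this exists in Base):

```julia
# Hypothetical lookup wrapper: linear scan for short inputs, Set otherwise.
struct ImmutableLookup{T}
    data::Union{Vector{T}, Set{T}}
    function ImmutableLookup(v::AbstractVector{T}; cutoff::Int = 32) where {T}
        # Below the cutoff a linear scan is usually cheaper than hashing.
        new{T}(length(v) <= cutoff ? collect(v) : Set(v))
    end
end

# `in` just forwards to whichever representation was chosen.
Base.in(x, t::ImmutableLookup) = x in t.data
```

Then in.(x, Ref(ImmutableLookup(y))) would pick the lookup strategy automatically.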


Perhaps I’m being dense, but I don’t understand how this is making the R code faster than what Julia is doing. If they’re implicitly creating a hash table of the RHS, we can, of course, do that too, but it’s very expensive so you want to either do it explicitly (the typical Julian approach), or do it implicitly and then cache that expensive-to-construct hash table somewhere. Is that what R is doing here for better speed? Creating a hash table of a vector the first time it appears on the RHS of %in% and then reusing that hash table for future %in% queries?


I’d do it like this and add some heuristics of size (do it for more than N entries in vector).


in.(x, Ref(y)) is (glossing over things) syntactic sugar for

[in(x[1], y), in(x[2], y), in(x[3], y), ...]

which makes it tricky to build up any persistent data structures, since the calls are independent. While it might be possible to do some optimization to match this specific pattern, it does seem like it is not too bad to have to convert y to a Set.

Isn’t it just that they have a vectorized in, and thus know that there will be multiple lookups against the RHS? As soon as there is more than one, they decide to use a hash table.


We are slower in the case where you do a lot of repeated lookups; as @kristoffer.carlsson noted, R builds the hash table only once because everything happens in a single call of the match function.


But if we cached a hash table the first time someone looked something up in the vector, e.g. with in, then the many subsequent in calls would be fast. Maybe not so costly.
I think the result would be something like an AcceleratedArray?
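A rough sketch of that caching idea, assuming a lazily built Set (CachedLookupVector is an illustrative name, not the actual AcceleratedArrays API):

```julia
# Array wrapper that builds a Set on the first `in` query and reuses it.
mutable struct CachedLookupVector{T} <: AbstractVector{T}
    data::Vector{T}
    set::Union{Nothing, Set{T}}
end
CachedLookupVector(v::Vector{T}) where {T} = CachedLookupVector{T}(v, nothing)

# Minimal AbstractVector interface so it still behaves like a vector.
Base.size(v::CachedLookupVector) = size(v.data)
Base.getindex(v::CachedLookupVector, i::Int) = v.data[i]

function Base.in(x, v::CachedLookupVector)
    v.set === nothing && (v.set = Set(v.data))  # build the hash table once
    return x in v.set
end
```

After the first lookup, every later in against the same wrapper is an O(1) hash probe on average.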


To be clear, we are not going to do this. In Julia, a vector is a vector and not secretly a hash table. If you want to use a hash table lookup, construct a Set object explicitly. I’m just trying to understand how R is getting better performance here.


We do “upgrade” to Set heuristically for certain operations, see https://github.com/JuliaLang/julia/pull/26198 for an example.
The crux here is that this uses dot syntax which gets turned into multiple independent calls. A vectorized version (like R has) would be easy to special case in the same manner as R.
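For illustration, a vectorized version along R’s lines is only a few lines of Julia (vin is a made-up name, not an existing Base function):

```julia
# Vectorized membership test in the spirit of R's %in%: because the
# function sees both vectors at once, it can build the hash table once.
function vin(x::AbstractVector, y::AbstractVector)
    s = Set(y)                     # one-time hash table build, O(length(y))
    return [xi in s for xi in x]   # each lookup is O(1) on average
end
```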


Oh, I finally get it. R automatically uses the hash table implementation when the LHS and RHS are both vectors, since it knows it’s going to be searching multiple times. Makes sense. I’m not sure how clever we’d want to be in this case, especially since all you have to do to get the hash table implementation is write in.(x, Ref(Set(y))) instead of in.(x, Ref(y)).


We don’t need to be clever at all, since there is no reason to make this a function in Base and optimize it. R, and other languages that suffer from the two-language problem, have a lot of these combinations implemented in C and optimized as special cases; we don’t need to.

We have the building blocks, the user can construct fast solutions very easily.

What we can do is make the transition from using canned building blocks to programming simple things easier.


Yes, it honestly did surprise me that the conversion was essentially free. I would only have thought of using the Set if I were deliberately optimising the code, not the first time around.


It is syntactic sugar for

julia> Meta.@lower in.(x, Ref(y))
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Ref(y)
│   %2 = (Base.broadcasted)(in, x, %1)
│   %3 = (Base.materialize)(%2)
└──      return %3
))))

If we really wanted to, nobody would prevent us from adding a new method for materializing broadcasts of typeof(in) between AbstractArray and RefValue{<:AbstractArray}. Nobody would prevent us from having this new method check the lengths of the LHS and RHS and select entirely different algorithms based on that.

That being said, I completely agree that this is a bad idea (not worth the code complexity). Users that want subquadratic performance should write in.(x, Ref(Set(y))), and users that write in.(x, Ref(y)) explicitly asked for quadratic runtime. This is different from R insofar as Julia prides itself on transparency, and this kind of “optimization” would be very weird (e.g. a user might need to work around it because their types don’t have good hash functions, maybe because their isequal comes from factoring out some weird equivalence relation).


I actually do broadcasted in quite frequently, so I guess it would have been helpful to know that I should be doing in.(x, Ref(Set(y))) for better performance.
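For anyone landing here, a quick self-contained check that the two forms agree (only the speed differs):

```julia
x = collect(1:1_000)
y = [3, 14, 159]

slow = in.(x, Ref(y))       # scans y linearly for every element of x
fast = in.(x, Ref(Set(y)))  # builds the Set once, then O(1) lookups

slow == fast  # → true
```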