Julia's in.() seems slow compared to R's %in%

alasaadstat · May 6, 2019, 7:06am

Hi everyone, I want to compare if elements of vector x are in elements of vector y.

In R, I can do this as follows:

> x <- c("a", "b", "c")
> y <- c("b", "a")
> x %in% y
[1]  TRUE  TRUE FALSE

In Julia, I can do this as follows:

julia> x = ["a", "b", "c"];
julia> y = ["b", "c"];
julia> in.(x, Ref(y))
3-element BitArray{1}:
 false
  true
  true

Now let’s make the vector big:
In R,

> set.seed(123)
> x <- rnorm(100000, 1000, 2)
> y <- rnorm(100000, 1000, 2)
> system.time(x %in% y)
   user  system elapsed 
  0.009   0.005   0.015

In Julia,

julia> using Distributions
julia> using Random
julia> Random.seed!(123);
julia> x = rand(Normal(1000, 2), 100_000);
julia> y = rand(Normal(1000, 2), 100_000);
julia> @time in.(x, Ref(y));
  5.612967 seconds (10 allocations: 16.813 KiB)

I’m not sure if I missed something. Here’s the version info of my Julia and R

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA = /Applications/Julia-1.0.app/Contents/Resources/julia/bin/julia

and R

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          5.3                         
year           2019                        
month          03                          
day            11                          
svn rev        76217                       
language       R                           
version.string R version 3.5.3 (2019-03-11)
nickname       Great Truth

Thanks.

Tamas_Papp · May 6, 2019, 7:18am

Use something with quicker lookup, eg a Set. You pay the cost of building a hash table once, then lookup becomes O(1).

Also, I am not sure that looking up independent draws of normals is a reasonable benchmark, since theoretically the probability of a match is 0, and numerically it is very small. Try something like

x = unique(rand(1:1000, 500))
y = unique(rand(1:1000, 500))
x .∈ Ref(Set(y))

mkborregaard · May 6, 2019, 7:42am

Good point on the unique values, but I don’t think it’s fair to require a Set to compare to R’s vector.
But, the timing with @time is wrong, as it includes precompilation time. Here’s a more comparable example, with deterministic numbers so we compare the same vectors in R and Julia:

x <- 1:1000000
y = (1:1000)^2
system.time(x %in% y)
#  bruger   system forløbet
 #  0.071    0.008    0.083

using BenchmarkTools
x = collect(1:1000_000)
y = (1:1000).^2
@btime in.($x, $Ref(y))
#  427.805 ms (7 allocations: 126.48 KiB)

R is still like 5 times faster.

Tamas_Papp · May 6, 2019, 8:01am

Why? R uses a hash table too. See the relevant C source.

In general, I find the demand for this kind of “fairness” misguided. R’s heavily special cased and optimized C implementation (remember, R has about 5 atomic types) uses hash tables, but since they are hidden from the user, we are not supposed to compare to Julia’s hash tables, which are written in Julia and available in Base?

mkborregaard · May 6, 2019, 8:06am

Hmm, that’s a fair point. My intention was that ignoring the cost of building the hash table would be unfair, since what the user has is a Vector. However, that cost seems negligible in this example (possibly because y is quite small):

f(x,y) =  in.(x, Ref(Set(y)))
@btime f($x, $y)
#  10.440 ms (16 allocations: 172.38 KiB)

Which is still like 8 times faster than R.

StefanKarpinski · May 6, 2019, 1:23pm

Does anyone understand how R decides that a table should be used? It seems like there are cases where this would be better and cases where it would be worse. Does every vector in R have a hash table associated with it? That seems… costly.

bkamins · May 6, 2019, 1:34pm

You have the condition in the code @Tamas_Papp linked to. Exactly this condition:

github.com

wch/r-source/blob/56ac60cde219584e4dc46c20d063c9f88715cc12/src/main/unique.c#L910


      
              /* Coerce to a common type; type == NILSXP is ok here.
               * Note that above we coerce factors and "POSIXlt", only to character.
               * Hence, coerce to character or to `higher' type
               * (given that we have "Vector" or NULL) */
              if(TYPEOF(x) >= STRSXP || TYPEOF(table) >= STRSXP) type = STRSXP;
              else type = TYPEOF(x) < TYPEOF(table) ? TYPEOF(table) : TYPEOF(x);
              PROTECT(x	  = coerceVector(x,	type)); nprot++;
              PROTECT(table = coerceVector(table, type)); nprot++;
          
              // special case scalar x -- for speed only :
              if(XLENGTH(x) == 1 && !incomp) {
                int val = nmatch;
                int ntable = LENGTH(table);
                switch (type) {
                case STRSXP: {
          	  SEXP x_val = STRING_ELT(x,0);
          	  for (int i=0; i < ntable; i++) if (Seql(STRING_ELT(table,i), x_val)) {
          		  val = i + 1; break;
          	      }
          	  break; }
                case LGLSXP:

is for not using hashing (in short - if we disregard incomparables parameter in match it is when you have a lookup of a single value).

Tamas_Papp · May 6, 2019, 1:39pm

As @bkamins pointed out, they special-case singleton vectors. I guess this could be fine-tuned, but if you are using R, speed is not a primary concern anyway

An elegant Julia solution could be providing an immutable hash table with Dict-like behavior, with a constructor which special-cases short collections, tune that, and use

in.(x, Ref(ImmutableSet(y)))

which would do the right thing automatically.

StefanKarpinski · May 6, 2019, 1:48pm

Perhaps I’m being dense, but I don’t understand how this is making the R code faster than what Julia is doing. If they’re implicitly creating a hash table of the RHS, we can, of course, do that too, but it’s very expensive so you want to either do it explicitly (the typical Julian approach), or do it implicitly and then cache that expensive-to-construct hash table somewhere. Is that what R is doing here for better speed? Creating a hash table of a vector the first time it appears on the RHS of %in% and then reusing that hash table for future %in% queries?

lobingera · May 6, 2019, 2:00pm

I’d do it like this and add some heuristics of size (do it for more than N entries in vector).

kristoffer.carlsson · May 6, 2019, 2:11pm

in.(x, Ref(y)) is (glossing over things) syntactic sugar for

[in(x[1], y), in(x[2], y), in(x[3], y), ...]

which makes it tricky to built up any persistent datastructures since the calls are independent. While it might be possible to do some optimization to match this specific pattern, it does seems like it is not too bad to have to convert y to a Set.

Isn’t it just that they have a vectorized in and thus know that there will be multiple lookup towards the RHS and as soon as there is more than one they decide to use a hash table.

bkamins · May 6, 2019, 2:16pm

We are slower in the case:

where you do a lot of repeated lookups and as @kristoffer.carlsson noted R builds hash table only once because it is a single call of match function.

mkborregaard · May 6, 2019, 2:22pm

But if we cached a hash table someone looked up in the vector, eg with in, then the many in calls would be fast. Maybe not so costly.
I think the result would be like an AcceleratedArray?

StefanKarpinski · May 6, 2019, 2:23pm

To be clear, we are not going to do this. In Julia, a vector is a vector and not secretly a hash table. If you want to use a hash table lookup, construct a Set object explicitly. I’m just trying to understand how R is getting better performance here.

kristoffer.carlsson · May 6, 2019, 2:28pm

We do “upgrade” to Set heuristically for certain operations, see use empirically found threshold to speed up issubset by creating Set … by twistedcubic · Pull Request #26198 · JuliaLang/julia · GitHub for an example.
The crux here is that this uses dot syntax which gets turned into multiple independent calls. A vectorized version (like R has) would be easy to special case in the same manner as R.

StefanKarpinski · May 6, 2019, 2:33pm

Oh, I finally get it. R automatically uses the hash table implementation when the LHS and RHS are both vectors since it knows it’s going to be searching multiple times. Makes sense. I’m not sure how clever we’d want to be in this case, especially since all you have to do to get the hash table implementation is write in.(x, Set(y)) instead of in.(x, y).

Tamas_Papp · May 6, 2019, 2:37pm

We don’t need to be clever at all, since there is no reason to make this a function in Base and optimize it. R (and other languages) which suffer from the two-language problem have a lot of these combinations implemented in C and optimized as special cases, we don’t need to.

We have the building blocks, the user can construct fast solutions very easily.

What we can do is make the transition from using canned building blocks to programming simple things easier.

mkborregaard · May 6, 2019, 2:49pm

Yes, it honestly did surprise me that the conversion was essentially free. I would only have thought of using the Set if I were deliberately optimising the code, not the first time around

foobar_lv2 · May 6, 2019, 5:13pm

It is syntactic sugar for

julia> Meta.@lower in.(x, Ref(y))
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Ref(y)
│   %2 = (Base.broadcasted)(in, x, %1)
│   %3 = (Base.materialize)(%2)
└──      return %3
))))

If we we really wanted to, nobody would prevent us from adding a new method for materializing broadcasts of typeof(in) between AbstractArray and RefValue{<:AbstractArray}. Nobody would prevent us from having this new method check lengths of lhs and rhs and selecting entirely different algorithms based on that.

That being said, I completely agree that this is a bad idea (not worth the code complexity). Users that want subquadratic performance should write in.(x, Ref(Set(y))), and users that write in.(x, Ref(y)) explicitly asked for quadratic runtime. This is different from R insofar as julia prides itself on transparency, and this kind of “optimization” would be very weird (e.g. the user needs to work around a bug because his types don’t have good hash functions, maybe because their isequal is due to factoring out some weird relation).

tbeason · May 6, 2019, 5:42pm

I actually do broadcasted in quite frequently, so I guess it would have been helpful to know that I should be doing in.(x,Set(y)) for better performance.

Topic		Replies	Views
Fast in.(x, Ref(y)) Performance	4	520	May 18, 2020
Some_thing in some_list New to Julia question	11	489	January 29, 2024
Can I speed up this recursive set implementation? Performance performance , set , autodiff , nerdsnipe	38	714	May 10, 2024
Confusing performance results from simple conditional code New to Julia	17	523	May 12, 2023
What is Ref? New to Julia broadcasting	26	11958	January 19, 2022

Julia's in.() seems slow compared to R's %in%

Related topics