Intersect() is slow for large arrays

innerlee · November 20, 2016, 6:33am

julia> a=rand(1000000);

julia> b=rand(1000000);

julia> @time setdiff(a, setdiff(a, b)) # this is fast
  1.021180 seconds (77.65 k allocations: 100.853 MB, 19.28% gc time)
0-element Array{Float64,1}

julia> @time intersect(a,b) # too long to wait, have to kill it...
^C

I think the current implementation of intersect is not very efficient for daily use… The first time I use this function, I waited for a whole day for its return (just curious for its time )

Tamas_Papp · November 20, 2016, 8:28am

I think that this is because in is very slow for arrays. Try

intersect(Set(a),Set(b))

innerlee · November 20, 2016, 9:53am

yes, Set is very fast!

StefanKarpinski · November 20, 2016, 11:23am

Intersecting sets of random floats is a king of odd thing to do. There’s code in intersect these days for vectors of integers that has a fast path for when the range of integer values is much smaller than the vectors:

julia> using BenchmarkTools

julia> @benchmark intersect(X,Y)
BenchmarkTools.Trial:
  samples:          47
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  80.00 bytes
  allocs estimate:  1
  minimum time:     92.69 ms (0.00% GC)
  median time:      108.12 ms (0.00% GC)
  mean time:        107.63 ms (0.00% GC)
  maximum time:     135.81 ms (0.00% GC)

julia> A = rand(1:100, 10000);

julia> B = rand(1:100, 10000);

julia> @benchmark intersect(A,B)
BenchmarkTools.Trial:
  samples:          3810
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  256.70 kb
  allocs estimate:  14
  minimum time:     911.68 μs (0.00% GC)
  median time:      1.03 ms (0.00% GC)
  mean time:        1.31 ms (5.96% GC)
  maximum time:     298.79 ms (99.61% GC)

In other cases, intersect basically just creates a Set for its smaller argument and then filters its other argument using that Set.

innerlee · November 20, 2016, 2:15pm

My usecase was that, I have two lists of filenames, one of them contains more than 9M items. The random float one was a random example…

stevengj · November 21, 2016, 11:18am

See https://github.com/JuliaLang/julia/issues/13675

Topic		Replies	Views
Approximate set intersection for floating point numbers General Usage question	1	356	May 12, 2021
Union and intersect with unexpected return types Internals & Design bug	5	1016	August 14, 2017
Reduce vs splat with intersect() function performance Performance question	1	547	December 11, 2020
Intersect of 2 arrays New to Julia	7	1200	January 15, 2021
Iterate and intersect on BitSet vs Int Performance	11	1376	October 27, 2019

Intersect() is slow for large arrays

Related topics