Intersect() is slow for large arrays


#1
julia> a=rand(1000000);

julia> b=rand(1000000);

julia> @time setdiff(a, setdiff(a, b)) # this is fast
  1.021180 seconds (77.65 k allocations: 100.853 MB, 19.28% gc time)
0-element Array{Float64,1}

julia> @time intersect(a,b) # too long to wait, have to kill it...
^C

I think the current implementation of intersect is not very efficient for daily use… The first time I use this function, I waited for a whole day for its return (just curious for its time :slight_smile: )


#2

I think that this is because in is very slow for arrays. Try

intersect(Set(a),Set(b))

#3

yes, Set is very fast!


#4

Intersecting sets of random floats is a king of odd thing to do. There’s code in intersect these days for vectors of integers that has a fast path for when the range of integer values is much smaller than the vectors:

julia> using BenchmarkTools

julia> @benchmark intersect(X,Y)
BenchmarkTools.Trial:
  samples:          47
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  80.00 bytes
  allocs estimate:  1
  minimum time:     92.69 ms (0.00% GC)
  median time:      108.12 ms (0.00% GC)
  mean time:        107.63 ms (0.00% GC)
  maximum time:     135.81 ms (0.00% GC)

julia> A = rand(1:100, 10000);

julia> B = rand(1:100, 10000);

julia> @benchmark intersect(A,B)
BenchmarkTools.Trial:
  samples:          3810
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  256.70 kb
  allocs estimate:  14
  minimum time:     911.68 μs (0.00% GC)
  median time:      1.03 ms (0.00% GC)
  mean time:        1.31 ms (5.96% GC)
  maximum time:     298.79 ms (99.61% GC)

In other cases, intersect basically just creates a Set for its smaller argument and then filters its other argument using that Set.


#5

My usecase was that, I have two lists of filenames, one of them contains more than 9M items. The random float one was a random example…


#6

See https://github.com/JuliaLang/julia/issues/13675