julia> a=rand(1000000);
julia> b=rand(1000000);
julia> @time setdiff(a, setdiff(a, b)) # this is fast
1.021180 seconds (77.65 k allocations: 100.853 MB, 19.28% gc time)
0-element Array{Float64,1}
julia> @time intersect(a,b) # too long to wait, have to kill it...
^C
I think the current implementation of intersect is not very efficient for daily use… The first time I use this function, I waited for a whole day for its return (just curious for its time )
I think that this is because in
is very slow for arrays. Try
intersect(Set(a),Set(b))
1 Like
Intersecting sets of random floats is a king of odd thing to do. There’s code in intersect these days for vectors of integers that has a fast path for when the range of integer values is much smaller than the vectors:
julia> using BenchmarkTools
julia> @benchmark intersect(X,Y)
BenchmarkTools.Trial:
samples: 47
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 80.00 bytes
allocs estimate: 1
minimum time: 92.69 ms (0.00% GC)
median time: 108.12 ms (0.00% GC)
mean time: 107.63 ms (0.00% GC)
maximum time: 135.81 ms (0.00% GC)
julia> A = rand(1:100, 10000);
julia> B = rand(1:100, 10000);
julia> @benchmark intersect(A,B)
BenchmarkTools.Trial:
samples: 3810
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 256.70 kb
allocs estimate: 14
minimum time: 911.68 μs (0.00% GC)
median time: 1.03 ms (0.00% GC)
mean time: 1.31 ms (5.96% GC)
maximum time: 298.79 ms (99.61% GC)
In other cases, intersect basically just creates a Set for its smaller argument and then filters its other argument using that Set.
My usecase was that, I have two lists of filenames, one of them contains more than 9M items. The random float one was a random example…