Complexity/implementation of disjoint-set (union-find) in `DataStructures.jl`

I didn’t see it in the docs and am trying to identify what algorithm is used in the implementation of find (and union) in the DisjointSets object in the DataStructures.jl package.

find
The docs say that “path compression” is used for the find operation, which (combined with union by rank/size) is good enough for the optimal amortized O(α(n)) complexity, but it requires two traversals of the path from node to root. In contrast, the Wikipedia entry notes that path halving or path splitting can be used instead to remove one of the traversals while still retaining the optimal complexity (and they are fast in practice).

Two questions: (1) am I correct in my understanding that the current implementation does not utilize path-halving or path-splitting? (2) would it be worthwhile to have an implementation of path-compression plus path-halving and path-splitting find operations?
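
For concreteness, here is a rough sketch (my own illustration, not the DataStructures.jl code) of the two find variants on a plain parent vector p with p[root] == root: the two-pass compressing find versus the one-pass path-halving find.

# Two-pass path compression: walk to the root, then re-point every node
# on the traversed path directly to it (second traversal).
function find_compress!(p::Vector{Int}, x::Int)
    root = x
    while p[root] != root
        root = p[root]
    end
    while p[x] != root
        p[x], x = root, p[x]
    end
    return root
end

# One-pass path halving: on the way up, re-point each visited node to its
# grandparent, so the path shrinks without a second traversal.
function find_halve!(p::Vector{Int}, x::Int)
    while p[x] != x
        p[x] = p[p[x]]
        x = p[x]
    end
    return x
end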

union
It seems the current implementation does union-by-rank (per Wikipedia). By-rank seems to be more space-efficient than by-size.
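
For reference, a minimal union-by-rank sketch (again my own illustration, reusing the hypothetical find_halve! from above; not the package's code): attach the root of lower rank under the root of higher rank, and only increase a rank on ties.

function union_rank!(p::Vector{Int}, rank::Vector{Int}, x::Int, y::Int)
    rx, ry = find_halve!(p, x), find_halve!(p, y)
    rx == ry && return rx
    if rank[rx] < rank[ry]
        rx, ry = ry, rx                     # make rx the root of higher rank
    end
    p[ry] = rx                              # hang the lower-rank tree under rx
    rank[rx] += (rank[rx] == rank[ry])      # equal ranks: tree got one level deeper
    return rx
end

Since ranks are bounded by log₂ n while sizes can be as large as n, ranks fit in fewer bits, which is presumably what the space argument refers to.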

find

  1. Yes, you are correct. The current implementation does not have the path-halving optimisation.
  2. Implementing path halving (or path splitting) could be a good PR. If possible, follow up with benchmarks demonstrating the improvement in find latency.

union
If you are interested, you can read about more union techniques in this paper. Intuitively, union by rank produces shorter chains than union by size, which I guess is why it is used.

Path compression is already implemented.

I think the most important performance improvement would be to write this as a loop instead of recursion, and then reconsider whether storing ranks is really required (the recursive implementation needs ranks to avoid stack-overflows).
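
To make the stack-overflow point concrete, a recursive full-compression find (a hypothetical sketch, not the package's find_root!) recurses once per node on the path, so without rank- or size-bounded trees the recursion depth can grow to O(n):

function find_recursive!(p::Vector{Int}, x::Int)
    if p[x] != x
        p[x] = find_recursive!(p, p[x])   # recursion depth == path length
    end
    return p[x]
end

An iterative rewrite (two loops: walk to the root, then re-point the path) has no such depth limit, which is what would make dropping the ranks thinkable at all.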

Otherwise, I guess some documentation and soul-searching would be in order, regarding thread-safety of in_same_set and find_root!.

It is obviously not admissible to modify the union-find structure concurrently with anything else, but it is less obvious whether two threads can safely query in_same_set concurrently. (Formally it is forbidden and undefined behavior per LLVM, but the compiler should probably still generate correct, thread-safe code under such benign data races…)


Here’s an implementation of the parallel version:

It follows precisely an article of Jayanti and Tarjan and uses the newest (julia-1.12) @atomic[...] functionality.
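
Very roughly, the lock-free find in that style looks like the following sketch (my simplification, not the linked code; it uses a Vector of Threads.Atomic{Int} parents instead of the 1.12 atomic-array syntax):

using Base.Threads

struct ConcurrentForestSketch
    parent::Vector{Atomic{Int}}    # parent[i][] == i means i is a root
end

ConcurrentForestSketch(n::Integer) =
    ConcurrentForestSketch([Atomic{Int}(i) for i in 1:n])

function find_root(f::ConcurrentForestSketch, x::Integer)
    while true
        p = f.parent[x][]          # parent of x
        gp = f.parent[p][]         # grandparent of x
        p == gp && return p        # p was its own parent, i.e. a root
        # Path splitting via CAS: re-point x to its grandparent; a failed CAS
        # just means another thread already shortened this path.
        atomic_cas!(f.parent[x], p, gp)
        x = p
    end
end

Unions then try to CAS one root's parent pointer onto the other root and retry if the root moved in the meantime.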

@jacob-roth I didn’t spend any time fine-tuning this, so feel free to dive in (it’s a direct translation of the paper), play with it, and maybe open a PR. I’d be happy to move this to DataStructures or elsewhere once performance is okay-ish. Currently:

  • the concurrent version overtakes the sequential one around N = 2^10, with 2N unions (chosen uniformly at random);
  • the concurrent version overtakes DataStructures.IntDisjointSets around N = 2^15.

These are benchmarks for (concurrent, sequential, baseline), with N ranging over powers of 2 from 2^2 to 2^20:

 (2.836 μs, 40.000 ns, 40.000 ns)
 (2.636 μs, 80.000 ns, 60.000 ns)
 (2.675 μs, 170.000 ns, 110.000 ns)
 (2.695 μs, 340.000 ns, 200.000 ns)
 (3.257 μs, 691.000 ns, 401.000 ns)
 (4.750 μs, 1.523 μs, 791.000 ns)
 (6.323 μs, 3.136 μs, 1.623 μs)
 (9.359 μs, 6.393 μs, 3.167 μs)
 (13.678 μs, 13.888 μs, 6.373 μs)
 (22.897 μs, 29.150 μs, 13.417 μs)
 (46.094 μs, 65.373 μs, 27.927 μs)
 (89.522 μs, 156.336 μs, 58.126 μs)
 (150.341 μs, 613.329 μs, 122.872 μs)
 (305.499 μs, 1.371 ms, 618.062 μs)
 (746.811 μs, 2.967 ms, 1.364 ms)
 (1.581 ms, 6.757 ms, 3.211 ms)
 (3.096 ms, 18.163 ms, 7.306 ms)
 (5.862 ms, 43.594 ms, 16.162 ms)
 (11.734 ms, 100.096 ms, 56.273 ms)

This is the current master, with a few small bugs in compression fixed and a slightly different benchmark using Polyester for lighter threading.

# Assumed setup (not shown in the original post):
#   using Random, Polyester, BenchmarkTools, DataStructures
#   const DS = DataStructures          # baseline implementation
#   CDS = the concurrent disjoint-set package discussed above
in_same_set(ds::DataStructures.IntDisjointSets, x, y) = DS.in_same_set(ds, x, y)
in_same_set(ds::CDS.AbstractDisjointSet, x, y) = CDS.same_set(ds, x, y)

function union_bench(ds, r)
    ans = [true for _ in axes(r, 2)]
    for i in axes(r, 2)
        x, y = r[1, i], r[2, i]
        union!(ds, x, y)
        ans[i] = in_same_set(ds, x, y)
    end
    @assert all(ans)
    return ds
end

function union_bench_poly(ds, r)
    ans = [true for _ in axes(r, 2)]
    Polyester.@batch minbatch = 2^7 for i in axes(r, 2)
        x, y = r[1, i], r[2, i]
        union!(ds, x, y)
        ans[i] = in_same_set(ds, x, y)
    end
    @assert all(ans)
    return ds
end

timings = map(2:1:20) do N
    Random.seed!(1234)
    r = rand(1:2^N, 2, 2^(N - 1))
    cds = @benchmark union_bench_poly(x, $r) setup =
        (x = CDS.ConcurrentDisjointSet($(2^N))) evals = 1
    cds1 = @benchmark union_bench(x, $r) setup =
        (x = CDS.ConcurrentDisjointSet($(2^N))) evals = 1
    ds = @benchmark union_bench(x, $r) setup = (x = CDS.DisjointSet($(2^N))) evals = 1
    baseline = @benchmark union_bench(x, $r) setup =
        (x = DS.IntDisjointSets($(2^N))) evals = 1
    # @info N median(cds) median(ds) median(baseline)
    return cds, cds1, ds, baseline
end

results

(threaded execution, concurrent ds + serial execution, serial ds, baseline (DataStructures))
 (120.000 ns, 120.000 ns, 80.000 ns, 60.000 ns)
 (240.000 ns, 240.000 ns, 140.000 ns, 100.000 ns)
 (440.000 ns, 450.000 ns, 270.000 ns, 180.000 ns)
 (871.000 ns, 881.000 ns, 561.000 ns, 341.000 ns)
 (2.183 μs, 1.863 μs, 1.192 μs, 691.000 ns)
 (4.608 μs, 3.576 μs, 2.455 μs, 1.352 μs)
 (5.250 μs, 7.143 μs, 5.170 μs, 2.724 μs)
 (8.285 μs, 14.276 μs, 10.339 μs, 5.550 μs)
 (15.378 μs, 28.862 μs, 22.009 μs, 11.371 μs)
 (27.640 μs, 57.544 μs, 45.713 μs, 23.373 μs)
 (51.463 μs, 119.898 μs, 100.753 μs, 49.189 μs)
 (96.475 μs, 244.163 μs, 350.346 μs, 103.848 μs)
 (181.830 μs, 665.529 μs, 926.805 μs, 325.151 μs)
 (366.777 μs, 1.431 ms, 1.993 ms, 849.415 μs)
 (816.837 μs, 3.060 ms, 4.495 ms, 1.920 ms)
 (1.693 ms, 6.773 ms, 10.365 ms, 4.345 ms)
 (3.497 ms, 15.192 ms, 24.507 ms, 9.430 ms)
 (6.849 ms, 32.876 ms, 55.914 ms, 21.628 ms)
 (13.822 ms, 72.533 ms, 129.908 ms, 78.798 ms)
  • concurrent faster than sequential for N > 2^7;
  • concurrent faster than baseline for N > 2^13;
  • nice scaling on 8 threads: up to 5 times faster at N = 2^20;
  • there’s something wrong with the serial data structure, as it becomes about 2× slower than it should be (the concurrent one in single-threaded execution is 2× faster!).