Fastest way to `unique!`

abulak · October 7, 2020, 8:13pm

Problem formulation

I have an array old::Vector{T} of somewhat complicated structs; I have an array new::Vector{T} of other complicated structs. I want an array result::Vector{T} which

contains a single element for each == equivalence class for elements from old and new, and
preserves the order, i.e. all elements from old proceed in result elements from new; I’d be happiest if the order of old is preserved.

Potential solution

These requirements are realized e.g. by unique!([old; new]), however the problem with unique! is that true equality is rather expensive for structs of type T. Also note that usually new is much (30-100×) larger than old. In terms of sizes: length(old) ~ 100_000, length(new)~10_000_000;

What have I tried

Solution I came up with so far:

pre-hashing: compute some invariants for all objects from old and new, hash them, store hash in the objects (as a field of struct T). Then comparison of saved hashes is used to determine if objects are different; this pre-hashing can be threaded at will and does not blow up the size of those arrays (those invariants are too large to be stored).

Unfortunately (potential) hash collisions or equal objects are still making (single-threaded) unique! slow (on the positive side: my hashing of invariants is perfect so far

I’m looking for

a clever way to multithread unique!, or
a different scheme to achieve similar result, or
(my favourite) for someone to surprise me with a clever solution I’d never think of

tomerarnon · October 7, 2020, 8:29pm

Did you define == yourself on these types? If so, or if you have the option to, I would try to make it cheaper (i.e. looking at fields that are likely to be different early in the comparison, comparing small fields first, etc.). Then, maybe this can help:

unique(Iterators.flatten((old, new)))

In place unique! doesn’t seem to be that helpful in this case anyway, since the very act of [old; new] seems like an unavoidable allocation.

abulak · October 7, 2020, 8:48pm

I did defined == for T and it’s a costly function; the point is that the == relation may identify objects which have completely different presentation, so it’s definitely not correct to compare fields.

Thanks for the tip with unique(Iterators.flatten(...)), but unfortunately it is slower than my use of unique!. Maybe a pseudocode will help to better illustrate the problem. Here produce_new(x,y) is a function that takes two structs and mixes them together to produce a new one.

function generate_thr(S::AbstractVector{T}, produce_new; n=4) where T
    old = unique!([one(first(S)), S...])
    sizes = [length(old)]
    new = similar(old, 0)

    # multi-threaded part:
    for r in 2:n
        resize!(new, (length(old)-1)*length(S))
        for old_idx in 1:length(old)-1
            b = old[old_idx+1]
            Threads.@threads for S_idx in 1:length(S)
                g = produce_new(b, S[S_idx])
                hash(g) # stores hash of invariants in g
                new[length(S)*(old_idx-1)+S_idx] = g
            end
        end

        # single-threaded part:
        append!(old, new)
        resize!(new, 0) # to allow GC to free memory, if necessary
        unique!(old)
        # old = unique(Iterators.flatten((old, new)))
        push!(sizes, length(old))
    end
    return old, sizes
end

Oscar_Smith · October 7, 2020, 8:49pm

A simple way to relatively effectively multi thread this is to make n sets each containing 1/nth of the list, and compute their union.

abulak · October 7, 2020, 8:56pm

I was thinking about partitioning [old; new] into more parts and computing unions in tree-like reduction (which would produce a nice threading opportunities with Threads.@spawn), but couldn’t come up with something working fast ;/ that said I don’t have much experience with the new threading interface…

Henrique_Becker · October 7, 2020, 9:45pm

If I am not wrong, unique! is faster if the elements are sorted (even with the overhead of sorting, it is better to call sort! and then unique! than just unique!). This raises two questions:

Can you define an ordering for your type? It does not need to make much semantic sense, only needs to be consistent and, preferably, fast to compute.
Do you need the elements in both Vectors to stay in the order they already were?
If you can sort each list individually, then you can combine them while removing duplicates in O(n).

abulak · October 7, 2020, 9:52pm

that’s a good idea with sorting; I have an ordering, but it’s not consistent with [old; new] requirement; but maybe this could be tweaked…
I’d prefer them to, but that’s not a strict requirement

following the fib example:

function my_unique(itr, threshold=256)
    length(itr) < threshold && return unique!(itr)
    k = length(itr)÷2
    right = Threads.@spawn my_unique(itr[k+1:end], threshold)
    return union!(my_unique(itr[1:k], threshold), fetch(right))
end

is this the pattern I should follow?

Tamas_Papp · October 9, 2020, 10:41am

Use an OrderedSet, only insert new elements, collect when done.

also preserves order AFAIK and has a cool API to avoid hash table lookups twice.

tkf · October 9, 2020, 4:32pm

FYI ThreadsX.jl has ThreadsX.unique. It should work with many iterator transforms including Iterator.flatten.

Here is a comparison of Base.unique ("base") and ThreadsX.unique ("tx") ran on a two-core machine (some random machine on GitHub Actions):

ID time GC time memory allocations

… … … … …

["unique", "rand(1:10, 1000000)", "base"] 9.666 ms (5%) 832 bytes (1%) 8

["unique", "rand(1:10, 1000000)", "tx"] 5.080 ms (5%) 50.98 KiB (1%) 882

["unique", "rand(1:1000, 1000000)", "base"] 8.653 ms (5%) 65.95 KiB (1%) 27

["unique", "rand(1:1000, 1000000)", "tx"] 5.377 ms (5%) 1.07 MiB (1%) 1186

— https://github.com/tkf/ThreadsX-data/blob/benchmark-results/2020/10/09/041909/result.md#benchmark-report-for-homerunnerworkthreadsxjlthreadsxjl-1

Here is the benchmark script: https://github.com/tkf/ThreadsX.jl/blob/7a98c2407a45a02d818c85729570e47c3dbccd8f/benchmark/bench_unique.jl

Elrod · October 9, 2020, 6:00pm

That link doesn’t work for me (this one does), but probably the easiest way to run the benchmark:

julia> using ThreadsX, BenchmarkTools

julia> include(joinpath(pathof(ThreadsX), "..", "..", "benchmark", "bench_unique.jl"))
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rand(1:10, 1000000)" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "base" => Benchmark(evals=1, seconds=5.0, samples=10000)
          "tx" => Benchmark(evals=1, seconds=5.0, samples=10000)
  "rand(1:1000, 1000000)" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "base" => Benchmark(evals=1, seconds=5.0, samples=10000)
          "tx" => Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(ans)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rand(1:10, 1000000)" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "base" => Trial(6.119 ms)
          "tx" => Trial(485.582 μs)
  "rand(1:1000, 1000000)" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "base" => Trial(5.636 ms)
          "tx" => Trial(1.401 ms)

With 18 threads, I saw an over 12x improvement with 10 unique items, and a 4x improvement with 1000 unique items.

I’d expect to see bigger improvements the more expensive the comparisons.

tkf · October 9, 2020, 6:16pm

Oops. That was using the commit I haven’t pushed. It should work now. Thanks for letting me know.

Yeah, I need to improve scalability somehow. The allocator/GC could be a bottleneck for something like this but I haven’t looked at it carefully.

abulak · October 10, 2020, 12:47pm

Yes, I’ve already stumbled upon ThreadsX. I should shout-out a big THANKS to @tkf for contributing this package!

I may have one question though: at the moment I’m doing

new = ThreadsX.collect((g = produce_new(b,s); hash(g); g) for Iterators.product(old, S))
append!(old, new)
old = ThreadsX.unique(old)

does ThreadsX.unique come with the warranty of preserving the order of elements in old?
i.e. if I know that unique(old) == old can I assume that

old_copy = deepcopy(old)
append!(old, new)
old = ThreadsX.unique(old)
@assert old[1:length(old_copy)] == old_copy

?

abulak · October 10, 2020, 1:48pm

Thanks for running those;
for me the real problems start at sizes one can not @benchmark;
on my i9-9900X:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 10

I get these timings:

# Base.collect; Base.unique
 22.669670 seconds (251.44 M allocations: 11.994 GiB, 24.77% gc time)
# ThreadsX.collect; Base.unique
 12.160706 seconds (252.21 M allocations: 12.222 GiB, 32.47% gc time)
# ThreadsX.collect; ThreadsX.unique
 17.786907 seconds (421.10 M allocations: 15.169 GiB, 44.75% gc time)

for this test sizes = [49, 1777, 57725, 1777541], so the most costly unique is computed on the array of 2743229 elements with 1777541 uniques.

also for bi-threaded == I get

# Base.collect; Base.unique
 30.702327 seconds (263.90 M allocations: 13.323 GiB, 23.79% gc time)
# ThreadsX.collect; Base.unique
 20.626636 seconds (263.81 M allocations: 13.509 GiB, 33.23% gc time)
# ThreadsX.collect; ThreadsX.unique
 21.210533 seconds (433.29 M allocations: 16.518 GiB, 42.18% gc time)

so probably my handling of @spawn in == is still not correct

abulak · October 10, 2020, 2:16pm

@tkf if I may have just one small gripe for ThreadsX?
it’s the length of its deps Most of them seem ok, but DelimitedFiles, Dates, ZygoteRules look out of place

   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
Updating `~/.julia/dev/Groups/Project.toml`
  [ac1d9e8a] + ThreadsX v0.1.7
Updating `~/.julia/dev/Groups/Manifest.toml`
  [dce04be8] + ArgCheck v2.1.0
  [198e06fe] + BangBang v0.3.29
  [34da2185] + Compat v3.19.0
  [a33af91c] + CompositionsBase v0.1.0
  [187b0558] + ConstructionBase v1.0.0
  [9a962f9c] + DataAPI v1.3.0
  [e2d170a0] + DataValueInterfaces v1.0.0
  [244e2a9f] + DefineSingletons v0.1.0
  [22cec73e] + InitialValues v0.2.10
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [1914dd2f] + MacroTools v0.5.5
  [128add7d] + MicroCollections v0.1.0
  [42d2dcc6] + Referenceables v0.1.0
  [ae029012] + Requires v1.1.0
  [efcf1570] + Setfield v0.7.0
  [171d559e] + SplittablesBase v0.1.10
  [3783bdb8] + TableTraits v1.0.0
  [bd369af6] + Tables v1.1.0
  [ac1d9e8a] + ThreadsX v0.1.7
  [28d57a85] + Transducers v0.4.52
  [700de1a5] + ZygoteRules v0.2.0
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [9fa8497b] + Future
  [76f85450] + LibGit2
  [a63ad114] + Mmap
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [ea8e919c] + SHA
  [1a1011a3] + SharedArrays
  [10745b16] + Statistics
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode

tbeason · October 10, 2020, 3:46pm

ThreadsX does not depend on any of those 3 packages you listed. Likely another package in your Project.toml did, or they are secondary dependencies (dependencies of packages that ThreadsX does depend on). Two of the three are stdlib packages though, so not usually a concern with those anyway.

tkf · October 11, 2020, 12:54am

It looks like Dates and DelimitedFiles come from Compat. But I wouldn’t worry about them since they are stdlib. ZygoteRules comes from BangBang and it was kind of nice thing to have for supporting differentiable maybe-inplace functions: [ANN] BangBang.jl: a compatible layer for mutable and immutable data structures (bonus: Zygote.jl support). Minimalistic interface packages like ZygoteRules are the foundation of interoperability in the Julia ecosystem and I would rather cheer that it’s in the dependency of any package (though I should be switching to ChainRulesCore).

Anyway, the heaviest dependency of ThreadsX is Transducers and I really should be dissolving Transducers to smaller packages. Until then, I think everything else in the dependency chain is negligible.

tkf · October 11, 2020, 5:12am

Yeah, the result of ThreadsX.unique is supposed to be identical as long as the user function does not have data races (and there are no relevant bugs in ThreadsX).

abulak · October 11, 2020, 5:33pm

thanks!

ID	time	GC time	memory	allocations
…	…	…	…	…
`["unique", "rand(1:10, 1000000)", "base"]`	9.666 ms (5%)		832 bytes (1%)	8
`["unique", "rand(1:10, 1000000)", "tx"]`	5.080 ms (5%)		50.98 KiB (1%)	882
`["unique", "rand(1:1000, 1000000)", "base"]`	8.653 ms (5%)		65.95 KiB (1%)	27
`["unique", "rand(1:1000, 1000000)", "tx"]`	5.377 ms (5%)		1.07 MiB (1%)	1186