I have an array old::Vector{T} of somewhat complicated structs; I have an array new::Vector{T} of other complicated structs. I want an array result::Vector{T} which
contains a single element for each == equivalence class for elements from old and new, and
preserves the order, i.e. all elements from old proceed in result elements from new; I’d be happiest if the order of old is preserved.
Potential solution
These requirements are realized e.g. by unique!([old; new]), however the problem with unique! is that true equality is rather expensive for structs of type T. Also note that usually new is much (30-100×) larger than old. In terms of sizes: length(old) ~ 100_000, length(new)~10_000_000;
What have I tried
Solution I came up with so far:
pre-hashing: compute some invariants for all objects from old and new, hash them, store hash in the objects (as a field of struct T). Then comparison of saved hashes is used to determine if objects are different; this pre-hashing can be threaded at will and does not blow up the size of those arrays (those invariants are too large to be stored).
Unfortunately (potential) hash collisions or equal objects are still making (single-threaded) unique! slow (on the positive side: my hashing of invariants is perfect so far
I’m looking for
a clever way to multithread unique!, or
a different scheme to achieve similar result, or
(my favourite) for someone to surprise me with a clever solution I’d never think of
Did you define == yourself on these types? If so, or if you have the option to, I would try to make it cheaper (i.e. looking at fields that are likely to be different early in the comparison, comparing small fields first, etc.). Then, maybe this can help:
unique(Iterators.flatten((old, new)))
In place unique! doesn’t seem to be that helpful in this case anyway, since the very act of [old; new] seems like an unavoidable allocation.
I did defined == for T and it’s a costly function; the point is that the == relation may identify objects which have completely different presentation, so it’s definitely not correct to compare fields.
Thanks for the tip with unique(Iterators.flatten(...)), but unfortunately it is slower than my use of unique!. Maybe a pseudocode will help to better illustrate the problem. Here produce_new(x,y) is a function that takes two structs and mixes them together to produce a new one.
function generate_thr(S::AbstractVector{T}, produce_new; n=4) where T
old = unique!([one(first(S)), S...])
sizes = [length(old)]
new = similar(old, 0)
# multi-threaded part:
for r in 2:n
resize!(new, (length(old)-1)*length(S))
for old_idx in 1:length(old)-1
b = old[old_idx+1]
Threads.@threads for S_idx in 1:length(S)
g = produce_new(b, S[S_idx])
hash(g) # stores hash of invariants in g
new[length(S)*(old_idx-1)+S_idx] = g
end
end
# single-threaded part:
append!(old, new)
resize!(new, 0) # to allow GC to free memory, if necessary
unique!(old)
# old = unique(Iterators.flatten((old, new)))
push!(sizes, length(old))
end
return old, sizes
end
I was thinking about partitioning [old; new] into more parts and computing unions in tree-like reduction (which would produce a nice threading opportunities with Threads.@spawn), but couldn’t come up with something working fast ;/ that said I don’t have much experience with the new threading interface…
If I am not wrong, unique! is faster if the elements are sorted (even with the overhead of sorting, it is better to call sort! and then unique! than just unique!). This raises two questions:
Can you define an ordering for your type? It does not need to make much semantic sense, only needs to be consistent and, preferably, fast to compute.
Do you need the elements in both Vectors to stay in the order they already were?
If you can sort each list individually, then you can combine them while removing duplicates in O(n).
@tkf if I may have just one small gripe for ThreadsX?
it’s the length of its deps Most of them seem ok, but DelimitedFiles, Dates, ZygoteRules look out of place
ThreadsX does not depend on any of those 3 packages you listed. Likely another package in your Project.toml did, or they are secondary dependencies (dependencies of packages that ThreadsX does depend on). Two of the three are stdlib packages though, so not usually a concern with those anyway.
It looks like Dates and DelimitedFiles come from Compat. But I wouldn’t worry about them since they are stdlib. ZygoteRules comes from BangBang and it was kind of nice thing to have for supporting differentiable maybe-inplace functions: [ANN] BangBang.jl: a compatible layer for mutable and immutable data structures (bonus: Zygote.jl support). Minimalistic interface packages like ZygoteRules are the foundation of interoperability in the Julia ecosystem and I would rather cheer that it’s in the dependency of any package (though I should be switching to ChainRulesCore).
Anyway, the heaviest dependency of ThreadsX is Transducers and I really should be dissolving Transducers to smaller packages. Until then, I think everything else in the dependency chain is negligible.
Yeah, the result of ThreadsX.unique is supposed to be identical as long as the user function does not have data races (and there are no relevant bugs in ThreadsX).