Hello, I want to rapidly estimate a distance matrix using Jaccard distance (1 - Jaccard similarity) for a large binary matrix (rows are genomes, columns are genes found in common across genomes).
MinHash seems like the perfect algorithm for this problem. I’m not using k-mers but presence/absence, so I need to roll something different from software like mash or sourmash.
Using Minhash.jl, I am able to rapidly calculate the number of shared genes in a pair of genomes using Minhash.intersectionlength(). However, this is not sufficient to calculate Jaccard similarity, since I have to normalize by the union of genes in the given pair of genomes. For MinHash.jl, I am treating each genome as an array of protein sequences (strings) as input for hashing.
I rolled my own union function for the hashes, but this is slow, especially compared to the speed of Minhash.intersectionlength(). My question is: how can I rapidly calculate/estimate Jaccard similarity using MinHash.jl? @jakobnissen
I also have a nagging feeling that I may be using the API for MinHash.jl incorrectly, since the whole point of the MinHash algorithm is to rapidly estimate Jaccard similarity between pairs of sets.
Thanks!
If I understand correctly, you would like something similar to intersection_length
, but for the union instead of the intersection. It can be achieved like this:
function union_length(x::MinHashSketch, y::MinHashSketch)
xi, yi = 1, 1
n = 0
@inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
xv, yv = x.hashes[xi], y.hashes[yi]
n += 1
xi += xv ≤ yv
yi += yv ≤ xv
end
return n
end
In your case you might want to calculate both in one pass:
function combined_length(x::MinHashSketch, y::MinHashSketch)
xi, yi = 1, 1
n_intersections = 0
n_unions = 0
@inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
xv, yv = x.hashes[xi], y.hashes[yi]
n_unions += 1
n_intersections += xv == yv
xi += xv ≤ yv
yi += yv ≤ xv
end
return (n_intersections, n_unions)
end
Note that I haven’t tested these functions, so please test them first!
Do you think I should include a jaccard
function in MinHash?
Thanks, Jakob! Yes, this is along the lines of what I am looking for.
Could you also provide a version of unionlength
with the same type signature as
function intersectionlength(sketches::AbstractVector{MinHashSketch})
?
(I tried to puzzle this out myself, but I’m afraid I’m not sophisticated enough).
a jaccard
function in MinHash would be great too, especially if it could produce a similarity matrix from an AbstractVector{MinHashSketch}!