How to estimate Jaccard similarity/distances with MinHash.jl

Hello, I want to rapidly estimate a distance matrix using Jaccard distance (1 - Jaccard similarity) for a large binary matrix (rows are genomes, columns are genes found in common across genomes).

MinHash seems like the perfect algorithm for this problem. I’m not using k-mers but presence/absence, so I need to roll something different from software like mash or sourmash.

Using Minhash.jl, I am able to rapidly calculate the number of shared genes in a pair of genomes using Minhash.intersectionlength(). However, this is not sufficient to calculate Jaccard similarity, since I have to normalize by the union of genes in the given pair of genomes. For MinHash.jl, I am treating each genome as an array of protein sequences (strings) as input for hashing.

I rolled my own union function for the hashes, but this is slow, especially compared to the speed of Minhash.intersectionlength(). My question is: how can I rapidly calculate/estimate Jaccard similarity using MinHash.jl? @jakobnissen

I also have a nagging feeling that I may be using the API for MinHash.jl incorrectly, since the whole point of the MinHash algorithm is to rapidly estimate Jaccard similarity between pairs of sets.

Thanks!

If I understand correctly, you would like something similar to intersection_length, but for the union instead of the intersection. It can be achieved like this:

function union_length(x::MinHashSketch, y::MinHashSketch)
    xi, yi = 1, 1
    n = 0
    @inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
        xv, yv = x.hashes[xi], y.hashes[yi]
        n += 1
        xi += xv ≤ yv
        yi += yv ≤ xv
    end
    return n
end

In your case you might want to calculate both in one pass:

function combined_length(x::MinHashSketch, y::MinHashSketch)
    xi, yi = 1, 1
    n_intersections = 0
    n_unions = 0
    @inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
        xv, yv = x.hashes[xi], y.hashes[yi]
        n_unions += 1
        n_intersections += xv == yv
        xi += xv ≤ yv
        yi += yv ≤ xv
    end
    return (n_intersections, n_unions)
end

Note that I haven’t tested these functions, so please test them first!

Do you think I should include a jaccard function in MinHash?

Thanks, Jakob! Yes, this is along the lines of what I am looking for.

Could you also provide a version of unionlength with the same type signature as
function intersectionlength(sketches::AbstractVector{MinHashSketch}) ?

(I tried to puzzle this out myself, but I’m afraid I’m not sophisticated enough).

a jaccard function in MinHash would be great too, especially if it could produce a similarity matrix from an AbstractVector{MinHashSketch}!