Hello, I want to rapidly estimate a distance matrix using Jaccard distance (1 - Jaccard similarity) for a large binary matrix (rows are genomes, columns are genes found in common across genomes).

MinHash seems like the perfect algorithm for this problem. I’m not using k-mers but presence/absence, so I need to roll something different from software like mash or sourmash.

Using Minhash.jl, I am able to rapidly calculate the number of shared genes in a pair of genomes using Minhash.intersectionlength(). However, this is not sufficient to calculate Jaccard similarity, since I have to normalize by the union of genes in the given pair of genomes. For MinHash.jl, I am treating each genome as an array of protein sequences (strings) as input for hashing.

I rolled my own union function for the hashes, but this is slow, especially compared to the speed of Minhash.intersectionlength(). My question is: how can I rapidly calculate/estimate Jaccard similarity using MinHash.jl? @jakobnissen

I also have a nagging feeling that I may be using the API for MinHash.jl incorrectly, since the whole point of the MinHash algorithm is to rapidly estimate Jaccard similarity between pairs of sets.

Thanks!

If I understand correctly, you would like something similar to `intersection_length`

, but for the union instead of the intersection. It can be achieved like this:

```
function union_length(x::MinHashSketch, y::MinHashSketch)
xi, yi = 1, 1
n = 0
@inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
xv, yv = x.hashes[xi], y.hashes[yi]
n += 1
xi += xv ≤ yv
yi += yv ≤ xv
end
return n
end
```

In your case you might want to calculate both in one pass:

```
function combined_length(x::MinHashSketch, y::MinHashSketch)
xi, yi = 1, 1
n_intersections = 0
n_unions = 0
@inbounds while (xi ≤ length(x.hashes)) & (yi ≤ length(y.hashes))
xv, yv = x.hashes[xi], y.hashes[yi]
n_unions += 1
n_intersections += xv == yv
xi += xv ≤ yv
yi += yv ≤ xv
end
return (n_intersections, n_unions)
end
```

Note that I haven’t tested these functions, so please test them first!

Do you think I should include a `jaccard`

function in MinHash?

Thanks, Jakob! Yes, this is along the lines of what I am looking for.

Could you also provide a version of `unionlength`

with the same type signature as

`function intersectionlength(sketches::AbstractVector{MinHashSketch})`

?

(I tried to puzzle this out myself, but I’m afraid I’m not sophisticated enough).

a `jaccard`

function in MinHash would be great too, especially if it could produce a similarity matrix from an AbstractVector{MinHashSketch}!