Fast Jaccard Distance between sets of strings?

My friend needs to compute the Jaccard distance between sets of strings.
There are hundreds of sets, each containing thousands of strings, and each string is at most 20 characters long. He has thousands of datasets of this type.
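
For reference, the Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. A minimal sketch of the definition in Julia, with made-up example sets (the helper jdist is mine, not from either implementation below):

# Jaccard distance straight from the set definition
A = Set(["a", "b", "c", "d"])
B = Set(["a", "c", "d", "f"])
jdist(A, B) = 1 - length(intersect(A, B)) / length(union(A, B))
jdist(A, B)  # 0.4: the intersection has 3 elements, the union has 5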

He has a function to do it in Python, but I would like to show that Julia can do it faster :smile:

His Python function looks like this:

import pandas as pd
from collections import Counter
from scipy.spatial.distance import pdist, squareform

def calcDistances(d):
    # one row per key, one column per distinct string, counts as entries
    df = pd.DataFrame({k: Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
    jaccard_distances = pdist(df, metric='jaccard')    # condensed distance vector
    jaccard_distances = squareform(jaccard_distances)  # full square matrix
    resdf = pd.DataFrame(jaccard_distances, index=df.index, columns=df.index)
    results = resdf.to_dict()
    return results

Here’s my attempt in Julia, which skips the final conversion back to a dict but already takes twice as long:

using Distances, DataFrames, StatsBase

function j1(g)
    # one countmap (a one-row DataFrame) per set; vcat with cols=:union aligns
    # the columns, and coalesce replaces the resulting missings with 0
    x = coalesce.(vcat(DataFrame.([countmap(x) for x in values(g)])..., cols = :union), 0)
    # pairwise Jaccard distances between the rows
    pairwise(Jaccard(), Matrix(x), dims=1)
end

On a large dataset the Python function runs in 7.5 sec, whereas my Julia function takes 14 sec.

julia> @time j1(l1);
 13.485393 seconds (34.85 M allocations: 2.139 GiB, 10.61% gc time)

Most of the time is spent building the DataFrame:

julia> @time vcat(DataFrame.([countmap(x) for x in values(l1)])..., cols = :union);
 10.882624 seconds (34.69 M allocations: 1.940 GiB, 7.80% gc time, 0.81% compilation time)

How can I make this faster?

In the details below I show what the functions are doing on a small dataset:

Python:

import collections
from scipy.spatial import distance
import pandas as pd
from scipy.spatial.distance import pdist, jaccard
from scipy.spatial.distance import squareform
from collections import Counter

testDict = collections.defaultdict(set)
testDict["EFB1"]=set(['a','b','c','d','e','e'])
testDict["EFB2"]=set(['a','f','c','d'])
testDict["EFB3"]=set(['a','f'])
d = testDict
>>> df = pd.DataFrame({k:Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
>>> df
      c  e  b  d  a  f
EFB1  1  1  1  1  1  0
EFB2  1  0  0  1  1  1
EFB3  0  0  0  0  1  1


>>> jaccard_distances = pdist(df, metric='jaccard')
>>> jaccard_distances
array([0.5       , 0.83333333, 0.5       ])
>>> jaccard_distances = squareform(jaccard_distances)
>>> jaccard_distances
array([[0.        , 0.5       , 0.83333333],
       [0.5       , 0.        , 0.5       ],
       [0.83333333, 0.5       , 0.        ]])

>>> resdf=pd.DataFrame(jaccard_distances, index=df.index, columns=df.index)
>>> resdf
          EFB1  EFB2      EFB3
EFB1  0.000000   0.5  0.833333
EFB2  0.500000   0.0  0.500000
EFB3  0.833333   0.5  0.000000

And for Julia:

t1 = Dict(
"EFB1" => Set(["a","b","c","d","e","e"]),
"EFB2" => Set(["a","f","c","d"]),
"EFB3" => Set(["a","f"]),
)
x = coalesce.(vcat(DataFrame.([countmap(x) for x in values(t1)])..., cols = :union), 0)

# 3×6 DataFrame
#  Row │ a      f      b      c      d      e
#      │ Int64  Int64  Int64  Int64  Int64  Int64
# ─────┼──────────────────────────────────────────
#    1 │     1      1      0      0      0      0
#    2 │     1      0      1      1      1      1
#    3 │     1      1      0      1      1      0

pairwise(Jaccard(), Matrix(x), dims=1)
#  3×3 Matrix{Float64}:
#  0.0       0.833333  0.5
#  0.833333  0.0       0.5
#  0.5       0.5       0.0

Build fewer temporary data structures.

julia> orig(sets) = coalesce.(vcat(DataFrame.([countmap(x) for x in values(sets)])...,cols = :union),0)
orig (generic function with 1 method)


julia> @time orig(t1)  # 2nd time not to measure compilation
  0.000239 seconds (468 allocations: 33.062 KiB)

julia> function bar(sets)
           # one zero-filled column per distinct string, one row per set
           df = DataFrame([c => zeros(Int, length(sets)) for c in union(values(sets)...)], copycols=false)
           populateRows!(df, sets)
       end

julia> function populateRows!(df, sets)
           for (row, set) in enumerate(values(sets))
               for col in set
                   df[row, col] = 1
               end
           end
           df
       end


julia> @time bar(t1)  # second run not to measure compilation
   0.000040 seconds (46 allocations: 3.719 KiB)
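
To use bar with the distance computation from the question, one would presumably convert its DataFrame to a Matrix and call pairwise as before; a small sketch (not part of the answer above, assumes Distances.jl is loaded):

# feed bar's 0/1 table to pairwise, rows in values(t1) order
pairwise(Jaccard(), Matrix(bar(t1)), dims=1)
# 3×3 Matrix{Float64} of pairwise Jaccard distances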

I would not use a DataFrame for this task (you could make your DataFrames.jl code faster, but in this case it is better not to use it at all if you need performance).

Instead, I recommend using a Matrix directly:

julia> function fun(t1)
           colmap = Dict(x => i for (i, x) in enumerate(union(values(t1)...)))
           mat = zeros(Int, length(t1), length(colmap))
           for (i, x) in enumerate(values(t1))
               for v in x
                   mat[i, colmap[v]] += 1
               end
           end
           return mat
       end
fun (generic function with 1 method)

julia> @time fun(t1)
  0.115027 seconds (169.71 k allocations: 9.692 MiB, 99.98% compilation time)
3×6 Matrix{Int64}:
 1  0  0  0  1  0
 0  1  1  1  1  1
 1  1  0  0  1  1

julia> @time fun(t1)
  0.000015 seconds (17 allocations: 1.266 KiB)
3×6 Matrix{Int64}:
 1  0  0  0  1  0
 0  1  1  1  1  1
 1  1  0  0  1  1
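
Combining fun with pairwise and keeping the keys around would mirror what the Python function returns; a rough sketch (the j2 name and the key bookkeeping are assumptions of mine, not code from this thread; assumes Distances.jl is loaded):

function j2(g)
    ks = collect(keys(g))  # row labels, in the same iteration order as values(g)
    colmap = Dict(x => i for (i, x) in enumerate(union(values(g)...)))
    mat = zeros(Int, length(g), length(colmap))
    for (i, s) in enumerate(values(g))
        for v in s
            mat[i, colmap[v]] += 1
        end
    end
    return ks, pairwise(Jaccard(), mat, dims=1)  # ks[i] labels row/column i
end

ks, D = j2(t1)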

Thank you very much, @Jeff_Emanuel and @bkamins!

With the optimization by @bkamins, my Julia version now takes only half the runtime of the Python one. Julia rules :smile:

The best part of Julia is the community!

Thanks again.
