My friend needs to compute the jaccard distance between sets of strings.
The number of sets is hundreds and the number of strings in each set is thousands. The length of the strings is <= 20. He has thousands of datasets of this type.
He has a function to do it in python, but I would like to show that Julia can do it faster
His python function looks like this:
from scipy.spatial import distance
import pandas as pd
from scipy.spatial.distance import pdist, jaccard
from scipy.spatial.distance import squareform
from collections import Counter
def calcDistances(d):
df = pd.DataFrame({k:Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
jaccard_distances = pdist(df, metric='jaccard')
jaccard_distances = squareform(jaccard_distances)
resdf=pd.DataFrame(jaccard_distances, index=df.index, columns=df.index)
results=resdf.to_dict()
return(results)
Hereβs my attempt in Julia, which does not do the final conversion back to dict, but already takes twice as long:
using Distances, DataFrames, StatsBase
function j1(g)
x = coalesce.(vcat(DataFrame.([countmap(x) for x in values(g)])..., cols = :union), 0)
pairwise(Jaccard(), Matrix(x), dims=1)
end
On a large dataset the python function runs in 7.5 sec where my Julia function takes 14 sec.
julia> @time j1(l1);
13.485393 seconds (34.85 M allocations: 2.139 GiB, 10.61% gc time)
Most of the time is building the DataFrame:
julia> @time vcat(DataFrame.([countmap(x) for x in values(l1)])..., cols = :union);
10.882624 seconds (34.69 M allocations: 1.940 GiB, 7.80% gc time, 0.81% compilation time)
How can I make this faster?
In the details below I show what the functions are doing on a small data set:
Python:
import collections
from scipy.spatial import distance
import pandas as pd
from scipy.spatial.distance import pdist, jaccard
from scipy.spatial.distance import squareform
from collections import Counter
testDict = collections.defaultdict(set)
testDict["EFB1"]=set(['a','b','c','d','e','e'])
testDict["EFB2"]=set(['a','f','c','d'])
testDict["EFB3"]=set(['a','f'])
d = testDict
>>> df = pd.DataFrame({k:Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
>>> df
c e b d a f
EFB1 1 1 1 1 1 0
EFB2 1 0 0 1 1 1
EFB3 0 0 0 0 1 1
>>> jaccard_distances = pdist(df, metric='jaccard')
array([0.5 , 0.83333333, 0.5 ])
>>> jaccard_distances = squareform(jaccard_distances)
>>> jaccard_distances
array([[0. , 0.5 , 0.83333333],
[0.5 , 0. , 0.5 ],
[0.83333333, 0.5 , 0. ]])
>>> resdf=pd.DataFrame(jaccard_distances, index=df.index, columns=df.index)
>>> resdf
EFB1 EFB2 EFB3
EFB1 0.000000 0.5 0.833333
EFB2 0.500000 0.0 0.500000
EFB3 0.833333 0.5 0.000000
And for Julia
t1 = Dict(
"EFB1" => Set(["a","b","c","d","e","e"]),
"EFB2" => Set(["a","f","c","d"]),
"EFB3" => Set(["a","f"]),
)
x = coalesce.(vcat(DataFrame.([countmap(x) for x in values(t1)])..., cols = :union), 0)
# 3Γ6 DataFrame
# Row β a f b c d e
# β Int64 Int64 Int64 Int64 Int64 Int64
# ββββββΌββββββββββββββββββββββββββββββββββββββββββ
# 1 β 1 1 0 0 0 0
# 2 β 1 0 1 1 1 1
# 3 β 1 1 0 1 1 0
pairwise(Jaccard(), Matrix(x), dims=1)
# 3Γ3 Matrix{Float64}:
# 0.0 0.833333 0.5
# 0.833333 0.0 0.5
# 0.5 0.5 0.0