Jaccard distance of each combo of elements in 2 lists

What’s the most efficient way to get the Jaccard distance of each combo of elements in 2 lists?

Let’s say I have 2 arrays that look like this:

x =
"ap p le"
"or an ge"
"ap pl e"
"ap pl e"
"or an ge"
"ho ne yc ri sp ap pl e"
y =
"ho ne yc ri sp ap pl e"
"ho ne yc ri sp ap pl e"
"or an ge"
"ap pl e"
"or an ge"
"ho ne yc ri sp ap pl e"

I’m trying to get the Jaccard distance for each pair of words (the first pair being "ap p le" and "ho ne yc ri sp ap pl e"). So far I have a simple function:

map(x, y) do s1, s2
evaluate(Jaccard(2), x, y)
end

This prints out 5 values, all of them .5. However, when I evaluate each pair of words individually, I get: .7, .962, 1.00, 0, 0, 0.

My goal is to find an efficient way to print out x,y, jaccard_distance. Basically 3 columns, with the last being the following Jaccard distances: .7, .962, 1.00, 0, 0, 0.


You can use dot-broadcasting to evaluate distances, and either a table library or DataFrames for tabulated printing:

julia> using StringDistances, DataFrames

julia> x = ["ap p le", "or an ge", "ap pl e", "ap pl e", "or an ge", "ho ne yc ri sp ap pl e"];

julia> y = ["ho ne yc ri sp ap pl e", "ho ne yc ri sp ap pl e", "or an ge", "ap pl e", "or an ge", "ho ne yc ri sp ap pl e"];

julia> dist = evaluate.((Jaccard(2),), x, y);

julia> DataFrame(x = x, y = y, JaccardDistances = dist)
6×3 DataFrame
│ Row │ x                      │ y                      │ JaccardDistances │
│     │ String                 │ String                 │ Float64          │
├─────┼────────────────────────┼────────────────────────┼──────────────────┤
│ 1   │ ap p le                │ ho ne yc ri sp ap pl e │ 0.863636         │
│ 2   │ or an ge               │ ho ne yc ri sp ap pl e │ 0.961538         │
│ 3   │ ap pl e                │ or an ge               │ 1.0              │
│ 4   │ ap pl e                │ ap pl e                │ 0.0              │
│ 5   │ or an ge               │ or an ge               │ 0.0              │
│ 6   │ ho ne yc ri sp ap pl e │ ho ne yc ri sp ap pl e │ 0.0              │
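For anyone wondering where these numbers come from, here is a minimal sketch in plain Julia (no packages) of a set-based Jaccard distance over 2-character substrings. The helper names `qgrams` and `jaccard_distance` are made up for illustration and are not part of StringDistances. The sketch assumes the distance is 1 - |A ∩ B| / |A ∪ B| over the sets of distinct bigrams, which reproduces the values in the table above for this data.

```julia
# Hypothetical helper: the set of distinct q-character substrings of `s`.
function qgrams(s::AbstractString, q::Integer = 2)
    chars = collect(s)  # index by character, not by byte
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Hypothetical helper: set-based Jaccard distance, 1 - |A ∩ B| / |A ∪ B|.
function jaccard_distance(a::AbstractString, b::AbstractString, q::Integer = 2)
    A, B = qgrams(a, q), qgrams(b, q)
    # Assumption: two strings too short to have any q-gram count as identical.
    isempty(A) && isempty(B) && return 0.0
    return 1 - length(intersect(A, B)) / length(union(A, B))
end

x = ["ap p le", "or an ge", "ap pl e", "ap pl e", "or an ge", "ho ne yc ri sp ap pl e"]
y = ["ho ne yc ri sp ap pl e", "ho ne yc ri sp ap pl e", "or an ge", "ap pl e", "or an ge", "ho ne yc ri sp ap pl e"]

# Pair elements by position; the body uses the closure arguments s1 and s2.
dist = map(x, y) do s1, s2
    jaccard_distance(s1, s2)
end

# Print three tab-separated columns: x, y, distance.
for (s1, s2, d) in zip(x, y, dist)
    println(s1, '\t', s2, '\t', round(d, digits = 6))
end
```

For the first pair, the two strings share 3 distinct bigrams ("ap", "p " and " p") out of 22 in total, so the distance is 1 - 3/22 ≈ 0.8636. Note that the `map` in the question ignores `s1` and `s2` and passes the whole arrays `x` and `y` on every call, which would explain getting the same value each time.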

This won’t go down in history as the fastest solution, but does this help?

Xs = ["ap p le","or an ge", "ap pl e","ap pl e","or an ge","ho ne yc ri sp ap pl e"]
Ys = ["ho ne yc ri sp ap pl e","ho ne yc ri sp ap pl e","or an ge","ap pl e","or an ge",
    "ho ne yc ri sp ap pl e"]

# Sets of individual characters for each string (not 2-character substrings)
Xs_set,Ys_set = Set.(Xs), Set.(Ys)

using DataFrames
df = DataFrame(:X=> [], :Y=> [], :Jaccard => [])

# Loops over every (x, y) combination, 36 rows here, and stores the
# Jaccard similarity |A ∩ B| / |A ∪ B| rather than the distance
for (x,xset) in zip(Xs,Xs_set), (y,yset) in zip(Ys,Ys_set)
    push!(df, [x, y, length(intersect(xset,yset))/length(union(xset,yset)) ])
end

println(df)

Edit: oh, I’ve misunderstood the question entirely. Well, I tried :D. The question is about sets of 2-character substrings.
