What's the best way to compare items in an array to every other item?

Say I have the following array:

names = ["John Doe", "Bugs Bunny", "Cosmo Kramer", "Jon Doe", "Elmer Fudd", "Kosmo Cramer"]

I would like to compare each string in the array with every other string in the array and compute the distance between the two via the StringDistances.jl package (which will allow me to group similar strings together). For example, "John Doe" and "Jon Doe" are similar enough that I would want to remove both of those elements from the array (and store them somewhere else for later).

I wrote a function that achieves this, but I know that the amazing Julia community can figure out a way to make it more performant! My first attempt to accomplish the above uses recursion and looks like this:

using StringDistances

names = ["John Doe", "Bugs Bunny", "Cosmo Kramer", "Jon Doe", "Elmer Fudd", "Kosmo Cramer"]
all_matches = Array[]

function matches(arr::Array)
    if length(arr) > 1
        indexes = findall(x -> compare(Jaro(), x, arr[1]) > 0.8, arr)
        push!(all_matches, [arr[i] for i in indexes])
        deleteat!(arr, indexes)
        matches(arr)
    else
        return filter(x -> length(x) > 1, all_matches)
    end
end

matched = matches(names)
2-element Array{Array,1}:
 ["John Doe", "Jon Doe"]
 ["Cosmo Kramer", "Kosmo Cramer"]

I don’t know much about Julia performance, but I do know that there is usually a much faster way to do things than whatever I usually come up with :grin:

The very first thing to do is to give all_matches a proper concrete type. Use Vector{String}[].
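
That way the container's element type is concrete instead of the abstract Array:

all_matches = Vector{String}[]   # eltype is Vector{String}
# rather than
all_matches = Array[]            # eltype is the abstract type Array, which defeats type inference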

Ah, thanks!

Well… Deleting from an arbitrary position of an array is pretty slow as it involves data movement. I think a double loop with some bookkeeping would do:

function matches(arr::Vector{String})
    all_matches = Vector{String}[]
    first_unprocessed = 1
    num_unprocessed = nmax = length(arr)
    indexes = Int[]
    unprocessed = trues(nmax)
    while num_unprocessed > 0
        compar = arr[first_unprocessed]
        unprocessed[first_unprocessed] = false
        empty!(indexes)
        push!(indexes, first_unprocessed)
        for j = first_unprocessed + 1:nmax
            if unprocessed[j] && compare(Jaro(), arr[j], compar) > 0.8
                push!(indexes, j)
                (j == first_unprocessed + 1) && (first_unprocessed = j)
                unprocessed[j] = false   # mark j as processed so it is not compared again
                num_unprocessed -= 1
            end
        end
        length(indexes) == 1 || push!(all_matches, [arr[i] for i in indexes])
        first_unprocessed += 1
        num_unprocessed -= 1
    end
    all_matches
end

Thanks for the response, Vasily! On my machine, this version seems to be quite a bit slower than my original function:

all_matches = Vector{String}[]
names = ["John Doe", "Bugs Bunny", "Cosmo Kramer", "Jon Doe", "Elmer Fudd", "Kosmo Cramer"]

function matches(arr::Array)
    if length(arr) > 1
        indexes = findall(x -> compare(Jaro(), x, arr[1]) > 0.8, arr)
        push!(all_matches, [arr[i] for i in indexes])
        deleteat!(arr, indexes)
        matches(arr)
    else
        return filter(x -> length(x) > 1, all_matches)
    end
end

function matches1(arr::Vector{String})
    all_matches = Vector{String}[]
    first_unprocessed = 1
    num_unprocessed = nmax = length(arr)
    indexes = Int[]
    unprocessed = trues(nmax)
    while num_unprocessed > 0
        compar = arr[first_unprocessed]
        unprocessed[first_unprocessed] = false
        empty!(indexes)
        push!(indexes, first_unprocessed)
        for j = first_unprocessed + 1:nmax
            if unprocessed[j] && compare(Jaro(), arr[j], compar) > 0.8
                push!(indexes, j)
                (j == first_unprocessed + 1) && (first_unprocessed = j)
                unprocessed[j] = false   # mark j as processed so it is not compared again
                num_unprocessed -= 1
            end
        end
        length(indexes) == 1 || push!(all_matches, [arr[i] for i in indexes])
        first_unprocessed += 1
        num_unprocessed -= 1
    end
    return all_matches
end

@btime matched1 = matches1(names)
  5.282 μs (54 allocations: 6.09 KiB)
  2-element Array{Array{String,1},1}:
  ["John Doe", "Jon Doe"]
  ["Cosmo Kramer", "Kosmo Cramer"]

@btime matched = matches(names)
  320.235 ns (2 allocations: 128 bytes)
  2-element Array{Array{String,1},1}:
  ["John Doe", "Jon Doe"]
  ["Cosmo Kramer", "Kosmo Cramer"]
julia> @btime matched = matches(names)
  269.757 ns (2 allocations: 128 bytes)
2-element Array{Array{String,1},1}:
 ["John Doe", "Jon Doe"]         
 ["Cosmo Kramer", "Kosmo Cramer"]

julia> names
1-element Array{String,1}:
 "Elmer Fudd"

Rings a bell? :slight_smile:
Your function deletes the matches from the original names array, so every benchmark run after the first one has nothing left to do, which is why it looks so fast.
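
If you want to benchmark a function that mutates its input, give it a fresh copy for every run, e.g. something like:

using BenchmarkTools

# setup runs outside the timed region; evals = 1 makes each sample a single
# evaluation, so every run sees an untouched copy of names
@btime matches(nms) setup = (nms = copy(names)) evals = 1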

I generally prefer a functional approach, e.g. something like

using StringDistances

function filter_matches(str, other_strs)
    filter(x -> compare(Jaro(), x, str) < 0.8, other_strs)
end

function remove_duplicates(strs)
    i = 1
    while i < length(strs)
        strs = vcat(@view(strs[1:i]),
                    filter_matches(strs[i], @view(strs[(i + 1):end])))
        i += 1
    end
    strs
end

Which runs as


julia> names = ["John Doe", "Bugs Bunny", "Cosmo Kramer",
       "Jon Doe", "Elmer Fudd", "Kosmo Cramer"]
6-element Array{String,1}:
 "John Doe"    
 "Bugs Bunny"  
 "Cosmo Kramer"
 "Jon Doe"     
 "Elmer Fudd"  
 "Kosmo Cramer"

julia> remove_duplicates(names)
4-element Array{String,1}:
 "John Doe"    
 "Bugs Bunny"  
 "Cosmo Kramer"
 "Elmer Fudd"  

I would not worry about performance too much; I think your bottleneck is compare, not the vector operations.


Thanks, Tamas! I was actually just reading in the McNicholas/Tait book (Data Science with Julia) that splitting up complex/repeated computations into separate functions can result in significant performance gains. Your approach separates the filter/compare piece from the rest, so I’ll give it a shot and see if it’s more performant!

I’m trying to optimize for performance as much as possible because when passing in an array that has hundreds of thousands of items, it is very slow (takes several minutes to complete). I’m going to look into the compare function from StringDistances.jl and see if maybe there are some optimizations there that could speed things up for me.
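
(For reference: comparing every string to every other string is n(n-1)/2 pairs, so 100,000 names already means roughly 5 billion calls to compare, which is why the total runtime blows up so quickly.)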

Thanks again.


I think Tamas’ point is exactly right: compare will be your issue here. I asked a similar question a while ago and got similar advice, which turned out to be right.

What that means, I guess, is that it can pay to think a bit more about how to minimize the number of comparisons you have to do: e.g. normalise strings (lowercase, remove spaces) and remove exact duplicates first, remove fuzzy matches from the pool immediately so they don’t get compared again, and whatever other tricks are specific to your problem!
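
Just to illustrate the exact-duplicate part, a rough sketch (normalize_name and drop_exact_duplicates are made-up helper names, and the normalisation rule is something you'd adapt to your data):

# normalise: lowercase and strip spaces so "John Doe" and "john doe" collapse to one key
normalize_name(s) = replace(lowercase(s), " " => "")

function drop_exact_duplicates(strs)
    seen = Set{String}()
    kept = String[]
    for s in strs
        key = normalize_name(s)
        key in seen && continue   # exact (normalised) duplicate, no need to fuzzy-compare it later
        push!(seen, key)
        push!(kept, s)
    end
    kept
end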


Great advice. I am currently implementing a check that skips the comparison when one string is significantly longer than the other. For example, “JR Smith” won’t be compared to “Jonathan Reed Smithsonian” because, with such a large difference in length, the result of compare won’t meet the threshold I’m using anyway.
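
Roughly along these lines (maybe_match and the 0.5 length ratio are just placeholders that I’d tune against real data):

# heuristic pre-check: skip the expensive Jaro comparison when the lengths
# are so different that the score is very unlikely to reach the threshold
function maybe_match(a::AbstractString, b::AbstractString; threshold = 0.8)
    min(length(a), length(b)) / max(length(a), length(b)) < 0.5 && return false
    return compare(Jaro(), a, b) > threshold
end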

The very first thing I would do before any of this is some profiling on the actual data. Unfortunately, the worst case for your problem is O(n^2) comparisons.
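
E.g. with the built-in profiler (sample_names here stands for a representative slice of your real data):

using Profile

Profile.clear()
@profile matches(copy(sample_names))   # run on a realistic sample
Profile.print()                        # check whether the time really goes into compare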

Yes, this could help if there are a lot of duplicates.

Also, investing in a string comparison function that can bail out early if the difference metric is guaranteed to be above some threshold could be worthwhile.

This is a really nice idea… I’m not sure I have the ability to execute it, but I’m going to take a look and maybe see if the author of StringDistances.jl is interested in adding this functionality. I think when people compute string distances for ranking, they want the actual distance each time, but if the distances are computed with a threshold in mind (for filtering rather than ranking), being able to stop the computation as soon as it’s clear the threshold can’t be met would be awesome.