I would like to compare each string in the array with every other string in the array and compute the distance between the two via the StringDistances.jl package (which will allow me to group similar strings together). For example, "John Doe" and "Jon Doe" are similar enough that I would want to remove both of those elements from the array (and store them somewhere else for later).
I wrote a function that achieves this, but I know that the amazing Julia community can figure out a way to make it more performant! My first attempt to accomplish the above uses recursion and looks like this:
names = ["John Doe", "Bugs Bunny", "Cosmo Kramer", "Jon Doe", "Elmer Fudd", "Kosmo Cramer"]
all_matches = Array
if length(arr) > 1
indexes = findall(x -> compare(Jaro(), x, arr) > 0.8, arr)
push!(all_matches, [arr[i] for i in indexes])
return filter(x -> length(x) > 1, all_matches)
matched = matches(names)
["John Doe", "Jon Doe"]
["Cosmo Kramer", "Kosmo Cramer"]
I don’t know much about Julia performance but I do know that there is usually a much faster way to do things relative to whatever it is that I usually come up with
I generally prefer a functional approach, eg something like
function filter_matches(str, other_strs)
filter(x -> compare(Jaro(), x, str) < 0.8, other_strs)
i = 1
while i < length(strs)
strs = vcat(@view(strs[1:i]),
filter_matches(strs[i], @view(strs[(i + 1):end])))
i += 1
Thanks, Tamas! I was actually just reading in the McNicholas/Tait book (Data Science with Julia) that splitting up complex/repeated computations into separate functions can result in significant performance gains. Your approach separates the filter/compare piece from the rest, so I’ll give it a shot and see if it’s more performant!
I’m trying to optimize for performance as much as possible because when passing in an array that has hundreds of thousands of items, it is very slow (takes several minutes to complete). I’m going to look into the compare function from StringDistances.jl and see if maybe there are some optimizations there that could speed things up for me.
I think Tamas’ point is exactly right, compare will be your issue here - I asked a similar question a while ago and the advice (rightly as it turned out) was a similar one.
What that means I guess is that it can pay to think a bit more about how you can minimize the number of comparisons you have to do, e.g. normalise strings (lowercase and remove spaces) and remove exact duplicates first, maybe remove duplicates immediately so that they don’t get compared again etc. and any other tricks which might be problem specific!
Great advice. I am currently implementing a check to see if one string is significantly longer than the other and, if this is the case, I will skip the comparison. For example, “JR Smith” won’t be compared to “Jonathan Reed Smithsonian” because, due to the significant difference in length, the result of the compare function won’t meet the threshold that I’m establishing anyways.
This is a really nice idea…I’m not sure I have the ability to execute it, but I’m going to take a look and maybe see if the author of StringDistances.jl is interested in adding this functionality. I think when people are computing string distances for purposes of ranking, they want to compute the actual distance each time, but if the distances are being computed with a threshold in mind (for filtering rather than ranking), the ability to stop the execution as soon as the threshold is met would be awesome.