Why do these two functions benchmark the same?

nilshg · August 2, 2018, 4:50pm

I’m running some string comparisons using StringDistances and have written some simple convenience functions that loosely resemble the extract function in the Python fuzzywuzzy package. Essentially, they compare a string to a list of strings and return either the best match alongside its similarity score, or the n best matches and their similarity scores.

My question is: why does the function that returns the n best matches and their scores take the same time as the function that returns only one match and score? I’m aware that the compiler does some pretty smart optimisations, but I find it hard to believe that all of the additional work for returning multiple results is free, so I must be missing something.

Code below (results obtained on JuliaBox):

using StringDistances, BenchmarkTools

master = [lowercase.(randstring(8)) for i = 1:1200]
to_match = [lowercase.(randstring(8)) for i = 1:1000];

function get_matches(s::String, masterlist::Vector{String}; criterion=Levenshtein(), matches=3)
    closeness = [compare(criterion, s, masterlist[i]) for i = 1:length(masterlist)]
    sort_close = sort(closeness, rev=true)
    ret = Array{Tuple{String,Float64},1}(matches)
    @inbounds for i ∈ 1:matches
        ret[i] = (masterlist[find(closeness .== sort_close[i])[1]], sort_close[i])
    end
    return ret
end   

function get_one_match(s::String, masterlist::Vector{String}; criterion=Levenshtein())
    closeness = [compare(criterion, s, masterlist[i]) for i = 1:length(masterlist)]
    return (masterlist[indmax(closeness)], maximum(closeness))
end   

@btime get_one_match(to_match[1], master, criterion=TokenMax(RatcliffObershelp()))
# 9.063 ms (92513 allocations: 6.79 MiB)
# ("ksifwytx", 0.5)

@btime get_matches(to_match[1], master, criterion=TokenMax(RatcliffObershelp()))
# 9.031 ms (92542 allocations: 6.82 MiB)
# ("ksifwytx", 0.5)  
# ("cpyztasz", 0.375)
# ("cpyztasz", 0.375)

rdeits · August 2, 2018, 5:30pm

It’s probably because you’re still doing compare(...) for each element in the list and that’s completely dominating the runtime. How long does a single compare(...) call take?
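
One way to check this is to time the scoring pass separately from the selection step. The sketch below uses only Base: `toy_similarity` is a hypothetical stand-in for `compare(criterion, a, b)` (the real StringDistances measures are far more expensive), and `score_all`, `pick_one`, and `pick_top` are illustrative names, not part of any package. `@elapsed` gives a single noisy reading, so `@btime` from BenchmarkTools would be the better tool for real numbers.

```julia
# Hypothetical stand-in for compare(criterion, a, b):
# fraction of positions at which the two strings hold the same character.
toy_similarity(a, b) =
    count(p -> p[1] == p[2], zip(a, b)) / max(length(a), length(b), 1)

# The scoring pass is shared by both functions in the question.
score_all(s, list) = [toy_similarity(s, m) for m in list]

# The two selection strategies: best match only vs. the n best matches.
pick_one(scores) = argmax(scores)
pick_top(scores, n) = partialsortperm(scores, 1:n, rev=true)

master = [String(rand('a':'z', 8)) for _ in 1:1200]
s = String(rand('a':'z', 8))

# Run everything once so compilation is not included in the timings.
scores = score_all(s, master)
pick_one(scores)
pick_top(scores, 3)

t_score = @elapsed score_all(s, master)
t_one   = @elapsed pick_one(scores)
t_top   = @elapsed pick_top(scores, 3)

println("scoring:      ", t_score, " s")
println("pick one:     ", t_one, " s")
println("pick top 3:   ", t_top, " s")
```

If scoring dominates, `t_score` should be much larger than both selection timings, and the difference between picking one match and picking three disappears into the noise.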

Juser · August 3, 2018, 2:58am

I think what @rdeits said is right. Also, in case you’re asking this question because you want to improve the speed of your code, you could probably substantially improve performance if you just kept track of the current 3 best match scores (and their indices). Moreover, you could stop comparing s to a string as soon as its score is worse than the 3rd best. This would:

(1) Save time in completing the precise comparison for very different strings.
(2) Avoid the need to sort.
(3) Avoid a bunch of memory allocation.
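
A minimal sketch of the "keep the current n best" idea, covering points (2) and (3) only. It assumes the scoring function is passed in as an argument; `top_matches` and `toy_similarity` are hypothetical names used for illustration, and `toy_similarity` merely stands in for `compare(criterion, a, b)`. Stopping a comparison early, as in point (1), would need a distance that accepts a cutoff, which is not shown here.

```julia
# Hypothetical stand-in for compare(criterion, a, b):
# fraction of positions at which the two strings hold the same character.
toy_similarity(a, b) =
    count(p -> p[1] == p[2], zip(a, b)) / max(length(a), length(b), 1)

# Return the n best (candidate, score) pairs in a single pass,
# without building or sorting the full score vector.
function top_matches(score, s, candidates, n=3)
    best = Tuple{Float64,Int}[]          # (score, index), kept sorted best-first
    for (i, c) in enumerate(candidates)
        sc = Float64(score(s, c))
        # Only touch the list if it is not full yet or sc beats the current worst.
        if length(best) < n || sc > best[end][1]
            # First slot whose score is strictly lower; ties keep the earlier candidate.
            pos = something(findfirst(b -> sc > b[1], best), length(best) + 1)
            insert!(best, pos, (sc, i))
            length(best) > n && pop!(best)   # drop the entry that fell out of the top n
        end
    end
    return [(candidates[i], sc) for (sc, i) in best]
end

words = ["abcd", "abcx", "abxx", "axxx", "xxxx"]
res = top_matches(toy_similarity, "abcd", words, 3)
println(res)
```

Because each candidate is tracked by its index, two candidates with the same score show up as two separate entries, rather than the first one being returned twice as can happen with a first-index lookup on tied scores.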
