The very first thing I would do, before any of this, is profile on the actual data. Unfortunately, the worst case for your problem is O(n^2).
Yes, this could help if there are a lot of duplicates.
Also, it could be worthwhile to invest in a string comparison function that bails out early once the difference metric is guaranteed to exceed some threshold.
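As a rough illustration of the early bail-out idea, here is a minimal sketch that assumes the difference metric is Levenshtein edit distance (your metric may differ). The function name `bounded_levenshtein` and the `max_dist` parameter are made up for this example. It relies on two facts: the distance is at least the difference in lengths, and the smallest value in a row of the dynamic programming table never decreases from one row to the next, so once a whole row exceeds the threshold the final result must too.

```python
def bounded_levenshtein(a, b, max_dist):
    """Return the edit distance between a and b, or None if it exceeds max_dist.

    Hypothetical helper for illustration; assumes plain Levenshtein distance.
    """
    # The distance can never be smaller than the difference in lengths.
    if abs(len(a) - len(b)) > max_dist:
        return None

    # Keep the shorter string in the inner loop so rows stay small.
    if len(a) < len(b):
        a, b = b, a

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        # Row minima never decrease, so if every entry is already over the
        # threshold the final distance will be as well: stop here.
        if min(current) > max_dist:
            return None
        previous = current

    return previous[-1] if previous[-1] <= max_dist else None
```

With a small threshold, most clearly dissimilar pairs are rejected by the length check or after a few rows, which is where the savings would come from. Whether that matters in practice is exactly what the profiling should tell you.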