Can't figure out why this code isn't working

using CSV
using StringDistances
using Unicode

data = CSV.File(raw"C:\Users\mthel\Julia\src_data\data.csv", normalizenames=true)

data = clean_strings(data[i][2] for i in 1:200)

function build_dict(data)
    dict = Dict(data[1] => [])

    for i in 2:length(data), (key, value) in dict
        if compare(data[i], key, Levenshtein()) > 0.7
            push!(value, data[i])
        else
            dict[data[i]] = []
        end
    end
    
    dict
end

dict = build_dict(data)

When I examine the keys of dict, I see that the following strings have ended up as keys:

TRAFFICCONTROLTECH
TRAFFICCONTROLTEHC
TRAFFICCONTROTECH

but the Levenshtein similarity scores are all higher than the specified threshold of 0.7 (on line 13).

julia> compare("TRAFFICCONTROLTECH", "TRAFFICCONTROLTEHC", Levenshtein())
0.8888888888888888

julia> compare("TRAFFICCONTROLTECH", "TRAFFICCONTROTECH", Levenshtein())
0.9444444444444444

Only one of these should have ended up as a key and the other two should have been detected in the if clause and subsequently pushed to the value associated with the existing key. I’m stumped as to why this is happening. There has to be a silly mistake somewhere that I’m missing…

The main error is that you’re adding a string as a new key whenever its similarity to at least one existing key is 0.7 or less, whereas I assume you intend to add it only when that holds for all existing keys.

Also you shouldn’t modify a dict while iterating over it.
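
To illustrate the point about mutating during iteration, here is a minimal sketch of one common workaround: loop over a snapshot of the keys taken with `collect(keys(...))` and insert into the dict inside that loop. The dict `counts` and its contents are made up for the example.

```julia
counts = Dict("a" => 1, "b" => 2)   # hypothetical data, for illustration only

# Risky pattern (not executed here): inserting into `counts` inside
# `for (k, v) in counts` can trigger a rehash, after which entries may be
# skipped or visited twice.

# Safer pattern: iterate over a snapshot of the keys, then insert freely.
for k in collect(keys(counts))
    counts[k * k] = counts[k] * 10   # adds "aa" and "bb" without disturbing the loop
end
```

The snapshot costs one extra array, which is usually negligible next to the string comparisons.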

You compare every data[i] to every (key, value) pair, so there is almost always at least one key with a score below 0.7, and the else branch runs for it.

(ninjad)

Which data structure should I use for this type of operation?

In this case you can stay with a dict as you don’t need to modify it while iterating:

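    for i in 2:length(data)
        matchkey = nothing  # first existing key that is similar enough, if any
        for key in keys(dict)
            if compare(data[i], key, Levenshtein()) > 0.7
                matchkey = key
                break
            end
        end
        if matchkey !== nothing
            push!(dict[matchkey], data[i])  # join the existing group
        else
            dict[data[i]] = []  # similar to no key: start a new group
        end
    end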

I haven’t tested this, but I hope it works like that.
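
For anyone who wants to try the idea without the CSV file, here is a self-contained sketch of the same decide-first, mutate-afterwards loop. Assumptions: `group_similar`, `lev_similarity`, and the extra string "PLUMBER" are hypothetical names and data that are not in the original post. `lev_similarity` is a hand-written stand-in for `compare(a, b, Levenshtein())` (1 minus the edit distance divided by the longer length, which matches the scores shown in the question), so in real code you would pass `similarity = (a, b) -> compare(a, b, Levenshtein())` instead.

```julia
# Stand-in for compare(a, b, Levenshtein()) so the sketch runs without packages:
# 1 - (edit distance) / (length of the longer string).
function lev_similarity(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    isempty(x) && isempty(y) && return 1.0
    prev = collect(0:length(y))          # distances from the empty prefix of `a`
    for i in eachindex(x)
        cur = similar(prev)
        cur[1] = i                       # delete the first i characters of `a`
        for j in eachindex(y)
            cost = x[i] == y[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1, cur[j] + 1, prev[j] + cost)
        end
        prev = cur
    end
    return 1 - prev[end] / max(length(x), length(y))
end

# Group strings under the first existing key they are similar to.
function group_similar(data; similarity = lev_similarity, threshold = 0.7)
    groups = Dict{String, Vector{String}}()
    for s in data
        hit = nothing                    # matching key, if one is found
        for key in keys(groups)          # read-only pass over the dict
            if similarity(s, key) > threshold
                hit = key
                break
            end
        end
        if hit === nothing
            groups[s] = String[]         # dissimilar to every key: new group
        else
            push!(groups[hit], s)        # mutate only after the search is done
        end
    end
    return groups
end

groups = group_similar(["TRAFFICCONTROLTECH", "TRAFFICCONTROLTEHC",
                        "TRAFFICCONTROTECH", "PLUMBER"])
```

One caveat: if a string is similar to several keys, which group it joins depends on the Dict's iteration order, so picking the best-scoring key may be preferable in that case.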
