Can't figure out why this code isn't working

mthelm85 · September 21, 2021, 7:24pm

using CSV
using StringDistances
using Unicode

data = CSV.File(raw"C:\Users\mthel\Julia\src_data\data.csv", normalizenames=true)

data = clean_strings(data[i][2] for i in 1:200)

function build_dict(data)
    dict = Dict(data[1] => [])

    for i in 2:length(data), (key, value) in dict
        if compare(data[i], key, Levenshtein()) > 0.7
            push!(value, data[i])
        else
            dict[data[i]] = []
        end
    end
    
    dict
end

dict = build_dict(data)

When I examine the keys of dict, I see that the following strings have ended up as keys:

TRAFFICCONTROLTECH
TRAFFICCONTROLTEHC
TRAFFICCONTROTECH

but the Levenshtein similarity scores are all higher than the specified threshold of 0.7 (the if condition inside build_dict, line 13 of the code above).

julia> compare("TRAFFICCONTROLTECH", "TRAFFICCONTROLTEHC", Levenshtein())
0.8888888888888888

julia> compare("TRAFFICCONTROLTECH", "TRAFFICCONTROTECH", Levenshtein())
0.9444444444444444

Only one of these should have ended up as a key and the other two should have been detected in the if clause and subsequently pushed to the value associated with the existing key. I’m stumped as to why this is happening. There has to be a silly mistake somewhere that I’m missing…

cjdoris · September 21, 2021, 7:36pm

The main error is that you’re adding a string as a new key whenever its similarity to at least one existing key is 0.7 or less, whereas I assume you intend to add it only when its similarity to every existing key is 0.7 or less.

Also you shouldn’t modify a dict while iterating over it.
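To make the difference concrete, here is a minimal sketch of the two conditions for a single new string. The first score is the one reported in the question; the second score (0.1, against some unrelated key) is made up for illustration, and the variable names are hypothetical.

```julia
# Similarity scores of one new string against each key already in the dict.
# 0.8888888888888888 is the score reported in the question; 0.1 is a made-up
# score against a hypothetical unrelated key.
scores = [0.8888888888888888, 0.1]
threshold = 0.7

# Original loop: the else branch runs for any key that is not similar enough,
# so one dissimilar key is enough to create a new entry.
becomes_key_original = any(s -> s <= threshold, scores)

# Presumed intent: create a new entry only when no existing key is similar enough.
becomes_key_intended = all(s -> s <= threshold, scores)

println((becomes_key_original, becomes_key_intended))  # prints (true, false)
```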

oheil · September 21, 2021, 7:36pm

You compare every data[i] to every (key, value) pair, so there is almost always at least one key with a similarity below 0.7, and the else branch runs for it.

(ninjad)

mthelm85 · September 21, 2021, 7:39pm

Which data structure should I use for this type of operation?

oheil · September 21, 2021, 7:48pm

In this case you can stay with a dict as you don’t need to modify it while iterating:

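    for i in 2:length(data)
        # Key of the first sufficiently similar entry, or nothing if there is none
        matchkey = nothing
        for key in keys(dict)
            if compare(data[i], key, Levenshtein()) > 0.7
                matchkey = key
                break
            end
        end
        if matchkey !== nothing
            # Similar to an existing key: file it under that key
            push!(dict[matchkey], data[i])
        else
            # Not similar to any key: start a new entry (after the inner loop has finished)
            dict[data[i]] = []
        end
    end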

I haven’t tested it, but I hope it works like that.
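For anyone who wants to try the idea without the CSV file, here is a self-contained sketch. It is only an illustration under some assumptions: `similarity` is a hand-written stand-in for `compare(a, b, Levenshtein())` (1 minus the edit distance divided by the longer length), and `group_similar` is a hypothetical name, not something from the original code.

```julia
# Hypothetical stand-in for compare(a, b, Levenshtein()):
# 1 - (edit distance) / (length of the longer string).
function similarity(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    m, n = length(x), length(y)
    max(m, n) == 0 && return 1.0
    prev = collect(0:n)              # distances from the empty prefix of x
    for i in 1:m
        curr = similar(prev)
        curr[1] = i
        for j in 1:n
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return 1 - prev[end] / max(m, n)
end

# Group each string under the first existing key it is similar to;
# strings that match no key become new keys.
function group_similar(data; threshold = 0.7)
    groups = Dict{String,Vector{String}}()
    for s in data
        found = nothing
        for key in keys(groups)
            if similarity(s, key) > threshold
                found = key
                break
            end
        end
        if found === nothing
            groups[s] = String[]     # inserted only after the inner loop ends
        else
            push!(groups[found], s)
        end
    end
    return groups
end

groups = group_similar(["TRAFFICCONTROLTECH", "TRAFFICCONTROLTEHC", "TRAFFICCONTROTECH"])
println(groups)
```

With these three strings only TRAFFICCONTROLTECH ends up as a key and the other two are stored in its vector. Note that a Dict has no guaranteed iteration order, so if a string is similar to several keys, which one it lands under is not deterministic.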
