Hi, I’m trying to subtract two chars, but I’m getting this error: ERROR: InexactError: trunc(UInt32, -65)
Could someone please help?
function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}
    for c in data
        println(typeof(c))
        ch = Char(c - 'a') # this fails
        println(typeof(ch))
        if ch >= 'a' && ch <= 'z'
            if haskey(frequencies, ch)
                frequencies[ch] += 1
                num_count += 1
            else
                frequencies[ch] = 0
            end
        end
    end
    frequencies / num_count
end
Character code points are never negative, but c - 'a' is negative whenever c is a Char whose code point is less than that of 'a', for example if c == ':':
julia> ':' - 'a'
-39
The best way around this is probably just to work with regular signed integers instead of converting the result of this subtraction back to a Char.
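A minimal sketch of that suggestion, assuming only lowercase ASCII letters matter: keep the offset as an Int and range-check it instead of wrapping it in Char. The helper name letter_offset is made up for illustration. Note that -65 is exactly 32 - 97, so the character that triggered the error was most likely a space.

```julia
# Hypothetical helper: offset of c from 'a' as a plain Int,
# or nothing when c is not a lowercase ASCII letter.
function letter_offset(c::Char)
    offset = c - 'a'          # Char - Char yields an Int, which may be negative
    return 0 <= offset <= 25 ? offset : nothing
end

letter_offset('k')   # 10
letter_offset(' ')   # nothing (the raw offset is 32 - 97 = -65)
```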
I need to work with Char because I have a dictionary, frequencies = Dict{Char, Int64}, where I aggregate the character frequencies. I can’t believe this can’t be done.
I changed the code to use c directly as the Char, but now I can’t even check whether an entry exists in the dictionary:
no method matching haskey(::Type{Dict{Char,Int64}}, ::Char)
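The haskey error above says that its first argument is a Type rather than a dictionary: Dict{Char, Int64} without parentheses names the type, while Dict{Char, Int64}() constructs an empty instance. A small sketch of the difference (the variable names T and d are only for illustration):

```julia
T = Dict{Char, Int64}      # the type itself, not a dictionary
d = Dict{Char, Int64}()    # an empty dictionary instance

T isa DataType             # true
d isa Dict{Char, Int64}    # true

d['a'] = 1
haskey(d, 'a')             # true
haskey(d, 'b')             # false
```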
@jcbritobr if you are building some text processing pipeline, consider checking the projects in the JuliaText organization: https://github.com/JuliaText
I will let people with more experience in text processing address the specific issue you are having with low level char operations.
Yep, you have a point. The logic was wrong, but the issue that stopped me was the conversion. I have just fixed the code and got it working.
By subtracting c - 'a' I was computing an index, but I have a dictionary, so the char itself is the key I need.
I then map over the result, dividing all values by num_count, to get the frequency.
Thank you all for the help.
function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}()
    for c in data
        if c >= 'a' && c <= 'z'
            if haskey(frequencies, c)
                frequencies[c] += 1
                num_count += 1
            else
                frequencies[c] = 0
            end
        end
    end
    map!(x -> floor(x / num_count), values(frequencies))
    frequencies
end
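In the else branch you probably want frequencies[c] = 1 rather than 0, and to increment num_count there as well. Otherwise, what you are storing is not the number of occurrences but one less than the number of occurrences. A letter that only appears a single time will have value zero.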
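One way to avoid that off-by-one is to use get with a default of 0, so the first occurrence of a letter is counted as 1. The sketch below is not the original poster's function: the name letter_frequencies is hypothetical, and it returns Float64 ratios instead of applying floor, because floor(x / num_count) rounds every ratio below 1 down to 0.

```julia
# Sketch: count lowercase ASCII letters and return relative frequencies.
function letter_frequencies(data::AbstractString)
    counts = Dict{Char, Int}()
    total = 0
    for c in lowercase(data)
        if 'a' <= c <= 'z'
            counts[c] = get(counts, c, 0) + 1   # first occurrence becomes 1
            total += 1
        end
    end
    # Float64 values, so the ratios are not rounded down to zero
    return Dict{Char, Float64}(c => n / total for (c, n) in counts)
end

freqs = letter_frequencies("Hello, world")
freqs['l']   # 0.3 (3 of the 10 letters)
```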
Just to make sure it’s clear in the end, when you do c - 'a' you get an offset of the character c from 'a': this is logically an integer, not a character, because it is an offset. You can convert that offset back to a character, but that gives you the character whose Unicode code point is that offset — which doesn’t make a whole lot of sense. For example:
julia> c = 'k'
'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
julia> c - 'a'
10
# 'k' is 10 chars after 'a'
julia> Char(c - 'a')
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
# newline has code point 10
You can use these offsets as keys if you want to: then your keys for English letters will be 0 through 25. But in that case your dict should have type Dict{Int, Int} instead of Dict{Char, Int}. This would also allow using a much more efficient data structure for counting frequencies, like fill(0, 26) which is a Vector{Int}. However, in all of these cases, when you translate the data back to characters, you’ll have to do 'a' + offset to get a character back from an offset.
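As a rough sketch of the Vector{Int} variant, assuming only the letters a through z are counted: since Julia arrays are 1-based, the offset 0 through 25 is stored at index offset + 1. The names letter_counts and nonzero are made up for this example.

```julia
# Sketch: counts[i] holds the count for the letter at offset i - 1 from 'a'.
function letter_counts(data::AbstractString)
    counts = fill(0, 26)              # Vector{Int}, one slot per letter a..z
    for c in lowercase(data)
        offset = c - 'a'              # Int offset, 0 for 'a' through 25 for 'z'
        if 0 <= offset <= 25
            counts[offset + 1] += 1   # shift by one for 1-based indexing
        end
    end
    return counts
end

counts = letter_counts("banana")

# Translate offsets back to characters with 'a' + offset
nonzero = [('a' + (i - 1)) => n for (i, n) in enumerate(counts) if n > 0]
# 'a' => 3, 'b' => 1, 'n' => 2
```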
The other approach is what you chose yourself, which is to just use Char values as keys and not try to subtract 'a' from them. This has the advantage that it is simpler and will handle absolutely any character value at all if you decide to stop filtering the characters.
If you are comparing letter frequencies for English text to a non-English language like Portuguese, you probably want to strip diacritical marks from the data:
julia> data = "Portugal, oficialmente República Portuguesa, é um país soberano unitário localizado no sudoeste da Europa, cujo território se situa na zona ocidental da Península Ibérica e em arquipélagos no Atlântico Norte.";
julia> using Unicode
julia> Unicode.normalize(data, stripmark=true, casefold=true)
"portugal, oficialmente republica portuguesa, e um pais soberano unitario localizado no sudoeste da europa, cujo territorio se situa na zona ocidental da peninsula iberica e em arquipelagos no atlantico norte."
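To illustrate why this matters for an 'a' to 'z' filter, here is a small sketch: an accented letter falls outside that range and is silently dropped unless the marks are stripped first. The variable names are illustrative, and the word is written with an escape so that 'ú' is definitely the single precomposed code point U+00FA.

```julia
using Unicode

word = "rep\u00fablica"    # "república" with a precomposed 'ú'

is_ascii_lower(c) = 'a' <= c <= 'z'

n_raw = count(is_ascii_lower, word)                                        # 8: 'ú' is skipped
n_stripped = count(is_ascii_lower, Unicode.normalize(word, stripmark=true)) # 9: 'ú' became 'u'
```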