How can I do arithmetic with Char objects?

jcbritobr · October 23, 2020, 8:23pm

Hi, I’m trying to subtract two chars, but I’m getting this error: ERROR: InexactError: trunc(UInt32, -65)
Could someone please help?

function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}
    for c in data
        println(typeof(c))
        ch = Char(c - 'a') # this fails
        println(typeof(ch))
        if ch >= 'a' && ch <= 'z'
            if haskey(frequencies, ch)
                frequencies[ch] += 1
                num_count += 1
            else
                frequencies[ch] = 0
            end
        end
    end
    frequencies / num_count
end

simeonschaub · October 23, 2020, 8:30pm

Character code points are never negative, but c - 'a' can be negative if c is a Char whose code point is less than that of 'a', for example if c == ':':

julia> ':' - 'a'
-39

The best way around this is probably just to work with regular signed integers instead of converting the result of this subtraction back to a Char.
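
As a minimal sketch of that suggestion: subtracting two Char values gives a plain Int, and only the conversion back to Char can fail. The helper name is_lowercase_ascii is hypothetical and is used here only for illustration.

```julia
# Subtracting two Chars yields a signed integer offset, which may be negative.
offset = ':' - 'a'          # code point 58 minus code point 97

# Converting a negative integer to Char throws, because code points are unsigned.
threw_inexact = try
    Char(offset)
    false
catch err
    err isa InexactError
end

# Staying with integers avoids the conversion entirely.
# (hypothetical helper: true only for the 26 ASCII lowercase letters)
is_lowercase_ascii(c::Char) = 0 <= c - 'a' < 26

println(offset)
println(threw_inexact)
println(is_lowercase_ascii('k'), " ", is_lowercase_ascii(':'))
```

The range check works on the offset itself, so no value ever has to be turned back into a Char.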

jcbritobr · October 23, 2020, 8:35pm

I need to work with Char because I have a dictionary frequencies = Dict{Char, Int64} in which I will aggregate the character frequencies. I can’t believe this can’t be done.

jcbritobr · October 23, 2020, 8:40pm

I changed the code to use the Char c directly, but now I can’t even check whether the entry exists in the dictionary:
no method matching haskey(::Type{Dict{Char,Int64}}, ::Char)

juliohm · October 23, 2020, 8:42pm

@jcbritobr if you are building some text processing pipeline, consider checking the projects in the JuliaText organization: https://github.com/JuliaText

I will let people with more experience in text processing address the specific issue you are having with low level char operations.

jcbritobr · October 23, 2020, 8:44pm

Thank you @juliohm. I’m Julio too.

jcbritobr · October 23, 2020, 9:02pm

I found one of the issues. The dictionary was created incorrectly: I wrote only the type and never called its constructor, so it was not initialized. That caused this error:

no method matching haskey(::Type{Dict{Char,Int64}}, ::Char)

This fixes it:

frequencies = Dict{Char, Int64}()
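
The difference between the type and an instance can be sketched as follows; the variable names T and d are arbitrary and chosen only for this illustration.

```julia
T = Dict{Char, Int64}      # a type; no dictionary exists yet
d = Dict{Char, Int64}()    # calling the type constructs an empty dictionary

println(T isa DataType)    # the type itself is a DataType value
println(d isa T)           # d is an instance of that type
println(haskey(d, 'a'))    # haskey is defined for dictionary instances

# haskey(T, 'a') has no matching method, which is the error quoted above.
method_missing = !hasmethod(haskey, Tuple{Type{Dict{Char, Int64}}, Char})
println(method_missing)
```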

Jeff_Emanuel · October 23, 2020, 9:29pm

Do you really want to do this

ch = Char(c - 'a')

when you test like this?

if ch >= 'a' && ch <= 'z'

jcbritobr · October 23, 2020, 9:40pm

Yep, you have a point. The logic was wrong, but the issue was the conversion. I was fixing the code just now and got it working.

By subtracting c - 'a' I was computing an index, but I have a dictionary, so the Char itself is just what I need as the key.
Then I map over the values, dividing each by num_count, to get the frequency.

Thank you all for the help.

function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}()
    for c in data
        if c >= 'a' && c <= 'z'
            if haskey(frequencies, c)
                frequencies[c] += 1
                num_count += 1
            else
                frequencies[c] = 0
            end
        end
    end
    map!(x -> floor(x / num_count), values(frequencies))
    frequencies
end

jcbritobr · October 23, 2020, 9:42pm

Julia is not so easy. We must pay attention. There are a lot of tricks, and I’m a beginner. I just started yesterday.

Jeff_Emanuel · October 23, 2020, 9:49pm

Do you really want to initialize the count to zero on the first occurrence? You also miss incrementing num_count on first occurrences.

Those logic errors are not inherent to Julia.

jcbritobr · October 23, 2020, 10:19pm

It’s working. The code for the first occurrence just initializes the key for the letter. It must not increment. Here are my results

Henrique_Becker · October 23, 2020, 10:49pm

I think what @Jeff_Emanuel meant was that

frequencies[c] = 0

should be

frequencies[c] = 1

instead.

Otherwise, what you are storing is not the number of occurrences but one less than the number of occurrences. A letter that only appears a single time will have value zero.
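
One possible way to avoid the off-by-one is sketched below, assuming the goal is relative frequencies of the letters 'a' to 'z'. The function name letter_frequencies is hypothetical, and using get with a default of 0 is a swapped-in alternative to the haskey branch in the thread, not the original poster's code.

```julia
# Hypothetical rewrite: count every occurrence, including the first one.
function letter_frequencies(data::AbstractString)
    counts = Dict{Char, Int}()
    total = 0
    for c in lowercase(data)
        if 'a' <= c <= 'z'
            # get returns 0 for a missing key, so the first occurrence stores 1
            counts[c] = get(counts, c, 0) + 1
            total += 1
        end
    end
    # Build a new dictionary of Float64 shares instead of overwriting Int counts
    return Dict(c => n / total for (c, n) in counts)
end

freqs = letter_frequencies("Abba!")
println(freqs)
```

Returning a separate dictionary keeps the integer counts and the fractional frequencies in containers of matching value types.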

jcbritobr · October 23, 2020, 10:55pm

Yes, that makes sense. You are both right. Here is the fixed plot.
Because I’m a beginner, I can’t post any more today.

StefanKarpinski · October 23, 2020, 11:05pm

Just to make sure it’s clear in the end, when you do c - 'a' you get an offset of the character c from 'a': this is logically an integer, not a character, because it is an offset. You can convert that offset back to a character, but that gives you the character whose Unicode code point is that offset — which doesn’t make a whole lot of sense. For example:

julia> c = 'k'
'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)

julia> c - 'a'
10
# 'k' is 10 chars after 'a'

julia> Char(c - 'a')
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
# newline has code point 10

You can use these offsets as keys if you want to: then your keys for English letters will be 0 through 25. But in that case your dict should have type Dict{Int, Int} instead of Dict{Char, Int}. This would also allow using a much more efficient data structure for counting frequencies, like fill(0, 26) which is a Vector{Int}. However, in all of these cases, when you translate the data back to characters, you’ll have to do 'a' + offset to get a character back from an offset.
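
A rough sketch of the Vector{Int} variant described above, assuming only the letters 'a' to 'z' are counted. The function name count_letters is hypothetical; the + 1 is needed because Julia arrays are 1-based while the offsets run from 0 to 25.

```julia
# Hypothetical helper: count lowercase ASCII letters in a 26-element vector.
function count_letters(data::AbstractString)
    counts = fill(0, 26)
    for c in lowercase(data)
        if 'a' <= c <= 'z'
            counts[(c - 'a') + 1] += 1   # offset 0..25 becomes index 1..26
        end
    end
    return counts
end

counts = count_letters("Hello")

# Translate indices back to characters with 'a' + offset.
for (i, n) in enumerate(counts)
    n > 0 && println('a' + (i - 1), " => ", n)
end
```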

The other approach is what you chose yourself, which is to just use Char values as keys and not try to subtract 'a' from them. This has the advantage that it is simpler and will handle absolutely any character value at all if you decide to stop filtering the characters.

stevengj · October 24, 2020, 2:58am

If you are comparing letter frequencies for English text to those of a non-English language like Portuguese, you probably want to strip diacritical marks from the data:

julia> data = "Portugal, oficialmente República Portuguesa, é um país soberano unitário localizado no sudoeste da Europa, cujo território se situa na zona ocidental da Península Ibérica e em arquipélagos no Atlântico Norte.";

julia> using Unicode

julia> Unicode.normalize(data, stripmark=true, casefold=true)
"portugal, oficialmente republica portuguesa, e um pais soberano unitario localizado no sudoeste da europa, cujo territorio se situa na zona ocidental da peninsula iberica e em arquipelagos no atlantico norte."

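To illustrate why this matters for an 'a' to 'z' filter, here is a small sketch. It assumes the accented letter is stored as the single precomposed code point U+00ED, written as an explicit escape so the input is unambiguous; the variable names are arbitrary.

```julia
using Unicode

word = "Pen\u00ednsula"   # "Península" with a precomposed í (U+00ED)
plain = Unicode.normalize(word, stripmark=true, casefold=true)

# Count how many characters survive an ASCII a-z filter in each version.
is_az(c) = 'a' <= c <= 'z'
kept_raw = count(is_az, lowercase(word))
kept_plain = count(is_az, plain)

println(plain)
println(kept_raw, " vs ", kept_plain)
```

Without normalization the accented letter falls outside the range and is silently dropped from the counts.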