How can I do arithmetic with Char objects?

Hi, I’m trying to subtract two chars, but I’m getting this error: ERROR: InexactError: trunc(UInt32, -65)
Could someone please help?

function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}
    for c in data
        println(typeof(c))
        ch =  Char(c - 'a') # this fails
        println(typeof(ch))
        if ch >= 'a' && ch <= 'z'
            if haskey(frequencies, ch)
                frequencies[ch] += 1
                num_count += 1
            else
                frequencies[ch] = 0
            end
        end
    end
    frequencies / num_count
end

Character code points are never negative, but c - 'a' is negative whenever c has a code point lower than that of 'a', for example if c == ':':

julia> ':' - 'a'
-39

The best way around this is probably just to work with regular signed integers instead of converting the result of this subtraction back to a Char.
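
As a minimal sketch of that suggestion (the variable names here are only illustrative): subtracting two Char values already yields a plain Int, and it is only the conversion back to Char that throws. The -65 in the original error message is what a space character would produce.

```julia
c = ' '                      # a space has code point 32
offset = c - 'a'             # Char - Char gives an Int: 32 - 97
println(offset)

# Converting a negative offset back to Char throws an InexactError
failed = try
    Char(offset)
    false
catch err
    err isa InexactError
end
println(failed)

# Plain integer code points are available via Int(c) if needed
println(Int(c), " ", Int('a'))
```

Keeping the offset as an Int, or skipping the subtraction entirely, avoids the error.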

I need to work with Char because I have a dictionary, frequencies = Dict{Char, Int64}, where I will aggregate the character frequencies. I can’t believe this can’t be done.

I changed the code to use the Char c directly, but now I can’t even check whether the entry exists in the dictionary:
no method matching haskey(::Type{Dict{Char,Int64}}, ::Char) :slightly_frowning_face:

@jcbritobr if you are building some text processing pipeline, consider checking the projects in the JuliaText organization: https://github.com/JuliaText

I will let people with more experience in text processing address the specific issue you are having with low-level char operations.

Thank you @juliohm. I’m Julio too. :slightly_smiling_face:

I found one of the issues. The dictionary was created incorrectly: I think it was never initialized. Calling the constructor with parentheses fixes this error:

no method matching haskey(::Type{Dict{Char,Int64}}, ::Char)

frequencies = Dict{Char, Int64}()
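
The difference between the two spellings, as a small sketch (T and d are illustrative names): without parentheses the name refers to the dictionary type itself, while the parentheses call the constructor and return an empty dictionary that haskey can work on.

```julia
T = Dict{Char, Int64}        # no parentheses: the type itself
d = Dict{Char, Int64}()      # with parentheses: an empty dictionary instance

println(T isa Type)          # the type is not a dictionary you can query
println(d isa Dict)          # the instance is
println(haskey(d, 'a'))      # works on an instance; nothing stored yet

# haskey(T, 'a') would raise the MethodError quoted above
```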

Do you really want to do this

ch = Char(c - 'a')

when you test like this?

if ch >= 'a' && ch <= 'z'

Yep, you have a point. The logic was wrong, but the error itself came from the conversion. I was fixing the code just now and got it working.

  1. By subtracting c - 'a' I was computing an index, but I have a dictionary, so the char itself is exactly the key I need.
  2. Map over the result, dividing all values by num_count, to get the frequencies.

Thank you all for the help.

function generate_frequencies(data::String)
    data = replace(strip(lowercase(data)), "\n" => "")
    num_count = 0
    frequencies = Dict{Char, Int64}()
    for c in data
        if c >= 'a' && c <= 'z'
            if haskey(frequencies, c)
                frequencies[c] += 1
                num_count += 1
            else
                frequencies[c] = 0
            end
        end
    end
    map!(x -> floor(x / num_count), values(frequencies))
    frequencies
end

Julia is not so easy. We must pay attention. There are a lot of tricks, and I’m a beginner. I just started yesterday.

Do you really want to initialize the count to zero on the first occurrence? You also miss incrementing num_count on first occurrences.

Those logic errors are not inherent to Julia.

It’s working. The code on the first occurrence just initializes the key for the letter; it must not increment. Here are my results:

[Image: plot of the results]

I think what @Jeff_Emanuel meant was that

frequencies[c] = 0

should be

frequencies[c] = 1

instead.

Otherwise, what you are storing is not the number of occurrences but one less than the number of occurrences. A letter that only appears a single time will have value zero.
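
A minimal sketch of the counting logic with that fix applied, assuming a hypothetical helper named letter_frequencies (it is not the function from the posts above). Here get(counts, c, 0) returns 0 for a letter not seen yet, so the first occurrence is stored as 1, and every occurrence is added to the total.

```julia
# Hypothetical helper: relative frequency of each letter 'a'..'z' in the text
function letter_frequencies(data::AbstractString)
    counts = Dict{Char, Int}()
    total = 0
    for c in lowercase(data)
        if 'a' <= c <= 'z'
            counts[c] = get(counts, c, 0) + 1   # first occurrence becomes 1
            total += 1                          # count every letter seen
        end
    end
    # Divide each count by the total; the result holds Float64 values
    return Dict{Char, Float64}(c => n / total for (c, n) in counts)
end

freqs = letter_frequencies("Abba!")
println(freqs)
```

Returning a new Dict{Char, Float64} avoids writing fractional values back into a dictionary whose values are integers.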

Yes, that makes sense. You are both right. :slightly_smiling_face: Here is the fixed plot.
Because I’m a beginner, I can’t make any more posts today :slightly_frowning_face:

Just to make sure it’s clear in the end, when you do c - 'a' you get an offset of the character c from 'a': this is logically an integer, not a character, because it is an offset. You can convert that offset back to a character, but that gives you the character whose Unicode code point is that offset — which doesn’t make a whole lot of sense. For example:

julia> c = 'k'
'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)

julia> c - 'a'
10
# 'k' is 10 chars after 'a'

julia> Char(c - 'a')
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
# newline has code point 10

You can use these offsets as keys if you want to: then your keys for English letters will be 0 through 25. But in that case your dict should have type Dict{Int, Int} instead of Dict{Char, Int}. This would also allow using a much more efficient data structure for counting frequencies, like fill(0, 26) which is a Vector{Int}. However, in all of these cases, when you translate the data back to characters, you’ll have to do 'a' + offset to get a character back from an offset.
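
A rough sketch of the vector-based variant, assuming a hypothetical helper named letter_counts: since Julia arrays are 1-based, the offset 0 through 25 is shifted by one when indexing, and shifted back when translating an index to a letter.

```julia
# Hypothetical helper: counts of 'a'..'z' stored in a 26-element vector
function letter_counts(data::AbstractString)
    counts = zeros(Int, 26)              # slot 1 is 'a', slot 26 is 'z'
    for c in lowercase(data)
        if 'a' <= c <= 'z'
            counts[(c - 'a') + 1] += 1   # offset 0..25 becomes index 1..26
        end
    end
    return counts
end

counts = letter_counts("Hello, world")

# Translate indices back to characters with 'a' + offset
for (i, n) in enumerate(counts)
    n > 0 && println('a' + (i - 1), " => ", n)
end
```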

The other approach is what you chose yourself, which is to just use Char values as keys and not try to subtract 'a' from them. This has the advantage that it is simpler and will handle absolutely any character value at all if you decide to stop filtering the characters.

If you are comparing letter frequencies of English text with those of another language like Portuguese, you probably want to strip diacritical marks from the data:

julia> data = "Portugal, oficialmente República Portuguesa, é um país soberano unitário localizado no sudoeste da Europa, cujo território se situa na zona ocidental da Península Ibérica e em arquipélagos no Atlântico Norte.";

julia> using Unicode

julia> Unicode.normalize(data, stripmark=true, casefold=true)
"portugal, oficialmente republica portuguesa, e um pais soberano unitario localizado no sudoeste da europa, cujo territorio se situa na zona ocidental da peninsula iberica e em arquipelagos no atlantico norte."