Need Help with accented characters

Hello, i’m new in Julia and i’m facing a problem with accented characters like ã or á.

I loaded a dataframe from my SQLServer Database using ODBC and i’m having trouble to Uppercase the column “unidade”, i think this occurs because Julia doesn’t dettected the accented characters mentioned above.

There is my code:

using Pkg;
Pkg.add("DataFrames"); using DataFrames;
Pkg.add("CSV"); using CSV;
Pkg.add("ODBC"); using ODBC;

connection_string = "DRIVER="*driver*";PORT=port;SERVER="*server*";PORT=1443;DATABASE="*database*";UID="*username*
                  ";PWD="*password

cnn = ODBC.Connection(connection_string)

dataInicial = """ "2022/02/01" """

dataFinal = """ "2022/02/28" """

query_fechamentosSangrias = ("SELECT data, unidade, SUM(TOTAL_FECHAMENTO) AS 
    [Total Fechamento], SUM(TOTAL_FECHAMENTO_itens) AS [Total Fechamento itens], SUM(total_sangria) AS [Total Sangrias],
    SUM(TOTAL_parps) AS [Total parps], SUM(TOTAL_pendencia) AS [Total pendencia] FROM OPENQUERY(SIDGLINKEDSERVER,
    ' SELECT fechamento.id AS fechamento_id, unidade.nome AS unidade,ifnull(ROUND(fechamento.valor,2),0) AS total_fechamento,
    ifnull(itens.total_fechamento_itens,0) AS total_fechamento_itens,ifnull(ROUND(sangria.valor,2),0) AS total_sangria,
    ifnull(ROUND(parps.valor,2),0) AS total_parps, ifnull(ROUND(pendencia.valor,2),0) AS total_pendencia,
    DATE(fechamento.data_fechamento) AS data FROM sidg.fechamento LEFT JOIN (SELECT SUM(total_fechamento) AS total_fechamento_itens,fechamento_id FROM 
    (SELECT ifnull(ROUND(SUM(fechamento_itens.total),2),0) as total_fechamento, fechamento_id,produto_id FROM sidg.fechamento_itens where 
    DATE(createdat) >= ''2021-01-01'' GROUP BY fechamento_id,produto_id) AS A WHERE produto_id IN 
    (SELECT id FROM sidg.produtos WHERE agrega = 1) GROUP BY fechamento_id) AS itens ON itens.fechamento_id = fechamento.id 
    LEFT JOIN (SELECT IFNULL(ROUND(SUM(sangria.valor),2),0) AS valor,fechamento_id 
    FROM sidg.sangria GROUP BY fechamento_id) AS sangria ON sangria.fechamento_id = fechamento.id 
    LEFT JOIN (SELECT SUM(valor*-1) AS valor,fechamento_id FROM sidg.pendencia_financeira 
    GROUP BY fechamento_id) AS pendencia ON pendencia.fechamento_id = fechamento.id 
    LEFT JOIN (SELECT SUM(valor) AS valor,fechamento_id FROM sidg.deposito_parp 
    GROUP BY fechamento_id) AS parps ON parps.fechamento_id = fechamento.id 
    LEFT JOIN sidg.unidade ON unidade.id = fechamento.unidade_id 
    WHERE DATE(fechamento.data_fechamento) BETWEEN $dataInicial AND $dataFinal ') 
    WHERE fechamento_id <> 463409 GROUP BY data,unidade ORDER BY data")
fechamentosSangrias = DBInterface.execute(cnn,query_fechamentosSangrias) |> DataFrame

fechamentosSangrias

Here my dataframe:

data	unidade	Total Fechamento	Total Fechamento itens
Date	String?	Float64?	Float64?
1	2022-02-01	Loja Augusta	3361.2	3361.2
2	2022-02-01	Parada Alberto Lion	3534.05	3534.05
3	2022-02-01	Parada Ana Nery	4790.25	4790.25
4	2022-02-01	Parada C. A. Ypiranga	6543.3	6543.3
5	2022-02-01	Parada N. Sra. Aparecida	3380.05	3380.05
6	2022-02-01	Parada Pedro II	7212.1	7212.1
7	2022-02-01	Parada Rua do Grito	4861.7	4861.7
8	2022-02-01	Posto Jabaquara	6749.4	6749.4
9	2022-02-01	Posto Santana	6397.87	6397.87
10	2022-02-01	Terminal AE Carvalho	19624.8	19624.8
11	2022-02-01	Terminal \xc1gua Espraiada	3969.05	3969.05
12	2022-02-01	Terminal Amaral Gurgel	12186.1	12186.1
13	2022-02-01	Terminal Aricanduva	7685.3	7685.3
14	2022-02-01	Terminal Bandeira	21697.5	21697.5
15	2022-02-01	Terminal Cachoeirinha	31213.7	31213.7
16	2022-02-01	Terminal Campo Limpo	44532.9	44532.9
17	2022-02-01	Terminal Capelinha	14718.3	14718.3
18	2022-02-01	Terminal Casa Verde	9275.15	9275.15
19	2022-02-01	Terminal Cidade Tiradentes	28693.5	28693.5
20	2022-02-01	Terminal Graja\xfa	37474.2	37474.2

As you can see in the line 11 and line 20 have a problem in:

“Terminal \xc1gua Espraiada” the correct is “Terminal Água Espraiada”
“Terminal Graja\xfa” the correct is “Terminal Grajaú”

Why this happens? And how can i solve this?

What is the “?” in the columns datatypes?

Thanks Guys!

Most likely your database does not serve the data in UTF-8 format. You need to change the encoding, e.g. using GitHub - JuliaStrings/StringEncodings.jl: String encoding conversion in Julia using iconv.

Your column types allow missing values.

1 Like

Your database is probably using ISO-8859-1 encoding. Checkout the package bkamins suggested.

1 Like

Thx everybody!

But i’m stuck in how i can convert that column to the correct encoding with that package.

I’m trying with:

mapcols(x -> StringDecoder.(fechamentosSangrias.unidade, "UTF-16"), fechamentosSangrias)

that return an error:

MethodError: no method matching StringDecoder(::String, ::String)
Closest candidates are:
  StringDecoder(::IO, ::AbstractString) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:282
  StringDecoder(::IO, ::AbstractString, ::AbstractString) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:284
  StringDecoder(::IO, ::AbstractString, ::Encoding) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:282

And with

transform(fechamentosSangrias, :unidade => ByRow(x -> StringEncoder(x, "UTF-16")) => :unidade)

that returns that error

MethodError: no method matching StringEncoder(::String, ::String)
Closest candidates are:
  StringEncoder(::IO, ::AbstractString) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:208
  StringEncoder(::IO, ::AbstractString, ::AbstractString) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:210
  StringEncoder(::IO, ::AbstractString, ::Encoding) at C:\Users\Pedro Henrique\.julia\packages\StringEncodings\hHXRr\src\StringEncodings.jl:208

Other Try:

transform(fechamentosSangrias, :unidade => ByRow(x -> encode(x,"UTF-8")) => :unidade)

Error:

Byte sequence 0xc167756120457370726169616461 is invalid in source encoding or cannot be represented in target encoding

What are this errors?

How is the correct way?

Sorry but i’m really newbie in that language.

There probably is a better way. What you read is a sequence of bytes interpreted as UTF-8. That’s what a String is. So you need to decode it. But to do that the function decode in StringEncodings needs a byte buffer, that is Vector{UInt8}.

To convert a single string, the following works:

julia> using StringEncodings

julia> s
"avi\xe3o"

julia> decode(Vector{UInt8}(s), "ISO-8859-1")
"avião"


1 Like

Really thanks for you explanation!

I changed my transform function to

transform(fechamentosSangrias, :unidade => ByRow(x -> decode(Vector{UInt8}(x),"ISO-8859-1")) => :unidade)

And Works!

Look now the lines 11 and 20 are in the correct format!

fechamentosSangrias[[11,20],:]
2 rows × 7 columns (omitted printing of 2 columns)

data	unidade	Total Fechamento	Total Fechamento itens	Total Sangrias
Date	String	Float64?	Float64?	Float64?
1	2022-02-01	Terminal Água Espraiada	3969.05	3969.05	3964.15
2	2022-02-01	Terminal Grajaú	37474.2	37474.2	37495.2

Really thanks for your help @bkamins and @Paulo_Jabardo

2 Likes