Replace non ascii char

agatheLG · July 9, 2021, 2:54pm

Hi,
I switch formats a lot, from DataFrame to CSV file to SQL, and of course I get an error about an invalid character (Base.InvalidCharError{Char}('\x8e')) when I try to load my data into SQL.
I can indeed see that there is a non-ASCII character in my file: "n\x8eomutation" when I print it in my Julia REPL, which shows up as a question mark in a box in my terminal.
I have tried lots of different techniques, such as replace(mystring, "[^\\x00-\\x7F]" => " "), ascii(mystring), or unidecode(mystring)… ascii finds the index of the non-ASCII character, but it throws an error and doesn't offer the option to return this index or change the character directly. I have also tried making the change on the DataFrame and on the CSV file, but neither works. Do you have any idea how I could remove or replace non-ASCII characters in Julia, whether in a string, a file, or a DataFrame?

stevengj · July 9, 2021, 3:52pm

Ideally, you would figure out why you have these characters in your data and fix the problem at the source, rather than just discarding non-ascii bytes.

Julia handles non-ASCII Unicode characters just fine as long as your text is encoded in UTF-8. Maybe you just need to tell your data source to use UTF-8? Alternatively, you can use the StringEncodings.jl package to convert data from a different encoding.
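As a rough sketch of how one might diagnose this with only Base Julia: the snippet below checks whether a string is valid UTF-8 and lists the positions of the non-ASCII characters. The helper `latin1_to_utf8` is hypothetical (not from any package) and only applies under the assumption that the source bytes are Latin-1, which may well not be the case for the data in this thread; for other encodings, StringEncodings.jl is the tool to reach for.

```julia
# The string from the original post: the byte 0x8e is not valid UTF-8 here.
s = "n\x8eomutation"

# `isvalid` reports whether the string is well-formed UTF-8.
valid = isvalid(s)

# Collect the indices of every character that is not plain ASCII.
bad_indices = [i for (i, c) in pairs(s) if !isascii(c)]

# Hypothetical helper, ASSUMING the input bytes are Latin-1 (ISO-8859-1):
# in that encoding each byte maps directly to the Unicode code point
# with the same number, so the conversion is a byte-by-byte mapping.
latin1_to_utf8(str::AbstractString) = String(Char.(codeunits(str)))

# Illustrative sample (not from the thread): "café" stored as Latin-1 bytes.
fixed = latin1_to_utf8("caf\xe9")
```

If `valid` is false, the text is probably in some other encoding, and finding out which one is what makes a lossless conversion possible.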

That being said, if you want to discard non-ASCII characters, that is easy to do: for example, filter(isascii, mystring) works. (Update: I've added a note to the ascii(s) docs about this.) To instead replace non-ASCII characters with spaces, you could do replace(mystring, !isascii=>' ').
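A minimal sketch of both options, using the string from the original post. The `column` vector is a made-up stand-in for a DataFrame column or the lines of a file; the same calls can be mapped over any collection of strings. Note also that a pattern written as a plain string, such as "[^\\x00-\\x7F]", is matched literally rather than as a regex, which would explain why that attempt changed nothing.

```julia
s = "n\x8eomutation"

# Option 1: drop every non-ASCII character.
dropped = filter(isascii, s)

# Option 2: replace every non-ASCII character with a space.
# `!isascii` is a predicate applied to each character in turn.
spaced = replace(s, !isascii => ' ')

# Hypothetical stand-in for a DataFrame column or the lines of a file.
column = ["n\x8eomutation", "plain text"]
cleaned = map(x -> replace(x, !isascii => ' '), column)
```

Both options leave strings that are already pure ASCII unchanged, so they are safe to apply to a whole column.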

agatheLG · July 9, 2021, 4:44pm

Great, thank you so much! replace(mystring, !isascii=>' ') is exactly what I was looking for, but I couldn't manage to find the right syntax… And I completely agree, it would make much more sense to fix the problem at the source; however, I have no control over the file generated there. Thank you again.
