Replace non-ASCII characters

Hi,
I am switching a lot between formats, from DataFrame to CSV file to SQL, and of course I get an error about an invalid character (Base.InvalidCharError{Char}('\x8e')) when I try to load my data into SQL.
I can indeed see that there is a non-ASCII character in my file, "n\x8eomutation", when I print it in my Julia REPL; it shows as a question mark in a square in my terminal.
I have tried loads of different techniques, such as replace(mystring, "[^\\x00-\\x7F]" => " "), ascii(mystring), or unidecode(mystring)… ascii finds the index of the non-ASCII character but throws an error, and doesn't offer the option to return this index or directly change the character. I have also tried making the change on the DataFrame and on the CSV file, but neither works. Do you have any idea how, in Julia, I could remove or replace non-ASCII characters in a string, a file, or a DataFrame?

Ideally, you would figure out why you have these characters in your data and fix the problem at the source, rather than just discarding non-ascii bytes.

Julia handles non-ASCII Unicode characters just fine as long as your text is encoded in UTF-8. Maybe you just need to tell your data source to use UTF-8? Alternatively, you can use the StringEncodings.jl package to convert data from a different encoding.
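As a rough sketch of the StringEncodings.jl route: the byte 0x8e happens to be "é" in the Mac OS Roman encoding, so one guess (an assumption, not something confirmed in this thread) is that the file was written in that encoding. Assuming StringEncodings.jl is installed and that the underlying iconv library accepts the name "MACINTOSH", decoding might look like this:

```julia
using StringEncodings  # third-party package, install with: ] add StringEncodings

# The raw bytes from the question; 0x8e on its own is not valid UTF-8.
bytes = Vector{UInt8}("n\x8eomutation")

# ASSUMPTION: the source encoding is Mac OS Roman ("MACINTOSH" in iconv).
# Substitute whatever encoding your data source actually uses.
fixed = decode(bytes, "MACINTOSH")

# After decoding, the result is an ordinary, valid UTF-8 Julia string.
println(fixed, " valid UTF-8? ", isvalid(fixed))
```

For a whole file, the same decode call could be applied to the byte vector returned by read on the file path, before handing the text to a CSV parser.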

That being said, if you want to discard non-ASCII characters, that is easy to do. For example, filter(isascii, mystring) works. (Update: I've added a note to the ascii(s) docs about this.) To instead replace non-ASCII characters with spaces, you could do replace(mystring, !isascii=>' ').
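A minimal sketch of both calls on the string from the question, using only Base functions. The findfirst line is an addition for locating the offending character, and the `column` vector is a made-up stand-in for one column of a table, not a real DataFrame:

```julia
mystring = "n\x8eomutation"   # contains the invalid byte 0x8e

valid   = isvalid(mystring)                   # is this valid UTF-8?
idx     = findfirst(!isascii, mystring)       # index of the first non-ASCII character
dropped = filter(isascii, mystring)           # non-ASCII characters removed
spaced  = replace(mystring, !isascii => ' ')  # non-ASCII characters become spaces

# The same replacement applied to a collection of strings
# (`column` is hypothetical example data).
column  = ["n\x8eomutation", "plain"]
cleaned = map(s -> replace(s, !isascii => ' '), column)
```

Note that the attempt in the question, replace(mystring, "[^\\x00-\\x7F]" => " "), passes an ordinary string as the pattern, so it is matched literally rather than as a regular expression; the predicate form above sidesteps that.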


Great, thank you so much! replace(mystring, !isascii=>' ') is exactly what I was looking for, but I couldn't manage to find the appropriate syntax… And I completely agree, it would make much more sense to fix the problem at the source; however, I don't have control over the file generated at the source. Thank you again.