How to get file encoding?

Hi there,

Is there a way to find out programmatically how a file is encoded?
I’m receiving a text file and want to process it. To process it correctly, I have to know how the file is encoded, but I can’t find out how to do this. I found that I can open the file in an editor, go to “Save as”, and whatever encoding is preselected there is the one I’m looking for, but I can’t find a way to let a program do this.
The best-case scenario would be to convert everything to UTF-8 (every file contains only characters that UTF-8 can represent).
I experimented a little with the StringEncodings package, but it isn’t really working.
Any ideas? Help would be much appreciated.

Thanks

Not really. On Linux you can use file -i, but in general there’s no way to detect the encoding for certain.


To use StringEncodings, you need to know the name of the encoding of course:

akako@desktop ~/tmp> file -i utf.txt
utf.txt: text/plain; charset=utf-8
akako@desktop ~/tmp> iconv -t GBK utf.txt -o gbk.txt

julia> String(read("./utf.txt"))
"hello甲\n"


julia> String(read("./gbk.txt"))
"hello\xbc\xd7\n"

julia> using StringEncodings

julia> decode(read("./gbk.txt"), "GBK")
"hello甲\n"

Thanks!

For the first point, I’d say that might be enough, if it works on Windows. Would this help?

To the second point: I thought it might be possible to check whether everything is readable and, if not, keep changing the encoding until it is. Well, I don’t know. I just found isvalid(), and I know that for some unreadable characters something like “Malformed UTF-8 Character” gets displayed. So I thought one of these could be used for the check.
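A minimal sketch of that validity check, using only Base Julia. The helper name looks_like_utf8 is made up for illustration, and the sample bytes are the GBK bytes shown earlier in the thread. Note that this only tells you whether the bytes are well-formed UTF-8; it cannot tell you which other encoding a file uses.

```julia
# Hypothetical helper: true when the raw bytes are well-formed UTF-8.
# isvalid(String, bytes) is part of Base and does not modify `bytes`.
function looks_like_utf8(bytes::Vector{UInt8})
    return isvalid(String, bytes)
end

# "hello甲\n" encoded as UTF-8 (Julia strings are UTF-8 internally).
utf8_bytes = Vector{UInt8}("hello甲\n")

# The same text encoded as GBK, as seen in the gbk.txt example above.
gbk_bytes = UInt8[0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xbc, 0xd7, 0x0a]

println(looks_like_utf8(utf8_bytes))  # true
println(looks_like_utf8(gbk_bytes))   # false
```

For a real file you would pass read(path) instead of the literal byte vectors.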

Any ideas if there are more restrictions? I really don’t know how to work with these files otherwise.

Python’s chardet module might help - see suggestion here. Maybe you could try calling it with PyCall.

Or, with expert help, calling the uchardet C library here directly should be faster.

NB: maybe such a Julia package already exists?

Thanks for the tip!
It works kind of OK, I guess. I was reminded why I don’t like C, but most of it works now. I still sometimes get problems when a character can’t even be converted to UTF-16. Does something like print(myString,enc"ENCODING"), as in StringEncodings, exist? Also, am I correct that I won’t be able to make an .exe file when calling C?
Another thing: imagine you have a text file that was decoded with the wrong encoding, so now you just see something like “?äüß$§%”. If I understand the StringEncodings package correctly, I should be able to try decoding it with every known encoding and look at which result is fine, right? How would I do that?
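One way this trial-and-error idea could be sketched with StringEncodings, using the decode(bytes, name) call shown earlier in the thread. The helper name try_decodings and the candidate list are assumptions made up for illustration. Keep in mind that many single-byte encodings (Latin-1, for example) accept any byte sequence, so a decode that succeeds is not proof that the encoding is right; you still have to look at the results yourself.

```julia
using StringEncodings  # third-party package providing decode(bytes, name)

# Hypothetical helper: try each candidate encoding name and collect the
# ones that decode without throwing, in the order they were given.
function try_decodings(bytes::Vector{UInt8}, candidates)
    results = Pair{String,String}[]
    for name in candidates
        try
            push!(results, name => decode(bytes, name))
        catch
            # Unknown encoding name or invalid byte sequence: skip it.
        end
    end
    return results
end

# The GBK bytes from the gbk.txt example earlier in the thread.
bytes = UInt8[0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xbc, 0xd7, 0x0a]

# Assumed candidate list; adjust it to the encodings you expect to meet.
for (name, text) in try_decodings(bytes, ["UTF-8", "GBK", "ISO-8859-1"])
    println(name, " => ", repr(text))
end
```

Once you have picked the right candidate, the decoded value is an ordinary Julia String, so writing it out with write("out.txt", text) stores it as UTF-8.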