String encoding with DelimitedFiles

I’m importing some files with DelimitedFiles and I keep getting some weird characters in the header. For example:

julia> using DelimitedFiles

julia> csv, names = readdlm("data1.csv", ',', header=true)

julia> names
1×5 Matrix{AbstractString}:
 "\ufefft"  "c"  "hhol"  "vars"

The first component of the names matrix should be "t", rather than "\ufefft". I understand \ufeff is an indicator of U+FEFF encoding. How can I pass that to readdlm or parse it to eliminate that indicator in the resulting names matrix?
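
One direct way to drop the marker after reading is to strip a leading U+FEFF from the first header cell. A minimal sketch, assuming the marker only ever appears at the very start of the file; the sample file and the variable names `path`, `data`, and `header` are made up for illustration:

```julia
using DelimitedFiles

# Hypothetical sample file: UTF-8 BOM (EF BB BF) followed by a small CSV.
path = tempname()
open(path, "w") do io
    write(io, UInt8[0xEF, 0xBB, 0xBF])
    write(io, "t,c,hhol\n42,68,4\n")
end

data, header = readdlm(path, ',', header=true)

# Drop a leading U+FEFF from the first header cell, if present.
# lstrip leaves the string unchanged when there is no such character.
header[1] = lstrip(header[1], '\ufeff')
```

This leaves the file itself untouched and only cleans the parsed header.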

The bytes FE FF are the BOM (byte order mark) that marks a file as UTF-16 big-endian.
I'm surprised the strings come out the way they do, so my guess is that your input file is malformed: it starts with the UTF-16 BOM, but the actual data is not UTF-16.

But perhaps I am wrong and the following solution works for you:

using StringEncodings, DelimitedFiles
csv, names = readdlm( open(read, "data1.csv", enc"UTF-16"), ',', header=true)

names is now this:

julia> names
1×1 Matrix{AbstractString}:

Well, I don't know what your data1.csv is, but EF BB is the start of the UTF-8 BOM (EF BB BF), so it seems your data1.csv has somehow changed in the meantime.

Can you open your original data1.csv with a hex editor and report the first few bytes?
Going by your original post, I am expecting:
FE FF 74 2C 63 2C 68 68 6F …
which would be a wrongly encoded file. The correct one with FE FF would be:
FE FF 00 74 00 2C 00 63 00 2C 00 68 00 68 00 6F

ASCII 0x74 = 't'
0x2C = ','
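
If no hex editor is at hand, the first bytes can also be checked from Julia. A minimal sketch; `bom_kind` is a hypothetical helper name, and it only distinguishes the three most common marks (it does not tell UTF-16LE from UTF-32LE, which starts with the same two bytes):

```julia
# Hypothetical helper: classify a file by the byte order mark at its start.
function bom_kind(path::AbstractString)
    b = open(io -> read(io, 3), path)   # at most the first three bytes
    if length(b) >= 3 && b[1:3] == UInt8[0xEF, 0xBB, 0xBF]
        return "UTF-8"
    elseif length(b) >= 2 && b[1:2] == UInt8[0xFE, 0xFF]
        return "UTF-16BE"
    elseif length(b) >= 2 && b[1:2] == UInt8[0xFF, 0xFE]
        return "UTF-16LE"
    else
        return "none"
    end
end
```

Calling `bom_kind("data1.csv")` then reports which of these marks, if any, the file starts with.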

This is how it starts:

00000000: efbb bf74 2c63 2c68 686f 6c2c 6765 6e64  ...t,c,hhol,gend
00000010: 2c61 6765 6469 660d 0a34 322c 3638 2c34  ,agedif..42,68,4

I’ve uploaded the file here. Github seems to parse it without problem.

I see, it's a UTF-8 file with a BOM.
With StringEncodings I get an error:

julia> readdlm(open(read, "new.txt", enc"UTF-8 BOM"),',')
ERROR: Conversion from UTF-8 BOM to UTF-8 not supported by iconv implementation, check that specified encodings are correct
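
Staying with DelimitedFiles alone, another option is to consume the three BOM bytes from the stream before handing it to readdlm. A minimal sketch, assuming the file is otherwise plain UTF-8; `readdlm_skip_bom` is a hypothetical helper name, not part of DelimitedFiles:

```julia
using DelimitedFiles

# Hypothetical helper: behaves like readdlm(path, delim; kwargs...),
# but skips a UTF-8 BOM (EF BB BF) at the start of the file.
function readdlm_skip_bom(path::AbstractString, delim::AbstractChar; kwargs...)
    open(path) do io
        # Read the first three bytes; rewind if they are not the BOM.
        if read(io, 3) != UInt8[0xEF, 0xBB, 0xBF]
            seekstart(io)
        end
        readdlm(io, delim; kwargs...)
    end
end
```

It would be called as `csv, names = readdlm_skip_bom("data1.csv", ','; header=true)`; files without a BOM are read from the beginning as usual.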

Would CSV.jl and DataFrames be a viable option for you?

using DataFrames, CSV
df = CSV.read("data1.csv", DataFrame)  # assumed call; CSV.jl skips a leading UTF-8 BOM

That could work. Do you know if there's a way to extract the data from the DataFrame into an Array without building it manually? I guess in the worst case I could do:

mat = [df.col1 df.col2]
mat = Matrix(df)

haha, the most obvious answer. Thank you.
