String encoding with DelimitedFiles

I’m importing some files with DelimitedFiles and I keep getting some weird characters in the header. For example:

julia> using DelimitedFiles

julia> csv, names = readdlm("data1.csv", ',', header=true)

julia> names
1×5 Matrix{AbstractString}:
 "\ufefft"  "c"  "hhol"  "vars"

The first component of the names matrix should be "t", rather than "\ufefft". I understand \ufeff is an indicator of U+FEFF encoding. How can I pass that to readdlm or parse it to eliminate that indicator in the resulting names matrix?

The bytes FE FF at the start of a file are the byte order mark (BOM) for UTF-16 big-endian.
I wonder why the strings come out the way they do. My guess is that your input file is malformed: it starts with the UTF-16 BOM, but the actual data isn’t UTF-16.

But perhaps I am wrong and the following solution works for you:

using StringEncodings, DelimitedFiles
csv, names = readdlm( open(read, "data1.csv", enc"UTF-16"), ',', header=true)

names is now this:

julia> names
1×1 Matrix{AbstractString}:
 "\uefbb뽴Ᵽⱨ桯氬来湤ⱡ来摩昍\ua34㈬㘸ⰴⰱⰱ㈍ਸ਼㠬㘸ⰴⰰⰱ㈍ਲ਼㔬㌵ⰸⰱⰴഊ㌵ⰳ㔬㠬〬㐍ਲ㤬㈹ⰱ〬ㄬ㠍\ua31㐬㈹ⰱ〬〬㠍\ua37㘬㜶ⰲㄬㄬㄲഊ㜶ⰷ㘬㈱ⰰⰱ㈍\ua37㔬㜵ⰲ㐬ㄬ」ਵ㔬㜵ⰲ㐬〬」\ua34㘬㐶ⰲ㠬ㄬ㈷ഊ㐶ⰴ㘬㈸ⰰⰲ㜍ਸ㌬ㄳ㜬㈹ⰱⰭ㌍ਸ㔬ㄳ㜬㈹ⰰⰭ㌍ਲ〬㈰ⰳ㔬ㄬⴱഊ㈰ⰲ〬㌵ⰰⰭㄍਵㄬ㤴ⰳ㘬ㄬⴱഊ㤴ⰹ㐬㌶ⰰⰭㄍ\ua34㤬㐹ⰳ㠬ㄬ㌍\ua34㤬㐹ⰳ㠬〬㌍ਸ਼㔬㘵ⰴ㈬ㄬ㜍ਸ਼㔬㘵ⰴ㈬〬㜍ਲ㔬㈵ⰴ㘬ㄬㄍਲ㔬㈵ⰴ㘬〬ㄍ\ua37㜬ㄲ㐬㐹ⰱⰱ㘍\ua37㜬ㄲ㐬㐹ⰰⰱ㘍\ua34㌬㐳ⰵ〬ㄬㄳഊ㐳ⰴ㌬㔰ⰰⰱ㌍ਵㄬ㜲ⰵ㈬ㄬㄶഊ㔲ⰷ㈬㔲ⰰⰱ㘍\ua31㐬ㄴⰵ㌬ㄬ㤍\ua31㐬ㄴⰵ㌬〬㤍ਸㄬ㠱ⰵ㔬ㄬㄷഊ㔸ⰸㄬ㔵ⰰⰱ㜍ਲ㐬㈴ⰵ㜬ㄬ㜍ਲ㐬㈴ⰵ㜬〬㜍\ua31㤬ㄹⰶ〬ㄬ㌍\ua31㤬ㄹⰶ〬〬㌍ਲ਼㐬㌴ⰶ㔬ㄬ㈰ഊ㌴ⰳ㐬㘵ⰰⰲ」ਸㄬ㠱ⰶ㜬ㄬ㈲ഊ㐲ⰸㄬ㘷ⰰⰲ㈍\ua34㐬㐴ⰷ㔬ㄬ\u3130ഊ㐴ⰴ㐬㜵ⰰⰱ」ਵ〬㔰ⰷ㘬ㄬⴳ㠍ਵ〬㔰ⰷ㘬〬ⴳ㠍ਲ਼㤬\u3130㌬㜷ⰱⰳㄍ\ua34㐬\u3130㌬㜷ⰰⰳㄍਵ㌬㔳ⰸ〬ㄬㄴഊ㔳ⰵ㌬㠰ⰰⰱ㐍ਹⰹⰸ㔬ㄬ」ਹⰹⰸ㔬〬」ਸ਼㈬㘲ⰹ㈬ㄬ㐍ਸ਼㈬㘲ⰹ㈬〬㐍\ua31ㄲⰱㄶⰹ㘬ㄬⴲ㤍\ua37㌬ㄱ㘬㤶ⰰⰭ㈹ഊ㔱ⰵㄬ㤸ⰱⰵഊ㔱ⰵㄬ㤸ⰰⰵഊ㔷ⰹ㘬\u3130㐬ㄬ㔍\ua37㌬㤶ⰱ〴ⰰⰵഊ㤱ⰱ㌷ⰱ〸ⰱⰱ㐍ਹ〬ㄳ㜬\u3130㠬〬ㄴഊ㘳ⰶ㌬ㄱㄬㄬⴲ」ਸ਼㌬㘳ⰱㄱⰰⰭ㈰ഊ㘱ⰱ㈲ⰱㄹⰱⰹഊ㘳ⰱ㈲ⰱㄹⰰⰹഊ㔵ⰵ㔬ㄲㄬㄬㄷഊ㔵ⰵ㔬ㄲㄬ〬ㄷഊ㈷ⰲ㜬ㄲ㔬ㄬⴲഊ㈷ⰲ㜬ㄲ㔬〬ⴲഊ㔵ⰵ㔬ㄲ㘬ㄬ㈍ਲ਼㈬㔵ⰱ㈶ⰰⰲഊ㔹ⰱ〵
...

Well, I don’t know what your data1.csv is, but EF BB is the start of the UTF-8 BOM (EF BB BF). It seems your data1.csv has somehow changed in the meantime.
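
To illustrate where the CJK-looking characters come from, here is a minimal sketch that pairs UTF-8 bytes up as if they were UTF-16 big-endian code units. The helper name `decode_as_utf16be` is made up for this example, and big-endian pairing is an assumption based on the \uefbb at the start of the output above. It only uses `transcode` from Base.

```julia
# Hypothetical helper: interpret raw bytes as UTF-16 big-endian code units.
# A trailing odd byte, if any, is ignored.
function decode_as_utf16be(bytes::Vector{UInt8})
    units = [(UInt16(bytes[i]) << 8) | bytes[i + 1] for i in 1:2:length(bytes) - 1]
    return transcode(String, units)
end

# First six bytes of a UTF-8 file with BOM that starts with "t,c"
utf8_bytes = UInt8[0xEF, 0xBB, 0xBF, 0x74, 0x2C, 0x63]

# The BOM bytes get merged with the data bytes into unrelated code points
println(repr(decode_as_utf16be(utf8_bytes)))
```

The three code units are 0xEFBB, 0xBF74 and 0x2C63, which should correspond to the first three characters of the garbled string above.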

Can you open your original data1.csv with a hex editor and report the first few bytes?
From your original post I am expecting:
FE FF 74 2C 63 2C 68 68 6F …
which would be a wrong file encoding. The correct one with FE FF would be:
FE FF 00 74 00 2C 00 63 00 2C 00 68 00 68 00 6F

In ASCII: 74 = ‘t’, 2C = ‘,’, 63 = ‘c’.
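
If no hex editor is at hand, the first bytes can also be checked from Julia. This is a rough sketch, not a full encoding detector: `sniff_bom` is a name made up for this example, and it only knows the three BOM patterns relevant to this thread.

```julia
# Hypothetical helper: report which BOM (if any) a file starts with.
function sniff_bom(path::AbstractString)
    head = open(io -> read(io, 3), path)  # reads at most the first 3 bytes
    if length(head) >= 3 && head[1:3] == [0xEF, 0xBB, 0xBF]
        return "UTF-8"
    elseif length(head) >= 2 && head[1:2] == [0xFE, 0xFF]
        return "UTF-16BE"
    elseif length(head) >= 2 && head[1:2] == [0xFF, 0xFE]
        return "UTF-16LE"  # caveat: the UTF-32LE BOM starts with these bytes too
    else
        return nothing     # no BOM recognised
    end
end
```

Calling `sniff_bom("data1.csv")` on the file in question would then tell you which of the cases above applies.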

This is how it starts:

00000000: efbb bf74 2c63 2c68 686f 6c2c 6765 6e64  ...t,c,hhol,gend
00000010: 2c61 6765 6469 660d 0a34 322c 3638 2c34  ,agedif..42,68,4

I’ve uploaded the file here. Github seems to parse it without problem.

I see, it’s a UTF-8 file with a BOM.
With StringEncodings I get an error:

julia> readdlm(open(read, "new.txt", enc"UTF-8 BOM"),',')
ERROR: Conversion from UTF-8 BOM to UTF-8 not supported by iconv implementation, check that specified encodings are correct
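
Since the data itself is plain UTF-8, one option that stays within DelimitedFiles is to strip the leading U+FEFF yourself and hand the rest to readdlm through an IOBuffer. A minimal sketch, assuming the file fits in memory; `readdlm_nobom` is a name made up for this example:

```julia
using DelimitedFiles

# Hypothetical wrapper: drop a leading UTF-8 BOM (U+FEFF), then parse as usual.
function readdlm_nobom(path::AbstractString, delim::AbstractChar; kwargs...)
    text = read(path, String)
    if startswith(text, "\ufeff")
        # skip the first character (3 bytes in UTF-8)
        text = text[nextind(text, 1):end]
    end
    return readdlm(IOBuffer(text), delim; kwargs...)
end

# Usage, with the file from this thread:
# csv, names = readdlm_nobom("data1.csv", ',', header=true)
```

Files without a BOM go through unchanged, so the wrapper should be safe to use for both kinds of input.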

Would CSV.jl and DataFrames be a viable option for you?

using DataFrames, CSV
data=DataFrame(CSV.File("data1.csv"))

That could work. Do you know if there’s a way to extract the data from the DataFrame into an Array without building it manually? I guess in the worst case I could do:

mat = [df.col1 df.col2]

mat = Matrix(df)

haha, the most obvious answer. Thank you.
