Reading '¤' separated value file with readdlm

csv

#1

I was given a CSV file to study. It contains strings (words enclosed in ") separated by the currency sign '¤' (don’t ask me why…). When I use readdlm to load its content, the first quote sign is treated as part of the separator, whereas the second one is treated as part of the actual content. Here is an MWE:

f = open("foo.csv", "w")
write(f, "\"a\"¤\"b\"¤\"c\"\n")
write(f, "\"here is a content\"¤\"here is another one\"¤\"this is enough\"\n")
close(f)

using DelimitedFiles

x, h = readdlm("foo.csv", '¤'; header = true)
@show x[1,1]

I can’t find a way to parse it correctly with readdlm, any idea?
Many thanks!


#2

FWIW,

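using CSV, DataFrames
# Recent CSV.jl versions need a sink type (here DataFrame) as the second argument.
x = CSV.read("foo.csv", DataFrame; delim = '¤')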
@show x[1,1]
  #  x[1, 1] = "here is a content"

parses the file correctly


#3

Thanks a lot! I have recently been using readdlm a lot because for some unknown reason CSV.read was extremely slow on previous files with which I was working. But it’s actually fast on these new files, so this solves my problem.


#4

Yeah, the issue here is that I don’t think readdlm supports non-ASCII delimiters (lots of CSV readers don’t). It’s actually newish functionality in CSV.jl (as of last fall). If you ever have performance issues w/ CSV.jl, please share! Post here on Discourse or open an issue at the JuliaData/CSV.jl repo and I’m happy to help figure out what’s going on.
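
If you want to stay with readdlm anyway, one possible workaround (a minimal sketch, not something readdlm documents for this case) is to swap the non-ASCII delimiter for an ASCII one in memory and parse the result from an IOBuffer. The temporary path and the choice of a tab as the replacement are assumptions made for illustration, and it only works if neither '¤' nor a tab appears inside the quoted fields.

```julia
using DelimitedFiles

# Recreate the sample file from the first post (temporary path, for illustration only).
path = joinpath(mktempdir(), "foo.csv")
open(path, "w") do f
    write(f, "\"a\"¤\"b\"¤\"c\"\n")
    write(f, "\"here is a content\"¤\"here is another one\"¤\"this is enough\"\n")
end

# Swap the non-ASCII delimiter for a tab in memory, then parse from an IOBuffer.
# Assumption: neither '¤' nor a tab occurs inside the quoted fields.
text = replace(read(path, String), '¤' => '\t')
x, h = readdlm(IOBuffer(text), '\t'; header = true)

@show x[1, 1]
@show h
```

This reads the whole file into memory as a String, so it is better suited to small or medium files than to very large ones.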


#5

Thanks @quinnj! I’ll try to find these old files and benchmark them with CSV.read and readdlm. If I manage to reproduce the problems I had, I’ll post the results either here, tagging you, or on the CSV repo.


#6

If it is an old file, I would suspect the separator is simply a Latin-1 0xa4 byte, instead of the UTF-8 sequence 0xc2 0xa4. Does CSV support non-UTF-8 encodings?

Frankly, I would just fix the file with tr or a similar tool to have commas, instead of extending support for these cases.
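
For anyone who prefers to do that fix from within Julia rather than with tr, here is a rough sketch, assuming the file really is Latin-1. Every Latin-1 byte maps to the Unicode code point with the same value, so converting each byte to a Char yields a valid UTF-8 string. The helper name latin1_to_utf8 and the sample bytes are made up for illustration, and the blind replacement is only safe if '¤' never occurs inside a quoted field.

```julia
# Hypothetical helper: decode Latin-1 bytes into a (UTF-8) Julia String.
# Each Latin-1 byte equals the Unicode code point of the character it encodes.
latin1_to_utf8(bytes::AbstractVector{UInt8}) = String(Char.(bytes))

# Sample bytes standing in for a Latin-1 file: "a"¤"b" followed by a newline,
# with the separator stored as the single byte 0xa4.
raw = UInt8[0x22, 0x61, 0x22, 0xa4, 0x22, 0x62, 0x22, 0x0a]

text = latin1_to_utf8(raw)          # now valid UTF-8, '¤' is U+00A4
fixed = replace(text, '¤' => ',')   # assumption: '¤' never appears inside a field

print(fixed)
```

For a real file, raw would come from read(path), which returns the file contents as a Vector{UInt8}.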