I was given a CSV file to study. It contains strings (words enclosed in ") separated by the currency sign '¤' (don’t ask me why…). When I use readdlm to load its content, the first quote sign is treated as part of the separator, whereas the second one is treated as part of the actual content. Here is an MWE:
using DelimitedFiles
f = open("foo.csv", "w")
write(f, "\"here is a content\"¤\"here is another one\"¤\"this is enough\"\n")
close(f)  # flush the write so the file is complete before reading it back
x, h = readdlm("foo.csv", '¤'; header = true)
I can’t find a way to parse it correctly with readdlm. Any idea?
using CSV
h = CSV.read("foo.csv", delim = '¤')
# x[1, 1] = "here is a content"
This parses the file correctly.
Thanks a lot! I have recently been using readdlm a lot because, for some unknown reason, CSV.read was extremely slow on the files I was working with previously. But it’s actually fast on these new files, so this solves my problem.
Yeah, the issue here is that I don’t think readdlm supports non-ASCII delimiters (lots of CSV readers don’t). It’s actually newish functionality in CSV (as of last fall). If you ever have performance issues with CSV.jl, please share! Post here on Discourse or open an issue at the JuliaData/CSV.jl repo and I’m happy to help figure out what’s going on.
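If you want to stay within Base Julia, a line like the one in the MWE can also be split by hand. This is only a minimal sketch, assuming that fields never contain the delimiter or escaped quotes; split_quoted and unquote are hypothetical helper names, not part of DelimitedFiles or CSV.jl.

```julia
# Hypothetical helper: strip one pair of enclosing double quotes, if present.
function unquote(s::AbstractString)
    if length(s) >= 2 && startswith(s, '"') && endswith(s, '"')
        return String(chop(s; head = 1, tail = 1))
    end
    return String(s)
end

# Hypothetical helper: split one line on a (possibly non-ASCII) delimiter
# and unquote every field. Quoted delimiters are NOT handled here.
function split_quoted(line::AbstractString, delim::AbstractChar = '¤')
    return [unquote(field) for field in split(chomp(line), delim)]
end

line = "\"here is a content\"¤\"here is another one\"¤\"this is enough\"\n"
fields = split_quoted(line)
# fields[1] == "here is a content"
```

For anything beyond simple files like this one, a real CSV parser is the safer choice, since it deals with quoting rules that this sketch ignores.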
Thanks @quinnj! I’ll try to find these old files and benchmark them with readdlm and, if I manage to reproduce the problems I had, post the results either here, tagging you, or on the
If it is an old file, I would suspect the delimiter is simply the Latin-1 byte 0xa4, instead of the UTF-8 sequence 0xc2 0xa4. Does CSV support non-UTF-8 encodings?
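One way to check this suspicion is to look at the raw bytes of the file. The sketch below is a rough heuristic, not a real encoding detector, and guess_encoding is a hypothetical name: it reports UTF-8 when the bytes form a valid UTF-8 string, and only hints at Latin-1 when they do not and a lone 0xa4 byte is present.

```julia
# Hypothetical helper: rough guess at how the currency sign is encoded.
function guess_encoding(bytes::Vector{UInt8})
    if isvalid(String, bytes)
        return :utf8            # also the answer for pure ASCII input
    elseif 0xa4 in bytes
        return :maybe_latin1    # invalid UTF-8 that contains the byte 0xa4
    else
        return :unknown
    end
end

# In UTF-8 the currency sign takes two bytes.
utf8_sign = collect(codeunits("¤"))   # UInt8[0xc2, 0xa4]

# For a real file one would pass its raw bytes, e.g. guess_encoding(read("foo.csv")).
utf8_guess = guess_encoding(Vector{UInt8}("a¤b"))
latin1_guess = guess_encoding(UInt8[0x61, 0xa4, 0x62])
```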
Frankly, I would just fix the file with tr or a similar tool to have commas, instead of extending support for these cases.
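The same kind of fix can be sketched in Julia itself. This assumes the input is either valid UTF-8 or Latin-1 (where every byte maps directly to the code point of the same value), and that every field is quoted so that commas inside the content stay unambiguous; to_comma_separated is a hypothetical name.

```julia
# Hypothetical helper: decode the bytes and replace the currency sign by a comma.
function to_comma_separated(bytes::Vector{UInt8})
    text = if isvalid(String, bytes)
        String(copy(bytes))        # copy, because String(bytes) empties its argument
    else
        String(map(Char, bytes))   # assumption: Latin-1, one byte per code point
    end
    return replace(text, '¤' => ',')
end

from_latin1 = to_comma_separated(UInt8[0x22, 0x61, 0x22, 0xa4, 0x22, 0x62, 0x22])
from_utf8 = to_comma_separated(Vector{UInt8}("\"a\"¤\"b\""))

# For a real file, something like:
# write("foo_fixed.csv", to_comma_separated(read("foo.csv")))
```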