How to specify thousand separator with CSV.read data

My Excel spread uses apostrophes to separate thousands and even worse it uses" ’ " instead of straight apostroph e.g. 1’000. Is there a way to specify this in CSV.read()? Because when I import data it interprets 1’000 as 1\x92000.

Excel Data comes in the form of:
Amount;Date;Time
1’000;10.01.2022;13:22:17
6’000;10.01.2022;13:20:12
3’000;10.01.2022;13:23:08

then I use
df=CSV.read(“filepath.csv”, delim =“;”, dateformat= “dd.mm.yyyy”, normalizenames= true, DataFrame);

println(df) shows the Amount as String7 which prohibits me from doing any calculation with the data.

image

Welcome to the community! Can you post an example of the data that you’re trying to read so that others can easily copy/paste it as they attempt to help? Thanks!

There is not:

As a workaround, you can just remove the separator and parse the resulting string as a number. In the DataFrames.jl mini-language:

julia> transform!(df, :Amount =>  ByRow(str -> parse(Int, replace(str, '’' => ""))) => :Amount)
3×3 DataFrame
 Row │ Amount  Date        Time
     │ Int64   Date        String15
─────┼──────────────────────────────
   1 │   1000  2022-01-10  13:22:17
   2 │   6000  2022-01-10  13:20:12
   3 │   3000  2022-01-10  13:23:08

or more simply

df.Amount = parse.(Int, replace.(df.Amount, '’' => ""))
1 Like

You should also be able to use CSV.read: read the file into a string, delete the delimiter, and do CSV.read on an in-memory buffer (IOBuffer):

s = filter(!=('’'), read("filepath.csv", String))
CSV.read(CSV.File(IOBuffer(s)), delim=';', ....)

Thanks a lot for the suggestions.

@drvi just added support for this in the Parsers.jl package, so now we just need to plumb support for this in the CSV.jl package if someone wanted a pretty easy first issue to try out.

2 Likes

Just so whoever stumbles across this problem knows. The issue is fixed by Support groupmark by LilithHafner · Pull Request #1093 · JuliaData/CSV.jl · GitHub and according to this example, you can now do this.

using CSV

# In many places in the world, digits to the left of the decimal place are broken into
# groups by a thousands separator. We can ignore those separators by passing the `groupmark`
# keyword argument.
data = """
x y
1 2
2 1,729
3 87,539,319
"""

file = CSV.File(IOBuffer(data); groupmark=',')
1 Like