‘,’ and ‘.’ in numbers inside CSV format

Hey there!

I have the following problem. I want to read in data (.csv) which has the following structure:

data = """
col1;col2;col3;col4;col5
"05.02.2023";1000,01;2000,02;3000,03;12:00:00
"06.02.2023";4000,04;5000,05;6000,06;12:00:00
"""

This data can be read easily by stating that ‘,’ is used as the decimal separator instead of ‘.’:

using CSV

file = CSV.File(IOBuffer(data); delim=';', decimal=',', dateformat="dd.mm.yyyy")

2-element CSV.File:
 CSV.Row: (col1 = Dates.Date("2023-02-05"), col2 = 1000.01, col3 = 2000.02, col4 = 3000.03, col5 = String15("12:00:00"))
 CSV.Row: (col1 = Dates.Date("2023-02-06"), col2 = 4000.04, col3 = 5000.05, col4 = 6000.06, col5 = String15("12:00:00"))

However my data looks like this:

data = """
col1;col2;col3;col4;col5
"05.02.2023";1.000,01;2.000,02;3.000,03;12:00:00
"06.02.2023";4.000,04;5.000,05;6.000,06;12:00:00
"""

There is an extra ‘.’ which groups the thousands digits, similar to how Julia allows the ‘_’ underscore as a digit separator. So 1000,0 = 1.000,0 = 1_000,0.

Is there a way of handling such problems easily, without needing to do something like this (sketched below)?
- read the file line by line
- replace where needed: replace(x, "." => "")
- save the file
- then read it in with CSV
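That is, something along these lines (a rough sketch; the file names are made up, and note that the replace also strips the ‘.’ from the dates):

lines = readlines("myfile.csv")
open("myfile_clean.csv", "w") do io
    for l in lines
        println(io, replace(l, "." => ""))   # also removes the '.' in the dates
    end
end
file = CSV.File("myfile_clean.csv"; delim=';', decimal=',')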

Thanks for your help!

It is on the to-do list: Ability to specify a thousands separator character · Issue #626 · JuliaData/CSV.jl · GitHub.

For now, pass a custom parser for this column. The parser can be from Parsers.jl:

julia> CSV.Parsers.parse(Float64, "1.234,56", CSV.Parsers.Options(groupmark='.', decimal=','))
1234.56

Unfortunately, this is currently not super convenient.
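A sketch of one way to use it: read the affected columns as String via the standard types keyword and apply the parser afterwards (this does not hook the parser into CSV.File itself):

opts = CSV.Parsers.Options(groupmark='.', decimal=',')
file = CSV.File(IOBuffer(data); delim=';', dateformat="dd.mm.yyyy",
                types=Dict(:col2 => String, :col3 => String, :col4 => String))
col2 = [CSV.Parsers.parse(Float64, row.col2, opts) for row in file]   # [1000.01, 4000.04]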


It seemed straightforward enough to make CSV.jl support this, so I proposed a PR. I think it might land on the master branch soon.
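If it does land, I would expect usage to look roughly like this (the groupmark keyword name is an assumption based on the Parsers.jl option above; check the CSV.jl docs once it is released):

file = CSV.File(IOBuffer(data); delim=';', decimal=',', groupmark='.', dateformat="dd.mm.yyyy")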


Unless the file is huge, you could just read it into a buffer, do the replacement, and read the CSV from an IOBuffer, rather than saving it as another file. You can even do the replacement in place, e.g.

buf = filter!(!==(UInt8('.')), read("myfile.dat"))   # delete every '.' byte in place
file = CSV.File(IOBuffer(buf); delim=';', decimal=',')

Ah good to know. This would be a great feature!

And thanks for the advice with Parsers.jl, but as you said it is a little inconvenient.
I’ll stick with the idea from @stevengj; this works OK until such a feature is implemented in CSV.jl directly. :slight_smile:

Thanks for the idea! This works well enough for me :slight_smile:
My files are not that big; the only disadvantage is that I also filter out the ‘.’ in the dates, but this is not so bad.

Thanks!

This will work assuming you do not use '.' anywhere else in the file (@mpirke has mentioned that the dates are encoded as 05.02.2023, which would be affected).
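If you want to keep the dates intact, here is a sketch of a more targeted replacement, assuming the thousands groups are always exactly three digits (so the dots in dd.mm.yyyy dates are left alone):

raw = read("myfile.dat", String)
cleaned = replace(raw, r"(?<=\d)\.(?=\d{3}(?!\d))" => "")   # drop '.' only between a digit and a 3-digit group
file = CSV.File(IOBuffer(cleaned); delim=';', decimal=',', dateformat="dd.mm.yyyy")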

Alternatively, a straightforward solution is to read the table into memory as-is and then process the remaining string columns.

julia> using CSV; using Tables: columntable

julia> tbl = CSV.File(IOBuffer(data); delim=';', dateformat="dd.mm.yyyy") |> columntable   # col2, col3 and col4 stay as strings here

julia> tbl = map(tbl) do xs                # map over the columns
           map(xs) do x                    # map over the cells of each column
               x isa AbstractString && occursin(r"^[\d.,]+$", x) ?
                  parse(Float64, replace(x, '.' => "", ',' => '.')) :
                  x
           end
       end

This keeps datetimes and other columns that were successfully parsed by CSV.
It somewhat assumes that either all strings in a column match or none do; otherwise you’ll end up with a mix of strings and floats in that column.

The same code above works for both columntables and rowtables, as well as other table types like StructArrays.
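For example, the same transformation applied to a rowtable (a sketch; rowtable also comes from Tables.jl):

julia> using Tables: rowtable

julia> rows = CSV.File(IOBuffer(data); delim=';', dateformat="dd.mm.yyyy") |> rowtable

julia> rows = map(rows) do r               # map over the rows
           map(r) do x                     # map over the fields of each row
               x isa AbstractString && occursin(r"^[\d.,]+$", x) ?
                  parse(Float64, replace(x, '.' => "", ',' => '.')) :
                  x
           end
       end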