CSV ruins scientific notation

I think it’s a bug in CSV, but I’m not 100% sure, which is why I’m asking here first:

I wrote some code to convert a txt file into a csv file (basically adding a header and commas).

My code:

using CSV, DataFrames

# Base file name without directory or extension
fname = split(split(txt_file, "/")[end], ".")[1]

header = ["x", "y", "z", "d"]

df = CSV.read(txt_file, DataFrame; delim=" ", header=header)
println(df)
CSV.write("../inputs/csv/" * fname * ".csv", df)

The txt file:

1234.56789 4232.45455455 -8789.5455 1.0121325235677855e-11

The print out:


fname: test_123
1×4 DataFrame
│ Row │ x       │ y       │ z        │ d           │
│     │ Float64 │ Float64 │ Float64  │ Float64     │
├─────┼─────────┼─────────┼──────────┼─────────────┤
│ 1   │ 1234.57 │ 4232.45 │ -8789.55 │ 1.01213e-11 │

The csv file:

x,y,z,d
1234.56789,4232.45455455,-8789.5455,10121325235677855e-27

In my opinion it should keep the same scientific representation (or perhaps true normalized scientific notation, with exactly one digit before the decimal point, if the original has more than one).
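For what it’s worth, a quick check (a sketch using the values from the example above) shows that the two textual forms denote exactly the same Float64, so the file itself loses no precision:

```julia
# Sketch: the non-normalized text written to the csv file denotes
# exactly the same Float64 as the normalized form from the txt file.
a = parse(Float64, "1.0121325235677855e-11")  # value in the txt file
b = parse(Float64, "10121325235677855e-27")   # value in the csv file
@assert a === b  # identical bit patterns, so no information is lost
```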

Is your problem the lack of read-write invariance (which I guess cannot be helped), or that the scientific notation is not normalized, e.g. as in this MWE:

using CSV, Tables
x = 10121325235677855e-27
io = IOBuffer()
CSV.write(io, [(x = x, )])
String(take!(io))

I think CSV should not change the scientific notation, or if that is not possible, normalize it to the standard scientific notation. In my case it seems to have a different “standard”.

Note that e.g. a Float64 value does not retain this information:

julia> parse(Float64, "1.1") == parse(Float64, "11e-1")
true

so this expectation is a bit unrealistic.

Note that it is called normalized notation, but it is not standard in any sense (though it is quite common). All representations are equally valid.

I wonder if this is a practical concern. Doing it the way it is currently implemented is extremely efficient and also happens to be valid. Is there a program that has problems with the output?

The integer value to a large power is a bit weird. I wonder if it’s due to printing the shortest text representation of the value. That seemed like a good idea at one point but seems a bit weird now.


I think it comes from this PR, and the rationale seems to be being able to use the output of Grisu directly.

I am fine with it either way. I am also more used to eyeballing normalized notation, but I rarely look at CSV files directly (usually only when I am debugging), so as long as it is valid, it should be fine.


Thanks for your reply. It’s valid, it just seems odd. I shouldn’t have called it a bug; it’s more unexpected behavior, but it’s okay if it stays the way it is.

Just to offer a different opinion here: I look at CSV files a lot (all my outputs are basically CSV), and I often inspect them to check whether, or how fast, some algorithm converged (one may call this debugging :wink: ). For this purpose only the order of magnitude of the numbers matters. It would be really annoying if I had to count the digits before the decimal point and add that count to the exponent before I could judge the order of magnitude of a number. This is exactly why we use normalized notation.


I had this problem recently when trying to read data generated in Julia into a commercial CFD simulation software. The data were silently corrupted, and it took me a while to figure out the reason since most of the numbers looked reasonable when inspecting them in the CFD package.

In this case, I think the best route is to open an issue for this package to emit normalized floats.


See a solution (not really “smart”) proposed in New behaviour due to an update of the package CSV when using CSV.write

I just cast the Float64 columns to String to keep the scientific notation used in the DataFrame.
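A minimal sketch of that workaround (the column names and the %.16e format are my choices for illustration, not from the linked thread): format the float columns as normalized-scientific strings with Printf before writing, so CSV.write emits the text verbatim:

```julia
using CSV, DataFrames, Printf

df = DataFrame(x = [1234.56789], d = [1.0121325235677855e-11])

# Replace every float column with normalized scientific strings;
# CSV.write then writes these strings unchanged.
for c in names(df)
    if eltype(df[!, c]) <: AbstractFloat
        df[!, c] = [@sprintf("%.16e", v) for v in df[!, c]]
    end
end

CSV.write("out.csv", df)
```

The downside is that the columns are now strings, so any further numeric processing needs a parse back to Float64.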