CSV ruins scientific notation

I think it’s a bug in CSV, but I’m not 100% sure, which is why I’m asking here first:

I wrote some code to convert a txt file into a csv file (basically adding a header and commas).

My code:

using CSV, DataFrames

# Base file name without directory or extension
fname = split(split(txt_file, "/")[end], ".")[1]

header = ["x", "y", "z", "d"]

df = CSV.read(txt_file, DataFrame; delim=" ", header=header)
println(df)
CSV.write("../inputs/csv/" * fname * ".csv", df)

The txt file:

1234.56789 4232.45455455 -8789.5455 1.0121325235677855e-11

The print out:


fname: test_123
1×4 DataFrame
│ Row │ x       │ y       │ z        │ d           │
│     │ Float64 │ Float64 │ Float64  │ Float64     │
├─────┼─────────┼─────────┼──────────┼─────────────┤
│ 1   │ 1234.57 │ 4232.45 │ -8789.55 │ 1.01213e-11 │

The csv file:

x,y,z,d
1234.56789,4232.45455455,-8789.5455,10121325235677855e-27

In my opinion it should keep the same scientific representation (or perhaps true normalized scientific notation, with exactly one digit before the decimal point, if the original has more than one).
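For what it’s worth, a quick check (a sketch using the values from the example above) shows that the two textual forms denote exactly the same Float64, so the file itself loses no precision:

```julia
# Sketch: the non-normalized text written to the csv file denotes
# exactly the same Float64 as the normalized form from the txt file.
a = parse(Float64, "1.0121325235677855e-11")  # value in the txt file
b = parse(Float64, "10121325235677855e-27")   # value in the csv file
@assert a === b  # identical bit patterns, so no information is lost
```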

Is your problem the lack of read-write invariance (which I guess cannot be helped), or that the scientific notation is not normalized, e.g. as in this MWE:

using CSV, Tables
x = 10121325235677855e-27
io = IOBuffer()
CSV.write(io, [(x = x, )])
String(take!(io))

I think CSV should not change the scientific notation, or if that is not possible, normalize it to the standard scientific notation. In my case it seems to have a different “standard”.

Note that e.g. a Float64 value does not retain this information:

julia> parse(Float64, "1.1") == parse(Float64, "11e-1")
true

so this expectation is a bit unrealistic.

Note that it is called normalized notation, but it is not standard in any sense (though it is quite common). All representations are equally valid.

I wonder if this is a practical concern. Doing it the way it is currently implemented is extremely efficient and also happens to be valid. Is there a program that has problems with the output?

The integer value to a large power is a bit weird. I wonder if it’s due to printing the shortest text representation of the value. That seemed like a good idea at one point but seems a bit weird now.


I think it comes from this PR, and the rationale seems to be being able to use the output of Grisu directly.

I am fine with it either way. I am also more used to eyeballing normalized notation, but I rarely look at CSV files directly (usually only when I am debugging), so as long as it is valid, it should be fine.


Thanks for your reply. It’s valid, it just seems odd. I shouldn’t have called it a bug; it’s more unexpected behavior, but it’s okay if it stays the way it is.

Just to offer a different opinion here: I look at CSV files a lot (all my outputs are basically CSV), and I often inspect them to check whether, or how fast, some algorithm converged (one may call this debugging :wink: ). For this purpose only the order of magnitude of the numbers matters. It would be really annoying if I had to count the digits before the decimal point and add that count to the exponent before I could judge the order of magnitude of a number. This is exactly why we use normalized notation.


I had this problem recently when trying to read data generated in Julia into a commercial CFD simulation software. The data were silently corrupted, and it took me a while to figure out the reason since most of the numbers looked reasonable when inspecting them in the CFD package.

In this case, I think the best route is to open an issue for this package to emit normalized floats.


See a solution (not really “smart”) proposed in New behaviour due to an update of the package CSV when using CSV.write

I just cast the Float64 columns to String to keep the scientific notation used in the DataFrame.
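A minimal sketch of that workaround (the column names and the %.16e format are my choices for illustration, not from the linked thread): format the float columns as normalized-scientific strings with Printf before writing, so CSV.write emits the text verbatim:

```julia
using CSV, DataFrames, Printf

df = DataFrame(x = [1234.56789], d = [1.0121325235677855e-11])

# Replace every float column with normalized scientific strings;
# CSV.write then writes these strings unchanged.
for c in names(df)
    if eltype(df[!, c]) <: AbstractFloat
        df[!, c] = [@sprintf("%.16e", v) for v in df[!, c]]
    end
end

CSV.write("out.csv", df)
```

The downside is that the columns are now strings, so any further numeric processing needs a parse back to Float64.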