Formatting Float64 output with CSV.write()

Alessio_Bellisomi · April 24, 2019, 5:19pm

Hello,
I am using the CSV package to export a DataFrame. One of the columns has DataType Float64, so it get exported using scientific notation format and the software that reads it (SQL Server bcp) cannot read it properly. I have tried to use the typemap options, and I don’t think it is the right way.

Ideally, I would like to limit the number of decimal to 8 and write to CSV the number like in the following example:

1234567.1234567890123456
to
1234567.12345678
instead of
1.2345671234567890123456e6

Here a simple test case:

df = DataFrame(Test =1e6*rand():1e6:4e6 )
CSV.write("C:\\temp\\test.csv",df)

This will print something similar to the following

Test
848641.331046621
1.8486413310466208e6
2.848641331046621e6
3.848641331046621e6

Ideally I would like to have

Test
848641.33104662
1848641.33104662
2848641.33104662
3848641.33104662

Any idea?

Thanks

stevengj · April 24, 2019, 6:17pm

Could you give more information? The Microsoft SQL bcp documentation says that scientific notation is supported, but it looks like it might need a sign after e, i.e. 1.23e+6 and not 1.23e6. If that’s the only problem, it would be easy to add the + via post-processing.

Alessio_Bellisomi · April 25, 2019, 11:36am

@stevengj, you are indeed right, BCP does not have issues with the scientific notation (regardless of ± sign)
The issue I am facing is caused by the target field which is a DECIMAL(26,8) in the database. In the CSV file generated it is much longer as it comes from a Float64.
Something like this will fail:

289126.52955090767
#@ Row 1, Column 1: String data, right truncation @#

Something like this would succeed

289126.52955090

From this, the need to truncate the 8th decimal

stevengj · April 25, 2019, 12:30pm

One option would be to just loop over rows and columns and output CSV yourself — it’s not a complicated format to write.

Using the Printf package, you can output in whatever format you want, e.g.

julia> @printf("%0.8f", 1234567.1234567890123456)
1234567.12345679

Alessio_Bellisomi · April 25, 2019, 12:46pm

Steve,
That is what we have in place right now and I am trying to replace it with something more performant.
The CSV package is ~y times faster than the rows/columns loop

tk3369 · April 25, 2019, 1:17pm

This package should be more performant than Printf.

https://github.com/JuliaString/Format.jl

kristoffer.carlsson · April 25, 2019, 2:35pm

Could you share your benchmark?

Alessio_Bellisomi · April 25, 2019, 3:27pm

Tom,
I edited the existing code to use the package you recommended.
Before it was using @sprintf (see the commented code).

fmt = "%.8f"
open(file_path1,"w") do f
  for irow in 1:size(df)[1]
    for icol in 1:size(df)[2]
      if typeof(df[irow,icol]) == Float64
        #write(f, @sprintf "%0.8f" df[irow,icol])
        write(f, cfmt(fmt, df[irow,icol]))
      else
        write(f, string(df[irow,icol]))
      end
        write(f, ",")
    end           
  end

The results are not what I expected as the execution time went from

17.255884 seconds

to

423.600634 seconds

So please let me know if I did something wrong in the above code.

ps With the CSV library it takes just ~2.5 seconds

stevengj · April 25, 2019, 3:30pm

Rather than @sprintf, which allocates a string that you then write to f, you should do

@printf(f, "%0.8f" df[irow,icol])

to write directly to the stream f.

Similarly, don’t do write(f, string(df[irow,icol])), do

print(f, df[irow,icol])

In general, use print for writing text, and write for raw binary representations.

Also, don’t benchmark in global scope. Put all performance-critical code into a function.

Alessio_Bellisomi · April 25, 2019, 4:09pm

Steve,
Thanks for the advice - I implemented the changes you recommended, unfortunately the impact is very low, in the range of a second.
I tried to remove the @printf(f, "%0.8f", ...) and the execution time went down to ~7 seconds! Which is still twice slower than the CSV… and not valid for my BCP job…

Alessio_Bellisomi · April 25, 2019, 5:59pm

Finally,
I ended up changing the data type of the DataFrame to comply the database requirements.
I preprocess the Float64 columns this way:

df[:Test] = map( (x) -> string(round(x; digits=8)) , df[:Test])

This way the column contents are already truncated to fit the database DECIMAL(26, 8) data type we have in the database, so the BCP will not complain about the length anymore. The CSV.write() does not provide formatting flexibility, so adapting the input is, at least for now I think, the way to go.

Please shoot if you have a better way to do this!

Thanks

tk3369 · April 26, 2019, 6:33am

Good question. I haven’t tested in months but now Printf is a lot faster than before! There must have been huge improvements recently.

Here’s my test result:

julia> x = rand(100_000);

julia> foo(x) = open("/tmp/test.csv", "w") do f 
           for v in x
               @printf(f, "%16.8f\n",v)
           end
       end
foo (generic function with 1 method)

julia> goo(x) = open("/tmp/test2.csv", "w") do f 
           fspec = FormatExpr("{:16.8f}")
           for v in x
               printfmtln(f, fspec, v)
           end
       end
goo (generic function with 1 method)

julia> @btime foo($x)
  36.167 ms (11 allocations: 688 bytes)

julia> @btime goo($x)
  52.566 ms (500032 allocations: 9.92 MiB)

Topic		Replies	Views
New behaviour due to an update of the package CSV when using CSV.write Performance question	7	1226	July 17, 2019
Writing CSV files with fixed decimal format Data question , package , csv , formatting	7	766	November 22, 2023
CSV ruins scientific notation General Usage dataframes , csv	10	2664	July 17, 2019
Formatted output in file based on DataFrames and CSV New to Julia	2	539	November 8, 2018
Converting Float64 into String but in scientific notation General Usage	3	5620	August 18, 2018

Formatting Float64 output with CSV.write()

Related topics