Formatting Float64 output with CSV.write()

Hello,
I am using the CSV package to export a DataFrame. One of the columns has DataType Float64, so it get exported using scientific notation format and the software that reads it (SQL Server bcp) cannot read it properly. I have tried to use the typemap options, and I don’t think it is the right way.

Ideally, I would like to limit the number of decimal to 8 and write to CSV the number like in the following example:

1234567.1234567890123456
to
1234567.12345678
instead of
1.2345671234567890123456e6

Here a simple test case:

df = DataFrame(Test =1e6*rand():1e6:4e6 )
CSV.write("C:\\temp\\test.csv",df)

This will print something similar to the following

Test
848641.331046621
1.8486413310466208e6
2.848641331046621e6
3.848641331046621e6

Ideally I would like to have

Test
848641.33104662
1848641.33104662
2848641.33104662
3848641.33104662

Any idea?

Thanks

Could you give more information? The Microsoft SQL bcp documentation says that scientific notation is supported, but it looks like it might need a sign after e, i.e. 1.23e+6 and not 1.23e6. If that’s the only problem, it would be easy to add the + via post-processing.

1 Like

@stevengj, you are indeed right, BCP does not have issues with the scientific notation (regardless of ± sign)
The issue I am facing is caused by the target field which is a DECIMAL(26,8) in the database. In the CSV file generated it is much longer as it comes from a Float64.
Something like this will fail:

289126.52955090767
#@ Row 1, Column 1: String data, right truncation @#

Something like this would succeed

289126.52955090

From this, the need to truncate the 8th decimal

One option would be to just loop over rows and columns and output CSV yourself — it’s not a complicated format to write.

Using the Printf package, you can output in whatever format you want, e.g.

julia> @printf("%0.8f", 1234567.1234567890123456)
1234567.12345679
1 Like

Steve,
That is what we have in place right now and I am trying to replace it with something more performant.
The CSV package is ~y times faster than the rows/columns loop :sunny:

This package should be more performant than Printf.

https://github.com/JuliaString/Format.jl

Could you share your benchmark?

Tom,
I edited the existing code to use the package you recommended.
Before it was using @sprintf (see the commented code).

fmt = "%.8f"
open(file_path1,"w") do f
  for irow in 1:size(df)[1]
    for icol in 1:size(df)[2]
      if typeof(df[irow,icol]) == Float64
        #write(f, @sprintf "%0.8f" df[irow,icol])
        write(f, cfmt(fmt, df[irow,icol]))
      else
        write(f, string(df[irow,icol]))
      end
        write(f, ",")
    end           
  end

The results are not what I expected as the execution time went from

17.255884 seconds

to

423.600634 seconds 

So please let me know if I did something wrong in the above code.

ps With the CSV library it takes just ~2.5 seconds

Rather than @sprintf, which allocates a string that you then write to f, you should do

@printf(f, "%0.8f" df[irow,icol])

to write directly to the stream f.

Similarly, don’t do write(f, string(df[irow,icol])), do

print(f, df[irow,icol])

In general, use print for writing text, and write for raw binary representations.

Also, don’t benchmark in global scope. Put all performance-critical code into a function.

1 Like

Steve,
Thanks for the advice - I implemented the changes you recommended, unfortunately the impact is very low, in the range of a second.
I tried to remove the @printf(f, "%0.8f", ...) and the execution time went down to ~7 seconds! Which is still twice slower than the CSV… and not valid for my BCP job…

Finally,
I ended up changing the data type of the DataFrame to comply the database requirements.
I preprocess the Float64 columns this way:

df[:Test] = map( (x) -> string(round(x; digits=8)) , df[:Test])

This way the column contents are already truncated to fit the database DECIMAL(26, 8) data type we have in the database, so the BCP will not complain about the length anymore. The CSV.write() does not provide formatting flexibility, so adapting the input is, at least for now I think, the way to go.

Please shoot if you have a better way to do this!

Thanks

Good question. I haven’t tested in months but now Printf is a lot faster than before! There must have been huge improvements recently.

Here’s my test result:

julia> x = rand(100_000);

julia> foo(x) = open("/tmp/test.csv", "w") do f 
           for v in x
               @printf(f, "%16.8f\n",v)
           end
       end
foo (generic function with 1 method)

julia> goo(x) = open("/tmp/test2.csv", "w") do f 
           fspec = FormatExpr("{:16.8f}")
           for v in x
               printfmtln(f, fspec, v)
           end
       end
goo (generic function with 1 method)

julia> @btime foo($x)
  36.167 ms (11 allocations: 688 bytes)

julia> @btime goo($x)
  52.566 ms (500032 allocations: 9.92 MiB)
1 Like