Hello,
I am using the CSV package to export a DataFrame. One of the columns has the DataType Float64, so it gets exported in scientific notation, and the software that reads it (SQL Server bcp) cannot parse it properly. I have tried to use the typemap option, but I don't think it is the right way.
Ideally, I would like to limit the number of decimals to 8 and write the number to the CSV like in the following example:
1234567.1234567890123456
to
1234567.12345678
instead of
1.2345671234567890123456e6
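For reference, here is a minimal way to reproduce what I am seeing (the DataFrame and column name are made up for illustration, and the exact digits in the output may differ):

using CSV, DataFrames

df = DataFrame(x = [1234567.1234567890123456])
CSV.write(stdout, df)   # the Float64 comes out in full precision, e.g. 1.2345671234567889e6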
Could you give more information? The Microsoft SQL bcp documentation says that scientific notation is supported, but it looks like it might need a sign after e, i.e. 1.23e+6 and not 1.23e6. If that’s the only problem, it would be easy to add the + via post-processing.
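If so, something like this untested sketch should do it (the file names are placeholders, and it assumes an "e" followed by a digit only ever appears in an exponent):

open("data_fixed.csv", "w") do out
    for line in eachline("data.csv")
        println(out, replace(line, r"e(?=\d)" => "e+"))   # 1.23e6 -> 1.23e+6
    end
end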
@stevengj, you are indeed right: BCP does not have issues with the scientific notation (regardless of the ± sign).
The issue I am facing is caused by the target field, which is a DECIMAL(26,8) in the database. The value in the generated CSV file is much longer than that, since it comes from a Float64.
Something like the value from my first example will fail: 1.2345671234567890123456e6 carries more fractional digits than the 8 that DECIMAL(26,8) allows.
Steve,
That is what we have in place right now, and I am trying to replace it with something more performant.
The CSV package is ~y times faster than the row-by-row loop.
Tom,
I edited the existing code to use the Formatting package you recommended.
Before, it was using @sprintf (see the commented line).
fmt = "%.8f"
open(file_path1,"w") do f
for irow in 1:size(df)[1]
for icol in 1:size(df)[2]
if typeof(df[irow,icol]) == Float64
#write(f, @sprintf "%0.8f" df[irow,icol])
write(f, cfmt(fmt, df[irow,icol]))
else
write(f, string(df[irow,icol]))
end
write(f, ",")
end
end
The results are not what I expected, as the execution time went from 17.255884 seconds to 423.600634 seconds.
So please let me know if I did something wrong in the above code.
P.S. With the CSV library it takes just ~2.5 seconds.
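One suspicion on my side: cfmt re-parses the format string on every call. Formatting.jl also provides generate_formatter, which compiles the format string once into a function; an untested sketch of that idea:

using Formatting

const fmt8 = generate_formatter("%.8f")   # compile the format string once

# then reuse the compiled formatter in the inner loop:
# write(f, fmt8(df[irow, icol]))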
Steve,
Thanks for the advice - I implemented the changes you recommended; unfortunately the impact is very small, in the range of a second.
I tried removing the @printf(f, "%0.8f", ...) call and the execution time went down to ~7 seconds! That is still roughly twice as slow as CSV… and the output is not valid for my BCP job…
Rounding the Float64 columns before writing means the column contents are already truncated to fit the DECIMAL(26, 8) data type we have in the database, so BCP will not complain about the length anymore. CSV.write() does not provide formatting flexibility, so adapting the input is, at least for now, the way to go.
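A minimal sketch of that rounding step, assuming the float columns can be rounded in place (the eltype-based column selection is my own illustration):

using CSV, DataFrames

for col in names(df)
    if eltype(df[!, col]) <: AbstractFloat
        df[!, col] .= round.(df[!, col]; digits = 8)   # cap at 8 decimals for DECIMAL(26,8)
    end
end
CSV.write(file_path1, df)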
Good question. I haven't tested in months, but Printf is now a lot faster than before! There must have been big improvements recently.
Here’s my test result:
julia> using Printf, Formatting, BenchmarkTools

julia> x = rand(100_000);

julia> foo(x) = open("/tmp/test.csv", "w") do f
           for v in x
               @printf(f, "%16.8f\n", v)
           end
       end
foo (generic function with 1 method)

julia> goo(x) = open("/tmp/test2.csv", "w") do f
           fspec = FormatExpr("{:16.8f}")
           for v in x
               printfmtln(f, fspec, v)
           end
       end
goo (generic function with 1 method)

julia> @btime foo($x)
  36.167 ms (11 allocations: 688 bytes)

julia> @btime goo($x)
  52.566 ms (500032 allocations: 9.92 MiB)