Does anyone know if there is a possible way to convert a percentage from a string to a Float64 when extracting data from a CSV. Or is the only way to convert the data to decimal format in excel before pulling through to Julia?
I have tried using the parse function, but that doesn’t work.
As a simplification, I have replaced the CSV extract with the raw string to help visualise:
parse(Float64, “99.8%”) which produces the error ArgumentError: cannot parse “99.8%” as Float64*
Note that this does something different than the other two answers, rendering the percent as a fraction (eg 95% becomes 0.95). Also, in general, I would get out of the habit of indexing into strings where possible. It’s convenient, but can cause unexpected results of you have characters that use multiple code points. In this case, it’s probably fine.
Trying to reimplement Excel in the CSV reader seems like a bad road to go down. Should CSV.jl also start interpreting strings like SEPT1 as dates? If we lean in hard maybe we can get some genes renamed.
As a more general comment, I really don’t understand trying to cram all this intelligence and performance into CSV readers. CSV is an awful format that requires parsing and guessing. If you get data in CSV format, you should parse it exactly once and then save it in a sane format that can just be loaded without all that nonsense. If there are columns that use 99.8% as a number format, load them as strings, parse those strings and convert to numbers and then save the result in a reasonable format that doesn’t have all these problems. If you are repeatedly using a CSV parser on the same data, you have only yourself to blame if that takes longer than you would like it to.
For me it’s more valuable to have a separate util outside of CSV that can DETECT these issues and suggest improvements. I mean xx.x% are easy patterns to support. Of course, these utils can work with CSV but making it easy to transform the column at read time, but I’d say keep CSV.jl lean .
Maybe one could write a small package to implement a Percent type, and make it possible to do
parse(Percent{Float64}, "99.5%")
Parsing CSV would then work as long as the user specifies which columns should be parsed as this type. A missing “%” at the end of the string would then be a parse error.
(The division by 100 should not happen until converting to another numeric type, since 1//100 is not exactly representable in floating-point.)
That’s just the tip of the iceberg. The more you learn about what people actually use Excel for, the more you become convinced that the current technological human civilization and specifically scientific research as such is inevitably doomed.
Hi Zlatan, thanks for the solution. I guess the next question would be do you know how to do this on a vector of percentage strings? My understanding is that this would work for a single string
E.g: a1_rfr_disc
Vector{Union{Missing, String}} with 10 elements
“99.8%”
“98.8%”
“97.1%”
“94.6%”
“91.7%”
“88.5%”
“85.3%”
“82.1%”
“78.9%”
“75.9%”
Actually it mainly means that Excel does some very strange stuff to data, which can be difficult to prevent.
Eg when opening a CSV file Excel will automatically format the data and there is nothing you can do against it (iirc).