Convert a percentage string (e.g. "99.8%") to Float64 when extracting data from CSV

alexlowth · September 8, 2020, 4:50pm

Hi,

Does anyone know if there is a possible way to convert a percentage from a string to a Float64 when extracting data from a CSV. Or is the only way to convert the data to decimal format in excel before pulling through to Julia?

I have tried using the parse function, but that doesn’t work.

As a simplification, I have replaced the CSV extract with the raw string to help visualise:

parse(Float64, “99.8%”) which produces the error ArgumentError: cannot parse “99.8%” as Float64*

Thanks,
Alex

anon37204545 · September 8, 2020, 4:55pm

99.8% is not an acceptable float. What you want is to to take str[1:end-1] (all characters but the last one) and then multiply it by 0.01.

julia> parse(Float64, "99.8%"[1:end-1]) * 0.01
0.998

lmiq · September 8, 2020, 4:55pm

You have to get rid of the “%”:

julia> parse(Float64,replace("99.8%","%"=>""))
99.8

mthelm85 · September 8, 2020, 7:05pm

No Julia thread would be complete without discussing performance. I tried another version of this using RegEx and it’s slower than @lmiq’s version :

using BenchmarkTools

values = ["$(rand(0.0:0.01:100.0))%" for _ in 1:10_000]

julia> @btime [parse(Float64, match(r"[-+]?[0-9]*\.?[0-9]+", val).match) for val in $values]
  6.252 ms (30002 allocations: 2.21 MiB)
10000-element Array{Float64,1}:
 45.26
 30.06
 60.07
 12.7
 59.45
 29.99
 80.86
 83.66
 77.91
  ⋮

julia> @btime [parse(Float64, replace(val,"%" => "")) for val in $values]
  5.855 ms (40002 allocations: 2.21 MiB)
10000-element Array{Float64,1}:
 45.26
 30.06
 60.07
 12.7
 59.45
 29.99
 80.86
 83.66
 77.91
  ⋮

kevbonham · September 8, 2020, 8:01pm

anon37204545:

99.8% is not an acceptable float. What you want is to to take str[1:end-1] (all characters but the last one) and then multiply it by 0.01.
julia> parse(Float64, "99.8%"[1:end-1]) * 0.01
0.998

Note that this does something different than the other two answers, rendering the percent as a fraction (eg 95% becomes 0.95). Also, in general, I would get out of the habit of indexing into strings where possible. It’s convenient, but can cause unexpected results of you have characters that use multiple code points. In this case, it’s probably fine.

sylvaticus · September 8, 2020, 8:24pm

Shouldn’t percentages directly be “understood” by the parse function?

It has not been implemented because lack of time/interest, or some clever reasons?

pdeffebach · September 8, 2020, 8:28pm

I checked R and Stata and neither have those implemented, so it’s not the norm.

I think there is room for a parse function that gets rid of all non-digits and non . or ,

StefanKarpinski · September 8, 2020, 8:42pm

Trying to reimplement Excel in the CSV reader seems like a bad road to go down. Should CSV.jl also start interpreting strings like SEPT1 as dates? If we lean in hard maybe we can get some genes renamed.

StefanKarpinski · September 8, 2020, 8:47pm

As a more general comment, I really don’t understand trying to cram all this intelligence and performance into CSV readers. CSV is an awful format that requires parsing and guessing. If you get data in CSV format, you should parse it exactly once and then save it in a sane format that can just be loaded without all that nonsense. If there are columns that use 99.8% as a number format, load them as strings, parse those strings and convert to numbers and then save the result in a reasonable format that doesn’t have all these problems. If you are repeatedly using a CSV parser on the same data, you have only yourself to blame if that takes longer than you would like it to.

jling · September 8, 2020, 11:11pm

insane, does it mean:

people who deal with genes don’t know how to use Excel (to set columns to “text”)
people USE EXCEL TO do genetic stuff?

xiaodai · September 9, 2020, 12:44am

For me it’s more valuable to have a separate util outside of CSV that can DETECT these issues and suggest improvements. I mean xx.x% are easy patterns to support. Of course, these utils can work with CSV but making it easy to transform the column at read time, but I’d say keep CSV.jl lean .

Per · September 9, 2020, 6:37am

Maybe one could write a small package to implement a Percent type, and make it possible to do

parse(Percent{Float64}, "99.5%")

Parsing CSV would then work as long as the user specifies which columns should be parsed as this type. A missing “%” at the end of the string would then be a parse error.

(The division by 100 should not happen until converting to another numeric type, since 1//100 is not exactly representable in floating-point.)

anon37204545 · September 9, 2020, 7:23am

I tried this once, but I got stuck at promotion and conversion. How should such type act in respect to other Numbers?

alexlowth · September 9, 2020, 8:28am

You say to parse a string that contains “99.8%” but the point of the post was that this exact thing can’t be parsed? Sorry if misunderstood

Tamas_Papp · September 9, 2020, 8:37am

That’s just the tip of the iceberg. The more you learn about what people actually use Excel for, the more you become convinced that the current technological human civilization and specifically scientific research as such is inevitably doomed.

Per · September 9, 2020, 8:58am

Promotion is a bit tricky when introducing new Number types. I would start with something simple like

promote_rule(::Type{Percent{S}}, ::Type{T}) where {S,T <: Number} = promote_type(S,T)

but there would likely be problems with ambiguities that would need to be ironed out.

It would also be nice to have

promote_rule(::Type{Percent{S}}, ::Type{Rational{T}}) where {S <: Integer, T <: Integer} = Rational{promote_type(S,T)}

so that 1% + 1//2 resulted in exactly 51//100 instead of approximately 0.51000000000000001, but this is probably being overly pedantic.

alexlowth · September 9, 2020, 9:09am

Hi Zlatan, thanks for the solution. I guess the next question would be do you know how to do this on a vector of percentage strings? My understanding is that this would work for a single string

E.g: a1_rfr_disc
Vector{Union{Missing, String}} with 10 elements
“99.8%”
“98.8%”
“97.1%”
“94.6%”
“91.7%”
“88.5%”
“85.3%”
“82.1%”
“78.9%”
“75.9%”

anon37204545 · September 9, 2020, 9:11am

A clear solution would be to make a method which does what I described:

julia> percent(str) = parse(Float64, str[1:end-1]) * 0.01
percent (generic function with 1 method)

and then broadcast it (using .) to a vector:

julia> percent.(["99.8%", "99.7%"])
2-element Array{Float64,1}:
 0.998
 0.997

alexlowth · September 9, 2020, 9:13am

Ah yes, I didn’t think of that but makes sense. Thanks

bernhard · September 9, 2020, 10:37am

Actually it mainly means that Excel does some very strange stuff to data, which can be difficult to prevent.
Eg when opening a CSV file Excel will automatically format the data and there is nothing you can do against it (iirc).