Convert a percentage string (e.g. "99.8%") to Float64 when extracting data from CSV

Speaking as a biologist who, up to and including my PhD dissertation, used Excel exclusively, I’m offended by this statement. Also, it’s 100% accurate :joy:

9 Likes

Unfortunately some of us have to exchange data with folks for whom CSV is the lowest common denominator. I work in an engineering research environment where each researcher is fairly independent, but we have to work together and exchange data. All data is sent around and stored in either CSV or Excel format because anyone can manage to ingest it into whatever script/code/program they are using. I could start building up my own duplicate store of data in a better format, but I’d be the only one using it and I’d have to convert anything I created back to CSV to pass it on to other folks.

It comes with pros and cons. The independence afforded to my colleagues and myself lets me use Julia when I want instead of being locked into Matlab/C/Fortran but causes data interchange issues.

2 Likes

The president of the Brazilian central bank runs his economic models in Excel spreadsheets, and from those defines the interest rates and the future of our economy. I am not sure whether the spreadsheets or the economic models are the worse part.

2 Likes

If R, or Julia, gained widespread acceptance for everyday business work, would productivity improve substantially? The way Excel seems to have increased output per worker throughout the 1990s? There’s a paper there.

CSV is actually the least problematic of the “popular” file formats for data exchange between different systems that I usually come across. At least it is better than xlsx, Access, fixed-width ASCII and (the unfortunately increasingly popular) XML…

This is tricky to measure. Suppose I manage to crank out answers at twice the speed, but 50% of them are wrong (this is only learned months later, and the only way to tell is if someone replicates my work). Did my productivity improve?

1 Like

Do businesses have internal estimates of errors? If I invoice the wrong amount due to a bad Excel workbook, are there repercussions? Perhaps firm-wide smoothing makes these negligible.

In principle yes, it is called “Operational Risk” and is also required from the regulatory side. IT risk is one part of it, but there are other aspects too, e.g. legal risk, reputational risk, etc.
There is some regulatory pressure to move away from end-user computing (Excel, Access, etc.) to “real” IT applications, at least for the business- and regulatory-critical parts. But this is a lengthy and non-trivial process.

1 Like

I don’t know what CSV or Excel have to do with my request. I was referring to the Base function parse.

As it understands scientific notation (e.g. parse(Float64,"1.2E-02") works), I don’t see why it shouldn’t understand parse(Float64,"1.2%").

I understand that the first format is more common in the current Julia user community, but I would not look down on everyone who uses percentages…
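For now I just strip the trailing % myself with a small helper (the name parse_percent is purely illustrative, it is not part of Base):

parse_percent(s::AbstractString) = parse(Float64, rstrip(strip(s), '%')) / 100

julia> parse_percent("50%")
0.5

julia> parse_percent("1.5%")
0.015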

My guess is that it’s because the scientific format is part of some IEEE standard, while the percent form is not.

1 Like

This made me wonder…do users in countries that use commas in place of decimal points have to write custom functions to parse strings to numbers?

# Comma separator
julia> parse(Float64, "99,2")
ERROR: ArgumentError: cannot parse "99,2" as Float64

# Point separator
julia> parse(Float64, "99.2")
99.2
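
The only workaround I know of (just a sketch, nothing built in) is to swap the separator before parsing:

julia> parse(Float64, replace("99,2", ',' => '.'))
99.2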

If I’m not mistaken, certain countries use an apostrophe as a thousands separator. When you say “I don’t see why it shouldn’t understand parse(Float64,"1.2%")”, I think the issue is a bit more complicated than it may seem at first glance (as is almost always the case). Indeed, a quick Google search reveals that there are countries where the % sign is actually placed before the number.

There might actually be enough variety here to write a package dedicated entirely to parsing different numbers represented as strings…a good name would be tough…CulturallySensitiveParser.jl is a bit of a mouthful :laughing:

What you are discussing is another issue, localisation:

https://unix.stackexchange.com/a/87763/77329

Because that’s not a valid floating-point number for Julia. E.g. try 1.2% in the REPL.

Generally, parse(T, ...) should parse values of type T and reject everything else. The clean solution is defining a method of tryparse for a custom type, as suggested above by @Per.
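
A minimal sketch of that approach (the Percent wrapper and the stored-as-a-fraction convention are just illustrative choices):

struct Percent
    value::Float64    # stored as a fraction: Percent(0.5) for "50%"
end

function Base.tryparse(::Type{Percent}, s::AbstractString)
    endswith(s, "%") || return nothing
    x = tryparse(Float64, chop(s))    # chop drops the trailing '%'
    return x === nothing ? nothing : Percent(x / 100)
end

function Base.parse(::Type{Percent}, s::AbstractString)
    p = tryparse(Percent, s)
    p === nothing && throw(ArgumentError("cannot parse $(repr(s)) as Percent"))
    return p
end

julia> tryparse(Percent, "50%")
Percent(0.5)

julia> tryparse(Percent, "50")   # no percent sign, so this returns nothing

julia> parse(Percent, "abc%")
ERROR: ArgumentError: cannot parse "abc%" as Percent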

CSV.jl allows other decimal separators when importing a file:

CSV.File(file; delim=';', decimal=',')

https://juliadata.github.io/CSV.jl/stable/#Custom-Decimal-Separator-1
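
For example, a small self-contained check (the column names and values are made up):

using CSV

io = IOBuffer("a;b\n1,5;2,25\n3,0;4,75\n")
f = CSV.File(io; delim=';', decimal=',')
[row.a for row in f]    # [1.5, 3.0] -- parsed as Float64 despite the comma decimals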

I’m in France and this is a constant problem. Excel not only displays commas instead of points, but also stores them in the file. Every programming language I’ve encountered sticks with points, even if we use commas in everyday life.
(also, function names are translated in Excel, help)

2 Likes

This does not imply that you have to use CSV though; most contemporary programming languages and data analysis tools can manage at least HDF5 and JSON just fine these days.
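
For instance, a rough sketch using two commonly used packages, JSON3.jl and HDF5.jl (the file names and the dataset path are arbitrary):

using JSON3, HDF5

data = Dict("run" => 1, "temperature" => [20.1, 20.3, 20.2])

# JSON: human-readable, handles nested structures, readable from almost anywhere
write("results.json", JSON3.write(data))
back = JSON3.read(read("results.json", String))

# HDF5: binary, typed arrays, better suited to large numeric datasets
h5write("results.h5", "run1/temperature", data["temperature"])
temps = h5read("results.h5", "run1/temperature")   # Vector{Float64}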

That said, CSV can be a reasonable choice if its limitations are understood and some conventions are agreed upon. Problems arise because it is very easy to stretch the boundaries of CSV and/or creep outside various conventions without anyone noticing for a while, resulting in data corruption.

In the long run, a lot of teams end up paying a high price for using CSV (and unsanitized data in general). Most start out with CSV and only switch after some eye-opening event, e.g. finding that the last 5 months of results are invalid because of some silly quirk and require a lot of person-hours to fix.

I’d love to do that, but I’m the low man on the totem pole and I won’t convince all the 30-years-of-experience folks to change trajectory. None of us are programmers; we are experimentalists who have come to use code as a fact of dealing with scientific data in this day and age. To those folks late in their careers, learning anything about modern programming and formats would just be a waste when they could be doing “real work” (according to them).

I lean towards a view of “improving the process now will pay dividends for a long time”, but I don’t get to make the rules. So, being the low man on the totem pole, if I send an HDF5/JSON file to someone and offer to teach them how/why to use it, they just say “no, get me the data I need in a CSV now so I can keep moving”, otherwise I’m wasting the time of a more valuable employee. In a few decades I’ll get to tell the young kids how I want data handed to me.

1 Like

:wink: