Foolproof method for converting to Float64

I am reading some data from a CSV file which is intended to be Float64, but it does not always parse as Float64: sometimes its type is Any, sometimes String, sometimes Integer, etc.

To convert it to Float64 I use parse.(Float64, DataFrame.colname), but it does not always work: the data may contain a currency symbol or a comma separator.

Imagine that the data in that particular column could be in one of the formats given below.

Is there a way of converting it to Float64 that always works? I am looking for a simple function which converts it into Float64.

A1 = ["£131,222.23","£223,422"]   # I can't use $ symbol here as it throws an error.
A2 = ["131,222.23","223,422"]
A3 = ["£131222.23","£223422"]
A4 = ["131222.23","223422"]
A5 = [131222.23,223422]

For the strings, remove the £ and comma characters before parsing. For the values that are already floats, don’t do anything. If you’re not sure which type you have, either redesign your system so that you always have the same type by telling CSV to read everything as strings, or dispatch on the type.
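A minimal sketch of that dispatch approach (the helper name tofloat is mine, not from any package):

# Dispatch on the element type; clean strings before parsing.
tofloat(x::Real) = Float64(x)   # already numeric: just convert
tofloat(x::AbstractString) =    # strip the currency symbol and thousands separators
    parse(Float64, replace(x, "£" => "", "," => ""))   # multiple pairs need Julia 1.7+

A1 = ["£131,222.23", "£223,422"]
A5 = [131222.23, 223422]

tofloat.(A1)   # [131222.23, 223422.0]
tofloat.(A5)   # [131222.23, 223422.0]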

The CSV file has several fields of different types. I am not sure whether one can specify a particular type for one specific column. I gave an example of only a part of the data.

You can

https://csv.juliadata.org/stable/#Providing-Types
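For example (a sketch; "data.csv" and :price are placeholder names, not from your data):

using CSV, DataFrames

# The `types` keyword can force a single column to a given type,
# e.g. read the currency column as String so you can clean it yourself.
df = CSV.read("data.csv", DataFrame; types = Dict(:price => String))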

The problem here is an expectation that data that isn’t quite numerical (contains units, commas, etc.) should still be treated numerically. One could argue that certain patterns like these should be caught by the CSV reader; however, doing that pattern matching by default is expensive, and might not actually be intended (some users might want the currency symbol, or the value isn’t actually a number, it just looks like one). In such a case, it’s best to write your own conversion routine that strips out such characters (with replace or something similar) and then converts the resulting strings to Float64.
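One possible version of such a routine (a sketch with names of my own choosing; tryparse returns nothing for anything that still isn’t numeric, which is mapped to missing here):

function cleanparse(x)
    x isa Real && return Float64(x)
    ismissing(x) && return missing
    s = replace(String(x), "£" => "", "," => "")   # strip symbols and separators (multiple pairs need Julia 1.7+)
    v = tryparse(Float64, s)
    v === nothing ? missing : v
end

df.colname = cleanparse.(df.colname)   # broadcast over the whole column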


Small aside: it seems like you’re parsing currency into floating point. Note that this is usually not a good idea, since floating point cannot accurately represent every decimal value you may get, and may introduce round-off errors (which is the last thing you want when dealing with currency). I’d recommend modeling your currency explicitly (e.g. with a struct containing two fields, one for the whole part and the other for the two-digit fraction, with appropriate definitions for +, -, *, / etc.; another option would be to store whole cents instead, removing the need for fractional parts in the first place) and instructing CSV.jl to parse the relevant columns into your format instead.
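A rough sketch of the whole-pence idea (the struct name and parsing helper are illustrative only, not an existing package API; for real use there are packages that already implement fixed-point decimals, e.g. FixedPointDecimals.jl):

struct Pence
    value::Int   # whole pence, e.g. £131,222.23 stored as 13122223
end

Base.:+(a::Pence, b::Pence) = Pence(a.value + b.value)
Base.:-(a::Pence, b::Pence) = Pence(a.value - b.value)

# Parse an already-cleaned string like "131222.23" exactly, with no round-off.
function Pence(s::AbstractString)
    whole, frac = occursin('.', s) ? split(s, '.') : (s, "00")
    Pence(parse(Int, whole) * 100 + parse(Int, rpad(frac, 2, '0')))
end

Pence("131222.23").value   # 13122223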

In either case, you still have to canonicalize the to-be-parsed element.


Unless the OP is working in high-performance finance where these round-off errors matter, this advice is likely overkill.

Uhh… I hope my bank’s frontend doesn’t aggregate my bank statements using floating point, even though that’s not high-performance finance :grimacing: It’s good practice, since it prevents hard-to-track bugs caused by the order in which things are summed.


The data is originally contained in an Excel file, so it is in one of the formats that Excel regards as numerical. This data is then copied over to a CSV file as-is, without any changes. I think Excel limits floats to 15-16 significant digits, so accuracy is already reduced.

There is also Tutorial · XLSX.jl if you want to skip the CSV step.
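An untested sketch of reading the sheet directly ("data.xlsx" and "Sheet1" are placeholders):

using XLSX, DataFrames

# Recent XLSX.jl versions return a Tables.jl-compatible object;
# on older versions you may need to splat: DataFrame(XLSX.readtable(...)...)
df = DataFrame(XLSX.readtable("data.xlsx", "Sheet1"))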


You may want to read this part of the docs. Also (disclaimer: I haven’t tried it myself), this could be useful.