Parse string representation of a vector into floats

How can I parse the following string:

"Float32[0.0, 8.455541, 16.61883, 24.443338, 31.901924, 38.97408, 45.642822, 46.02448, 47.268967, 47.10146, 47.10146]"

as a vector of Float32 values?

julia> str = "Float32[0.0, 8.455541]";

julia> m = match(r"\[(.*)\]", str)
RegexMatch("[0.0, 8.455541]", 1="0.0, 8.455541")

julia> strs = split(m[1], ','; keepempty=false)
2-element Vector{SubString{String}}:
 "0.0"
 " 8.455541"

julia> parse.(Float32, strs)
2-element Vector{Float32}:
 0.0
 8.455541

(Why do you have a string like that in the first place?)

Just to add: the strongly discouraged way to do this is to use Meta.parse and eval, which lets you treat a string as Julia code. This is almost never the best way to do things.
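
For completeness, here is a minimal sketch of what that looks like. It works only because the string happens to be a valid Julia array literal, and eval will run whatever code the string contains, so it should only ever be considered for input you fully trust.

```julia
str = "Float32[0.0, 3.5440563f-28, 2.8241575]"

# Meta.parse only builds an expression; nothing is executed yet.
expr = Meta.parse(str)

# eval executes the expression as Julia code. This would also run any
# other code present in the string, which is why it is discouraged.
v = eval(expr)
```

Since the "f" exponent notation is Julia's own literal syntax for Float32, this route has no trouble with values like 3.5440563f-28.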

I am importing a CSV file which itself was exported from an .arrow file.

This mostly works, but sometimes it does not. Example:

"Float32[0.0, 3.5440563f-28, -8.439492f-25, 7.1634285f-22, -4.307094f-19, 1.672701f-16, 1.2102875f-14, 5.515009f-14, 8.999296f-14, 2.8241575, -2.8241575]"

This results in:

ERROR: LoadError: ArgumentError: cannot parse " 3.5440563f-28" as Float32

What would be a good way to avoid this error?

Use str = replace(str, 'f'=>'e') before parsing. You could also use Parsers.jl (JuliaData/Parsers.jl, "fast parsing machinery for basic types in Julia"), which should support this format directly.

See JuliaLang/julia issue #5690, "`parse` can't parse our own Float32 output format".
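
A possible sketch that combines the bracket extraction from earlier in the thread with the exponent replacement, using only Base. The name parse_float32_vector is hypothetical. One assumption worth flagging: a plain 'f'=>'e' replacement would also alter a token such as Inf, so this version only rewrites an "f" that sits between a digit and an exponent.

```julia
# Hypothetical helper, not part of Base or any package.
function parse_float32_vector(str::AbstractString)
    m = match(r"\[(.*)\]", str)
    m === nothing && throw(ArgumentError("no bracketed list found in: $str"))
    # Rewrite the Float32 exponent marker "f" as "e", but only when it is
    # preceded by a digit and followed by an optionally signed digit.
    body = replace(m[1], r"(?<=\d)f(?=[-+]?\d)" => "e")
    # Parse each comma-separated element; strip removes the padding spaces.
    return Float32[parse(Float32, strip(s)) for s in split(body, ','; keepempty=false)]
end

v = parse_float32_vector("Float32[0.0, 3.5440563f-28, -8.439492f-25, 2.8241575]")
```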

Why is Float32[ in the CSV file? Wouldn’t it just be the actual float values as strings?

Because I exported it from a StructArray using CSV.jl

Correction:
syslog is a struct array and some of the elements of the struct are Vectors.
CSV.write is writing them in this way.

For future reference:

using Parsers

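function parse_vector(str)
    # Capture everything between the outer brackets.
    m = match(r"\[(.*)\]", str)
    # Split into one substring per element.
    strs = split(m[1], ','; keepempty=false)
    # Parsers.jl is expected to handle the "f" exponent notation.
    Parsers.parse.(Float32, strs)
end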

If you are controlling the format, then I would use a different format from CSV for this kind of data. Something like JSON or JLD/HDF5 seems more suited.

Well, by default I use the Apache Arrow format. But it causes some issues when the format changes (when I add or remove columns). In that case, CSV is a more generic choice that still works after I have added columns. So if I have an old log file, I export it as CSV, import it into the new format, and then save it in the new format as an .arrow file.

I needed this function for this purpose.

And I am very happy to have a 2.5 dimensional log file format where columns can contain vectors. Easy to process and filter and export to Python, for example.

Well, except for this problem. I think that @stevengj’s point is that in this case you’re saving more complex nested data structures, which makes CSV awkward.

If you’re going to have CSVs where some columns are arrays, but not get any more complex than that, it might be better to custom print those in a format that’s simpler to parse instead of relying on Julia’s value printing, e.g. represent a vector of values as a string containing delimiter-separated values.
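
As a rough illustration of that idea, the sketch below stores each vector as a semicolon-separated string and parses it back. The names encode_vector and decode_vector and the choice of ';' as the delimiter are assumptions made for this example, not something CSV.jl provides. It relies on print writing Float32 values with an "e" exponent, which parse accepts.

```julia
# Hypothetical helpers for a delimiter-separated vector cell.

# join prints each element, so exponents are written with "e" rather than "f".
encode_vector(v::AbstractVector{<:Real}) = join(v, ';')

# Split on the delimiter and parse each piece back to Float32.
function decode_vector(s::AbstractString)
    return Float32[parse(Float32, x) for x in split(s, ';'; keepempty=false)]
end

original = Float32[0.0, 3.5440563f-28, -2.8241575]
cell = encode_vector(original)   # string to store in the CSV column
restored = decode_vector(cell)
```

The vector columns would need to be converted with something like encode_vector before calling CSV.write, and converted back after reading.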

If the structure will be arbitrarily nested and complex, then you’re going to keep hitting these issues and will definitely want a standard format like JSON or HDF5, both of which can be easily read in other languages, but it sounds like you’re not in that case.

Well, I used Google protocol buffer messages for our log files in the past to be fully n-dimensional, but in the end, if you want to plot or analyze the data you need 2-dimensional data with a fixed sampling rate.

My system is represented as a particle system with a small number of particles, like 11, and I have columns for the X, Y and Z position vectors of these particles, and one column for the quaternion that represents the orientation of the kite.

No need for deeply nested structures.

And I like Julia's way of printing vectors: easily readable and clear. You can round-trip it because the type is included. OK, the “f” instead of “e” for the exponent in scientific notation is somewhat confusing.

You could also explode that into multiple scalar columns.

Very bad idea. Well, for the orientation that would be an option, but for the position vectors not, because the number of elements varies. And I do not want to have a different number of columns depending on the simulation that I run. Finally, CSV is not the main format I use, the main format is Apache arrow, and in arrow format you can have vectors in a column in a very well defined way. And in the CSV format I want to have the same columns as in the arrow files, otherwise automatic conversion is not possible.

It is not so much a bad idea as a badly defined problem to begin with.

Another option:

str = "Float32[0.0, 3.5440563f-28, -8.439492f-25, 7.1634285f-22, -4.307094f-19, 1.672701f-16, 1.2102875f-14, 5.515009f-14, 8.999296f-14, 2.8241575, -2.8241575]"

ix1, ix2 = findfirst('[', str) + 1, findfirst(']', str) - 1
parse.(Float32, split(replace(str[ix1:ix2], 'f'=>'e'), ','))

Assume the maximum number of elements in your position vectors is 3. Then, no matter how many elements the position vectors actually have in a specific simulation, I would save them as if they were 3D, with the unused dimensions filled with dummy values (usually 0?). It may increase the size of your CSV file for 1D or 2D simulation data, but at least the columns are consistent across all possible dimensions.