Parse string representation of a vector into floats

ufechner7 · September 4, 2024, 9:43am

How can I parse the following string:

"Float32[0.0, 8.455541, 16.61883, 24.443338, 31.901924, 38.97408, 45.642822, 46.02448, 47.268967, 47.10146, 47.10146]"

as a vector of Float32 values?

fredrikekre · September 4, 2024, 9:49am

julia> str = "Float32[0.0, 8.455541]";

julia> m = match(r"\[(.*)\]", str)
RegexMatch("[0.0, 8.455541]", 1="0.0, 8.455541")

julia> strs = split(m[1], ','; keepempty=false)
2-element Vector{SubString{String}}:
 "0.0"
 " 8.455541"

julia> parse.(Float32, strs)
2-element Vector{Float32}:
 0.0
 8.455541

(Why do you have a string like that in the first place?)

screw_dog · September 4, 2024, 10:09am

Just to add: the strongly discouraged way to do this is to use Meta.parse and eval, which let you treat a string as Julia code. This is almost never the best way to do things.
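For illustration only, here is a minimal sketch of what the Meta.parse plus eval route looks like. The shortened input string is an assumption chosen for the example; the point is that the string is executed as code, so the same two calls would run anything else the string happened to contain.

```julia
# Illustration only: not recommended for data read from files.
str = "Float32[0.0, 3.5440563f-28, 2.8241575]"

ex = Meta.parse(str)   # builds an expression; nothing has been executed yet
v = eval(ex)           # evaluates the expression as Julia code in the current module

# Because the input is Julia syntax, the "f" exponent needs no special handling,
# but any other code in the string would be evaluated just the same.
```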

ufechner7 · September 4, 2024, 11:28am

I am importing a CSV file which itself was exported from an .arrow file.

ufechner7 · September 4, 2024, 12:14pm

This mostly works, but sometimes it does not. Example:

"Float32[0.0, 3.5440563f-28, -8.439492f-25, 7.1634285f-22, -4.307094f-19, 1.672701f-16, 1.2102875f-14, 5.515009f-14, 8.999296f-14, 2.8241575, -2.8241575]"

This results in:

ERROR: LoadError: ArgumentError: cannot parse " 3.5440563f-28" as Float32

What would be a good way to avoid this error?

stevengj · September 4, 2024, 12:51pm

str = replace(str, 'f'=>'e'). You could also use Parsers.jl (JuliaData/Parsers.jl on GitHub), which should support this format directly.

See JuliaLang/julia issue #5690, "`parse` can't parse our own Float32 output format".
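A minimal sketch of the replace approach using only Base, combined with the regex match from earlier in the thread. The helper name `fix_exponent` is hypothetical. One caveat, stated as an assumption about the input: if the data can contain `Inf`, a plain `'f'=>'e'` replacement would also rewrite the `f` in `Inf`, so this sketch only replaces an `f` that sits between a digit and an exponent.

```julia
# Hypothetical helper (not from the thread or any package): rewrite Julia's
# Float32 exponent marker ("1.0f-3" -> "1.0e-3") without touching "Inf".
fix_exponent(s) = replace(s, r"(?<=\d)f(?=[-+]?\d)" => "e")

str = "Float32[0.0, 3.5440563f-28, Inf, 2.8241575]"
m = match(r"\[(.*)\]", str)

# Split on commas, strip surrounding whitespace, then parse each piece.
vals = parse.(Float32, strip.(split(fix_exponent(m[1]), ',')))
```

If the data never contains `Inf`, the simpler `replace(str, 'f'=>'e')` does the same job.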

StefanKarpinski · September 4, 2024, 12:51pm

Why is Float32[ in the CSV file? Wouldn’t it just be the actual float values as strings?

ufechner7 · September 4, 2024, 1:41pm

Because I exported it from a StructArray using CSV.jl

github.com/ufechner7/KiteUtils.jl/blob/93e5e5e20c5db1b9ba9e0e40c8d4426d7a26af95/src/KiteUtils.jl#L587C1-L593C4

function export_log(flight_log; path="")
    if path == ""
        path = DATA_PATH[1]
    end
    filename = joinpath(path, flight_log.name) * ".csv"
    CSV.write(filename, flight_log.syslog)
end

Correction:
syslog is a struct array and some of the elements of the struct are Vectors.
CSV.write is writing them in this way.

ufechner7 · September 4, 2024, 1:55pm

For future reference:

using Parsers

function parse_vector(str)
    m = match(r"\[(.*)\]", str)
    strs = split(m[1], ','; keepempty=false)
    Parsers.parse.(Float32, strs)
end

stevengj · September 4, 2024, 5:00pm

If you are controlling the format, then I would use a different format from CSV for this kind of data. Something like JSON or JLD/HDF5 seems more suited.

ufechner7 · September 4, 2024, 5:04pm

Well, by default I use the Apache Arrow format. But it causes some issues when the format changes (when I add or remove columns). In that case, CSV is a more generic choice that still works after I have added columns. So if I have an old log file, I export it as CSV, import it into the new format, and then save it in the new format as an .arrow file.

I needed this function for this purpose.

And I am very happy to have a 2.5 dimensional log file format where columns can contain vectors. Easy to process and filter and export to Python, for example.

StefanKarpinski · September 4, 2024, 5:45pm

Well, except for this problem. I think that @stevengj’s point is that in this case you’re saving more complex nested data structures, which makes CSV awkward.

If you’re going to have CSVs where some columns are arrays, but not get any more complex than that, it might be better to custom print those in a format that’s simpler to parse instead of relying on Julia’s value printing, e.g. represent a vector of values as a string containing delimiter-separated values.
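A minimal sketch of what such a custom representation could look like, assuming a semicolon as the inner delimiter so that it does not clash with the CSV comma. The names `encode_vector` and `decode_vector` are hypothetical and not part of CSV.jl or any other package.

```julia
# Hypothetical helpers (illustrative names, not from any package).

# Encode a vector as a single delimiter-separated string for one CSV cell.
# Converting to Float64 first avoids Float32's "f" exponent in the output.
encode_vector(v::AbstractVector{Float32}; delim=';') = join(Float64.(v), delim)

# Decode such a cell back into a Vector{Float32}.
# The typed comprehension keeps the element type even for an empty cell.
function decode_vector(cell::AbstractString; delim=';')
    return Float32[parse(Float32, s) for s in split(cell, delim; keepempty=false)]
end

v = Float32[0.0, 3.5440563f-28, 2.8241575]
cell = encode_vector(v)
@assert decode_vector(cell) == v   # round-trips for this example
```

The trade-off in this sketch is that the Float64 representation produces longer strings than Julia's Float32 printing, in exchange for output that plain `parse` can read.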

If the structure will be arbitrarily nested and complex, then you’re going to keep hitting these issues and will definitely want a standard format like JSON or HDF5, both of which can be easily read in other languages, but it sounds like you’re not in that case.

ufechner7 · September 4, 2024, 6:04pm

Well, I used Google protocol buffer messages for our log files in the past to be fully n-dimensional, but in the end, if you want to plot or analyze the data you need 2-dimensional data with a fixed sampling rate.

My system is represented as a particle system with a small number of particles, like 11, and I have columns for the X, Y and Z position vectors of these particles, and one column for the quaternion that represents the orientation of the kite.

No need for deeply nested structures.

And I like Julia's way of printing vectors: easily readable and clear. You can round-trip it because the type is included. OK, the “f” instead of “e” for the exponent in scientific notation is somewhat confusing.

StefanKarpinski · September 4, 2024, 8:05pm

You could also explode that into multiple scalar columns.

ufechner7 · September 4, 2024, 8:41pm

Very bad idea. Well, for the orientation that would be an option, but not for the position vectors, because the number of elements varies. And I do not want to have a different number of columns depending on the simulation that I run. Finally, CSV is not the main format I use; the main format is Apache Arrow, and in the Arrow format you can have vectors in a column in a very well-defined way. And in the CSV format I want to have the same columns as in the Arrow files, otherwise automatic conversion is not possible.

rafael.guerra · September 4, 2024, 10:04pm

It is not so much a bad idea as a badly defined problem to begin with.

rafael.guerra · September 4, 2024, 10:32pm

Another option:

str = "Float32[0.0, 3.5440563f-28, -8.439492f-25, 7.1634285f-22, -4.307094f-19, 1.672701f-16, 1.2102875f-14, 5.515009f-14, 8.999296f-14, 2.8241575, -2.8241575]"

ix1, ix2 = findfirst('[', str) + 1, findfirst(']', str) - 1
parse.(Float32, split(replace(str[ix1:ix2], 'f'=>'e'), ','))

liuyxpp · September 5, 2024, 1:04am

Assume the maximum number of elements in your position vectors is 3. Then, no matter how many elements the position vectors actually have in a specific simulation, I would save them as if they were 3D, with the remaining dimensions holding dummy values (usually 0?). It may increase the size of your CSV file for 1D or 2D simulation data, but at least the files are consistent across all possible dimensions.
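A minimal sketch of that padding idea, assuming zero as the dummy value. The helper name `pad_to` is hypothetical.

```julia
# Hypothetical helper: pad a vector with a fill value up to a fixed length n.
function pad_to(v::AbstractVector{T}, n::Integer; fillvalue::T=zero(T)) where {T}
    length(v) <= n || throw(ArgumentError("vector has more than $n elements"))
    return vcat(v, fill(fillvalue, n - length(v)))
end

p1 = pad_to(Float32[1.5], 3)        # 1D position stored with 3 components
p2 = pad_to(Float32[1.5, 2.5], 3)   # 2D position stored with 3 components
```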
