Converting CSV string values to floats (Python to Julia)

I am very new to Julia and I am struggling with the following. I have an HDF5 file in Python full of data, and I convert it to a Python NumPy DataFrame, which I then save to a CSV file. I then want to use this file in Julia, but it looks very different than it does in Python:

[Image: Screen Shot 2021-01-31 at 11.25.54 PM]

  1. Python’s “0” index is converted to a header when I read the csv file in Julia
  2. In Python I have a 100x1 DataFrame, but in Julia it gets converted to 99x2.

See attached images.

Furthermore, in Julia the data is converted into strings, which means I cannot really use it. Can somebody help me figure this out?

I load the file in Julia using:

data = CSV.read("data.csv", DataFrame);

I also attach the csv here: https://drive.google.com/file/d/10KbvkyYv1Bmg4Kez2oRyuXiCgunDDUtH/view?usp=sharing

You should be able to load the HDF5 file directly using HDF5.jl. If that’s not an option, it’d help if you could provide the raw CSV file (or at least the first few rows) and whatever commands you’re currently using to load it. See this post for pointers on how to make your question easier to answer:


I just did as you said. I use the standard data = CSV.read("data.csv", DataFrame); command and I linked the csv file.

Your CSV file isn’t publicly visible with your current sharing settings. If you already have a GitHub account, the easiest thing is to upload it to gist.github.com.

Check again please. It is now.

  1. Use the keyword argument header = false in CSV.read
  2. Figure out how to parse the column to an object of type Complex{Float64}, i.e.
julia> y = "65.4 + 98.2im"
"65.4 + 98.2im"

julia> parse(Complex{Float64}, y)
65.4 + 98.2im

What you need to figure out is how to “clean” the values of your data frame so that they work with parse.

This involves

  1. strip to remove extra white space
  2. using replace to remove the ( and ).
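For context on where the parentheses come from, here is a minimal Python sketch. It assumes the CSV was produced by formatting Python complex numbers as strings; the sample value and the `clean` helper are hypothetical and only mirror the two cleaning steps listed above.

```python
# Hypothetical sample value; any Python complex number behaves the same way.
z = complex(65.4, 98.2)

# '%s' formatting uses str(), which wraps complex numbers in parentheses
# and writes the imaginary unit as "j".
raw = "%s" % z
print(raw)  # (65.4+98.2j)

def clean(text):
    """Strip surrounding whitespace and drop the parentheses."""
    return text.strip().replace("(", "").replace(")", "")

cleaned = clean("  " + raw + " ")
print(cleaned)  # 65.4+98.2j
```

The same two steps are what the Julia function in the full example performs before calling parse.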

Here is a full example

julia> using CSV, Chain, DataFrames

julia> df = CSV.read("data.csv", DataFrame; delim = ",", header = false);

julia> function clean_parse_complex(x)
           c = @chain x begin
               strip() 
               replace("(" => "")
               replace(")" => "")
               parse(Complex{Float64}, _)
           end
       end
clean_parse_complex (generic function with 1 method)

julia> df.c = clean_parse_complex.(df.Column1);

One thing to note, though, is it looks like all your values are real! They all have 0 imaginary component. Maybe you are encoding things as complex in python when that isn’t necessary?
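If the data really is all real, one option is to drop the complex encoding on the Python side before writing the file. The following is only a sketch using the standard library's csv module; the function name `write_values` and the sample values are made up for illustration, and the original data may well need a different layout.

```python
import csv
import io

def write_values(values, out):
    """Write one value per row, as plain floats when no value has an imaginary part."""
    zs = [complex(v) for v in values]
    all_real = all(z.imag == 0 for z in zs)
    writer = csv.writer(out, lineterminator="\n")
    for z in zs:
        # repr() keeps full float precision; complex values keep Python's "(a+bj)" text.
        writer.writerow([repr(z.real) if all_real else str(z)])
    return all_real

buf = io.StringIO()
write_values([complex(65.4, 0.0), complex(-1.25, 0.0)], buf)
print(buf.getvalue())
```

With plain floats in the file, the column should no longer need any string cleaning on the reading side.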


This is part of my data. I have more with complex values. I will try and see if it works. Not sure how I could figure all this out without assistance here.

The process of “figuring it out” should be the same as in any language. If you want to parse a string, you have to clean it up a bit first.

From there it’s just a matter of using the ? help tool in the command line (? parse, ? replace) in order to nail the syntax.


Thanks a lot. I am trying to use the function that you defined above for a second dataset, which actually involves imaginary parts. However, I get the error:
ArgumentError: expected trailing "im", found only "m"
Stacktrace:
 [1] tryparse_internal(::Type{Complex{Float64}}, ::String, ::Int64, ::Int64, ::Bool) at ./parse.jl:316
 [2] parse(::Type{Complex{Float64}}, ::String) at ./parse.jl:378
 [3] clean_parse_complex(::String) at ./In[46]:11
 [4] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [5] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [6] getindex at ./broadcast.jl:575 [inlined]
 [7] macro expansion at ./broadcast.jl:932 [inlined]
 [8] macro expansion at ./simdloop.jl:77 [inlined]
 [9] copyto! at ./broadcast.jl:931 [inlined]
 [10] copyto! at ./broadcast.jl:886 [inlined]
 [11] copy at ./broadcast.jl:862 [inlined]
 [12] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(clean_parse_complex),Tuple{Array{String,1}}}) at ./broadcast.jl:837
 [13] top-level scope at In[46]:16
 [14] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091

I cannot understand why it works for the first dataset but not the second one. I attach here the second dataset in case you can provide some assistance.

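For one thing, it has a header - you’ll need to change the header keyword argument for CSV.read from false to true, or simply omit it as in the call below. Otherwise the header text is read as the first data value, which parse cannot handle.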

julia> df = CSV.read("density_1.csv", DataFrame; delim=',');

julia> function clean_parse_complex(x)
           c = @chain x begin
               strip()
               replace("(" => "")
               replace(")" => "")
               replace("j" => "im")
               parse(Complex{Float64}, _)
           end
       end

julia> clean_parse_complex.(df."Element 01 of state |0>")
100-element Vector{ComplexF64}:
   -0.02587890624999966 - 0.013916015624999997im
  -0.005371093750000201 - 0.00024414062499999087im
   -0.02124023437499952 + 0.021728515624999976im
 -0.0031738281249997355 + 0.013427734375000016im
   -0.02172851562499974 + 0.001708984374999981im
...

Thanks. So, I figured it out myself. Initially I was saving my data as CSV using the following:

np.savetxt("density_3.csv", df, delimiter=",", fmt='%s')

in the example I initiated this topic with. Then I realized that the rest of my dataset was saved as csv files using:

df.to_csv('ata.csv', index=False)

As mentioned above, the latter way does indeed include a header that I did not know how to get rid of initially. But it all works now. Quite frustrating for a beginner.
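To make the header mismatch concrete, here is a small Python sketch using only the standard library. The two strings are hypothetical miniatures of the two kinds of file (one written with a header row, one without), and `read_column` is an illustrative helper, not part of any library.

```python
import csv
import io

# Hypothetical miniature of the two files: same data, with and without a header row.
with_header = "Element 01 of state |0>\n(1+2j)\n(3-4j)\n"
without_header = "(1+2j)\n(3-4j)\n"

def read_column(text, has_header):
    """Return (column name or None, list of raw string values)."""
    rows = [row[0] for row in csv.reader(io.StringIO(text)) if row]
    if has_header:
        return rows[0], rows[1:]
    return None, rows

# Matching settings: both files yield the same two values.
print(read_column(with_header, has_header=True))
print(read_column(without_header, has_header=False))

# Mismatched settings: a data value is lost to the header,
# or the header text leaks into the data.
print(read_column(without_header, has_header=True))
print(read_column(with_header, has_header=False))
```

The two mismatched cases correspond to the two symptoms in this thread: a row that went missing from the first file, and a non-numeric string that could not be parsed in the second.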

Beginning to learn a programming language by parsing strings doesn’t sound like a great idea to me. (And if you do, I don’t see how Julia is any more difficult in that regard than other languages.)

As has already been pointed out above, it is probably a mistake to convert your data from HDF5 to CSV and then parse it in Julia. Why not use HDF5.jl directly, like you’re using h5py in Python? This will preserve all the proper data types, and string parsing won’t be necessary at all.

Alternatively, you could try to convert a python data frame (pandas at least) directly to a Julia DataFrame via PyCall.jl and/or Pandas.jl. This would again avoid any csv business.


Thanks. I’ve been using symbolic tools for years, and programming for them for years as well, but only now do I have to deal with proper data. So sometimes I am unaware even of the proper terminology, e.g. parsing.

You can use

to directly import Pandas-style HDF files in Julia - for HDF, it calls Pandas under the hood (with Pandas.jl), therefore it should support all Python data types (including e.g. pickled strings).