Converting CSV string values to floats (Python to Julia)

George_Korpas · January 31, 2021, 3:27pm

I am very new to Julia and I am struggling with the following. I have an HDF5 file full of data in Python, which I convert to a numpy DataFrame and then save to a CSV file. I then want to use this file in Julia, but it looks quite different there than in Python:

[Image: Screen Shot 2021-01-31 at 11.25.54 PM]

Python’s “0” index is converted to a header when I read the CSV file in Julia.
In Python I have a 100x1 dataframe, but in Julia this becomes 99x2.

See attached images.

Furthermore, in Julia the data is converted into strings, which means I cannot really use it. Can somebody help me figure this out?

I load the file in Julia using:

data = CSV.read("data.csv", DataFrame);

I also attach the csv here: https://drive.google.com/file/d/10KbvkyYv1Bmg4Kez2oRyuXiCgunDDUtH/view?usp=sharing

stillyslalom · January 31, 2021, 4:08pm

You should be able to load the HDF5 file directly using HDF5.jl. If that’s not an option, it’d help if you could provide the raw CSV file (or at least the first few rows) and whatever commands you’re currently using to load it. See this post for pointers on how to make your question easier to answer:

George_Korpas · January 31, 2021, 4:15pm

I just did as you said. I use the standard data = CSV.read("data.csv", DataFrame); command and I linked the csv file.

stillyslalom · January 31, 2021, 4:38pm

Your CSV file isn’t publicly visible with your current sharing settings. If you already have a GitHub account, the easiest thing is to upload it to gist.github.com.

George_Korpas · January 31, 2021, 5:11pm

Check again please. It is now.

pdeffebach · January 31, 2021, 5:23pm

Use the keyword argument header = false in CSV.read.
Figure out how to parse the column into objects of type Complex{Float64}, i.e.

julia> y = "65.4 + 98.2im"
"65.4 + 98.2im"

julia> parse(Complex{Float64}, y)
65.4 + 98.2im

What you need to figure out is how to “clean” the values of your data frame so that they work with parse.

This involves

strip to remove extra white space
using replace to remove the ( and ).

Here is a full example

julia> using CSV, Chain, DataFrames

julia> df = CSV.read("data.csv", DataFrame; delim = ",", header = false);

julia> function clean_parse_complex(x)
           c = @chain x begin
               strip()
               replace("(" => "")
               replace(")" => "")
               parse(Complex{Float64}, _)
           end
       end
clean_parse_complex (generic function with 1 method)

julia> df.c = clean_parse_complex.(df.Column1);

One thing to note, though, is it looks like all your values are real! They all have 0 imaginary component. Maybe you are encoding things as complex in python when that isn’t necessary?
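On the Python side, one way to check this before exporting is to test whether every imaginary part is zero and, if so, keep only the real parts. This is only a minimal sketch using plain Python complex numbers; the helper name real_if_possible and the tol parameter are made up for illustration and are not part of numpy or pandas.

```python
def real_if_possible(values, tol=0.0):
    """Return plain floats when every imaginary part is within tol of zero.

    Otherwise return the values unchanged, as a list of complex numbers.
    (Hypothetical helper, written only for this illustration.)
    """
    if all(abs(z.imag) <= tol for z in values):
        return [z.real for z in values]
    return list(values)


# Example data: one list that is really real-valued, one that is not.
purely_real = [complex(1.5, 0.0), complex(-2.0, 0.0)]
mixed = [complex(1.5, 0.0), complex(0.0, 0.25)]

print(real_if_possible(purely_real))  # floats only
print(real_if_possible(mixed))        # left as complex numbers
```

If the data turns out to be real, it can be written out as ordinary floats, and then no string cleaning is needed when reading it back.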

George_Korpas · January 31, 2021, 5:25pm

This is part of my data. I have more with complex values. I will try and see if it works. Not sure how I could figure all this out without assistance here.

pdeffebach · January 31, 2021, 5:27pm

The process of “figuring it out” should be the same as in any language. If you want to parse a string, you have to clean it up a bit first.

From there it’s just a matter of using the ? help tool in the command line (? parse, ? replace) to nail down the syntax.

George_Korpas · February 2, 2021, 6:57am

Thanks a lot. I am trying to use the function that you defined above on a second dataset, which actually involves imaginary parts. However, I get the error:
ArgumentError: expected trailing "im", found only "m"
Stacktrace:
 [1] tryparse_internal(::Type{Complex{Float64}}, ::String, ::Int64, ::Int64, ::Bool) at ./parse.jl:316
 [2] parse(::Type{Complex{Float64}}, ::String) at ./parse.jl:378
 [3] clean_parse_complex(::String) at ./In[46]:11
 [4] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [5] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [6] getindex at ./broadcast.jl:575 [inlined]
 [7] macro expansion at ./broadcast.jl:932 [inlined]
 [8] macro expansion at ./simdloop.jl:77 [inlined]
 [9] copyto! at ./broadcast.jl:931 [inlined]
 [10] copyto! at ./broadcast.jl:886 [inlined]
 [11] copy at ./broadcast.jl:862 [inlined]
 [12] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(clean_parse_complex),Tuple{Array{String,1}}}) at ./broadcast.jl:837
 [13] top-level scope at In[46]:16
 [14] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091

I cannot understand why it works for the first dataset but not the second one. I attach here the second dataset in case you can provide some assistance.

stillyslalom · February 2, 2021, 7:15am

For one thing, it has a header - you’ll need to change the header keyword argument for CSV.read from false to true (or simply leave it out, since reading a header row is the default).

julia> df = CSV.read("density_1.csv", DataFrame; delim=',');

julia> function clean_parse_complex(x)
           c = @chain x begin
               strip()
               replace("(" => "")
               replace(")" => "")
               replace("j" => "im")
               parse(Complex{Float64}, _)
           end
       end

julia> clean_parse_complex.(df."Element 01 of state |0>")
100-element Vector{ComplexF64}:
   -0.02587890624999966 - 0.013916015624999997im
  -0.005371093750000201 - 0.00024414062499999087im
   -0.02124023437499952 + 0.021728515624999976im
 -0.0031738281249997355 + 0.013427734375000016im
   -0.02172851562499974 + 0.001708984374999981im
...

George_Korpas · February 2, 2021, 7:32am

Thanks. So, I figured it out myself. Initially I was saving my data as CSV using the following:

np.savetxt("density_3.csv", df, delimiter=",", fmt='%s')

in the example I initiated this topic with. Then I realized that the rest of my dataset was saved as csv files using:

df.to_csv('ata.csv', index=False)

As mentioned above, the latter way indeed includes a header that I did not initially know how to get rid of. But it all works now. Quite frustrating for a beginner.
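A side note for later readers, offered as a sketch and not as what was actually run in this thread: Python's text form of a complex number has parentheses and a j suffix, which is the shape of the strings that had to be cleaned above. One way to avoid the cleaning entirely is to write the real and imaginary parts as two plain float columns. The example below uses only the standard library; the function name write_complex_csv and the column names re and im are assumptions made for illustration.

```python
import csv
import io


def write_complex_csv(values, handle):
    """Write complex numbers as two float columns named 're' and 'im'.

    (Hypothetical helper; the column names are an arbitrary choice.)
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["re", "im"])  # explicit header row
    for z in values:
        writer.writerow([repr(z.real), repr(z.imag)])


values = [complex(-0.5, 0.25), complex(1.0, 0.0)]

# Python's own text form of a complex number: parentheses and a "j" suffix.
as_text = str(values[0])
print(as_text)

# Write to an in-memory buffer here; a real file opened with
# open(path, "w", newline="") would work the same way.
buffer = io.StringIO()
write_complex_csv(values, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a file like this, each column holds ordinary numbers, so a CSV reader can type them as floats directly and the complex values can be rebuilt from the two columns afterwards.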

carstenbauer · February 2, 2021, 9:25am

Beginning to learn a programming language by parsing strings doesn’t sound like a great idea to me. (And if you do, I don’t see how Julia is any more difficult in that regard than other languages.)

As has already been pointed out above, it is probably a mistake to convert your data from HDF to CSV and then parse it in Julia. Why not use HDF5.jl directly, just like you’re using h5py in Python? This will preserve all the proper data types, and string parsing won’t be necessary at all.

Alternatively, you could try to convert a python data frame (pandas at least) directly to a Julia DataFrame via PyCall.jl and/or Pandas.jl. This would again avoid any csv business.

George_Korpas · February 2, 2021, 9:50am

Thanks. I've been using symbolic tools for years, and programming for them for years as well, but only now do I have to deal with proper data. So sometimes I am unaware even of the proper terminology, e.g. parsing.

lungben · February 2, 2021, 11:23am

You can use

to directly import Pandas-style HDF files in Julia - for HDF, it calls Pandas under the hood (with Pandas.jl), therefore it should support all Python data types (including e.g. pickled strings).
