How to handle weakrefstrings when importing a CSV file

JasonHall · June 1, 2020, 2:41pm

I am using Julia 1.4 and CSV 0.62. I have a CSV file that has a column with some JSON in it (e.g. {"type":"Point","coordinates":[-3.73610686,58.3917902]})`. When I look at the data type I am told it is of type String. If I pull out one row’s value like this:

s = mydf[:geom][1]

I get an escaped string of type String that I can do normal string orperations on, like parse it into JSON using JSON.parse(). However, if I try to do any operation on the data in the dataframe like this (for instance):

mydf[:x] = JSON.parse(mydf[:geom])["coordinates"][1] #extract the first coordinate and store in a column named :x

then I get the following error:
ERROR: MethodError: no method matching parse(::WeakRefStrings.StringArray{String,1})

I get this same error if I try to replace the double quotes with anything. I also get the error if I manually strip out all the JSON parts of the string in Excel and leave only tuple. In other forums I see people saying to set the keyword argument weakrefstrings=false. However, that is no longer a valid keyword argument so I can’t do that.

Can anybody suggest how I can handle this? Can I coerce the data into being a proper string perhaps?

nilshg · June 1, 2020, 3:22pm

Shouldn’t you broadcast JSON.parse if you want to parse the whole column? I.e. JSON.parse.(mydf[:geom])?

JasonHall · June 1, 2020, 4:12pm

True, but unfortunately that doesn’t give me what I want. It would give a dictionary object in each row when I just want the first element of the coordinates. I might be able to do it as a two step process (it seems a shame I can’t do it in one operation), but while I understand how to broadcast when I have an operator, I don’t understand how to do it when I don’t:

mydf[:geomdict] = JSON.parse.(mydf[:geom]  # gives me a dictionary and does get around the weakref problem.
mydf[:x] = mydf[:geomdict]["geometry"][1]  # get the x coordinate - but how do I broadcast here?
mydf[:y] = mydf[:geomdict]["geometry"][2]  # get the y coordinate - but ditto,

In the above examples .= throws an error as column :x doesn’t exist. Where does the dot go in this case (line 2 or 3)?

I also notice that bizarrely JSON.parse gives me a dictionary as expected but as element 1 of an array??? So lines 2 and 3 above have to be something like (missing dot notwithstanding):

mydf[:x] = mydf[:geomdict][1]["geometry"][1]  # get the x coordinate - but how do I broadcast here?
mydf[:y] = mydf[:geomdict][1]["geometry"][2]  # get the y coordinate - but ditto,

nilshg · June 1, 2020, 4:29pm

This is all a bit harder without an MWE to actually try out, but essentially square bracket indexing is syntactic sugar for getindex, so you can always do something like:

julia> dict_arr = [Dict("a" => 1), Dict("a" => 2)]
2-element Array{Dict{String,Int64},1}:
 Dict("a" => 1)
 Dict("a" => 2)

julia> getindex.(dict_arr, "a") # broadcast getindex
2-element Array{Int64,1}:
 1
 2

JasonHall · June 1, 2020, 5:01pm

Re MWE: a very fair point. Here’s the first few rows (minus a lot of columns not needed for this discussion):

id	value	year	geometry
1	0.445591241121292	2015	{“type”:“Point”,“coordinates”:[-3.7361068677412166,58.39179006274112]}
2	0.441427767276764	2014	{“type”:“Point”,“coordinates”:[-3.7371190836803483,58.388378850435224]}
3	0.54601401090622	2015	{“type”:“Point”,“coordinates”:[-3.736369954703371,58.39074663309461]}
4	0.473496675491333	2016	{“type”:“Point”,“coordinates”:[-3.73638333200653,58.39090716073254]}
5	0.658025324344635	2017	{“type”:“Point”,“coordinates”:[-3.7366776326760576,58.39040774141455]}
6	0.508873164653778	2015	{“type”:“Point”,“coordinates”:[-3.736187131560178,58.390809060509355]}
7	0.403048127889633	2019	{“type”:“Point”,“coordinates”:[-3.7353265250568626,58.38924391603958]}
8	0.577482640743256	2019	{“type”:“Point”,“coordinates”:[-3.7391256791544,58.38983251737866]}
9	0.607733011245728	2015	{“type”:“Point”,“coordinates”:[-3.738871510394354,58.38889610615743]}
10	0.642166316509247	2014	{“type”:“Point”,“coordinates”:[-3.7384033047837417,58.38941336187963]}
11	0.591114699840546	2015	{“type”:“Point”,“coordinates”:[-3.738430059390062,58.389694285246]}
12	0.447087287902832	2016	{“type”:“Point”,“coordinates”:[-3.735915126395919,58.3906619101746]}

I need to get the coordinates into their own separate columns (one column for each member of the coordinate pairs).

nilshg · June 1, 2020, 5:11pm

Thanks for that, but I’m struggling to get this into a DataFrame format - you could try getting it into a copy-pasteable string representation in the way I’ve suggested in this SO answer

JasonHall · June 1, 2020, 5:20pm

Well, I’ve just learnt something else and here is the data:

“id,value,year,geometry\n1,0.445591241121292,2015,"{""type"":""Point"",""coordinates"":[-3.7361068677412166,58.39179006274112]}"\n2,0.441427767276764,2014,"{""type"":""Point"",""coordinates"":[-3.7371190836803483,58.388378850435224]}"\n3,0.54601401090622,2015,"{""type"":""Point"",""coordinates"":[-3.736369954703371,58.39074663309461]}"\n4,0.473496675491333,2016,"{""type"":""Point"",""coordinates"":[-3.73638333200653,58.39090716073254]}"\n5,0.658025324344635,2017,"{""type"":""Point"",""coordinates"":[-3.7366776326760576,58.39040774141455]}"\n6,0.508873164653778,2015,"{""type"":""Point"",""coordinates"":[-3.736187131560178,58.390809060509355]}"\n7,0.403048127889633,2019,"{""type"":""Point"",""coordinates"":[-3.7353265250568626,58.38924391603958]}"\n8,0.577482640743256,2019,"{""type"":""Point"",""coordinates"":[-3.7391256791544,58.38983251737866]}"\n9,0.607733011245728,2015,"{""type"":""Point"",""coordinates"":[-3.738871510394354,58.38889610615743]}"\n10,0.642166316509247,2014,"{""type"":""Point"",""coordinates"":[-3.7384033047837417,58.38941336187963]}"\n11,0.591114699840546,2015,"{""type"":""Point"",""coordinates"":[-3.738430059390062,58.389694285246]}"\n12,0.447087287902832,2016,"{""type"":""Point"",""coordinates"":[-3.735915126395919,58.3906619101746]}"\n”

Henrique_Becker · June 1, 2020, 6:01pm

If you do not use triple quotes blocks (i.e., ```text```) then your "" are replaced “”, and so will break the code of anyone trying to parse from a cut-and-paste.

nilshg · June 1, 2020, 6:07pm

Great thanks, what you want is:

julia> first.(getindex.(JSON.parse.(data.geometry), "coordinates"))
12-element Array{Float64,1}:
 -3.7361068677412166
 -3.7371190836803483
 -3.736369954703371
 -3.73638333200653
 -3.7366776326760576
 -3.736187131560178
 -3.7353265250568626
 -3.7391256791544
 -3.738871510394354
 -3.7384033047837417
 -3.738430059390062
 -3.735915126395919

julia> last.(getindex.(JSON.parse.(data.geometry), "coordinates"))
12-element Array{Float64,1}:
 58.39179006274112
 58.388378850435224
 58.39074663309461
 58.39090716073254
 58.39040774141455
 58.390809060509355
 58.38924391603958
 58.38983251737866
 58.38889610615743
 58.38941336187963
 58.389694285246
 58.3906619101746

While first and last work in this case, you could generalize this by replacing it with getindex.(getindex.(JSON.parse.(data.geometry), "coordinates"), n) where n is the index you need (i.e. 1 or 2 in your case).

JasonHall · June 1, 2020, 6:21pm

Awesome! Thank you so much. So I just needed to dot more! I had assumed that you only dot the first one because the JSON.parse in this case would be just on the “current” getindex. row. I was getting errors trying that which said the column lengths need to match. Your example explains why. The inner dot makes an array of all rows parsed as a dictionary. The ‘middle’ dot makes an array of the geometry index and the outer dot then broadcasts over all the elements of that array. That is very helpful to see this as a nested example.

haberdashPI · June 1, 2020, 6:29pm

The ‘middle’ dot makes an array of the geometry index and the outer dot then broadcasts over all the elements of that array.

Note that no intermediate arrays are created when you use multiple dots in a single broadcasting expression. That is, both of the two expressions below avoid creating any intermediate arrays.

first.(getindex.(JSON.parse.(data.geometry), "coordinates"))
broadcast(x -> first(getindex(JSON.parse(x), "coordinates")), data.geometry)

It’s a very handy feature of the “dot” syntax.

JasonHall · June 1, 2020, 9:11pm

Aah! OK that sounds much more efficient. Thanks again!

Interestingly, in all the tutorials on Julia I have done, this is the first time I have seen broadcast as a function name. The examples of broadcasting always use the dotted syntax and never have more than one dot. I find the form using the broadcast function more readable and intuitive just because there are so many dots with two different meanings in the first example - presumably that’s just a syntactic style thing and there is no performance hit of using one over the other?

tlienart · June 2, 2020, 6:42am

You can also use @. in front of an expression (or even in front of a line) and everything gets dotted e.g.:

x = range(0, 1, length=100)
y = @. exp(x * cos(x^2))  # equiv to exp.(x .* cos.(x.^2))

( ?@. for more information)

JasonHall · June 2, 2020, 8:16am

That’s brilliant! I came to this forum with a question about weakrefstrings but am walking away with so much more practical info.

Topic		Replies	Views
CSV.jl error - cannot convert an object of type WeakRefString General Usage	13	1794	October 13, 2018
Query.jl fails - Is WeakRefString string comparison broken? General Usage query	0	480	October 17, 2017
Nullables - why? and how? New to Julia	6	2439	December 19, 2017
How to change value of Nullable{WeakRefString{UInt8}} using CSV package? Data data	4	1323	September 26, 2017
CSV woes and SubString documentation New to Julia	9	1480	December 24, 2017

How to handle weakrefstrings when importing a CSV file

Related topics