How to handle weakrefstrings when importing a CSV file

I am using Julia 1.4 and CSV 0.62. I have a CSV file that has a column with some JSON in it (e.g. {"type":"Point","coordinates":[-3.73610686,58.3917902]})`. When I look at the data type I am told it is of type String. If I pull out one row’s value like this:

s = mydf[:geom][1]

I get an escaped string of type String that I can do normal string orperations on, like parse it into JSON using JSON.parse(). However, if I try to do any operation on the data in the dataframe like this (for instance):

mydf[:x] = JSON.parse(mydf[:geom])["coordinates"][1] #extract the first coordinate and store in a column named :x

then I get the following error:
ERROR: MethodError: no method matching parse(::WeakRefStrings.StringArray{String,1})

I get this same error if I try to replace the double quotes with anything. I also get the error if I manually strip out all the JSON parts of the string in Excel and leave only tuple. In other forums I see people saying to set the keyword argument weakrefstrings=false. However, that is no longer a valid keyword argument so I can’t do that.

Can anybody suggest how I can handle this? Can I coerce the data into being a proper string perhaps?

Shouldn’t you broadcast JSON.parse if you want to parse the whole column? I.e. JSON.parse.(mydf[:geom])?

True, but unfortunately that doesn’t give me what I want. It would give a dictionary object in each row when I just want the first element of the coordinates. I might be able to do it as a two step process (it seems a shame I can’t do it in one operation), but while I understand how to broadcast when I have an operator, I don’t understand how to do it when I don’t:

mydf[:geomdict] = JSON.parse.(mydf[:geom]  # gives me a dictionary and does get around the weakref problem.
mydf[:x] = mydf[:geomdict]["geometry"][1]  # get the x coordinate - but how do I broadcast here?
mydf[:y] = mydf[:geomdict]["geometry"][2]  # get the y coordinate - but ditto,

In the above examples .= throws an error as column :x doesn’t exist. Where does the dot go in this case (line 2 or 3)?

I also notice that bizarrely JSON.parse gives me a dictionary as expected but as element 1 of an array??? So lines 2 and 3 above have to be something like (missing dot notwithstanding):

mydf[:x] = mydf[:geomdict][1]["geometry"][1]  # get the x coordinate - but how do I broadcast here?
mydf[:y] = mydf[:geomdict][1]["geometry"][2]  # get the y coordinate - but ditto,

This is all a bit harder without an MWE to actually try out, but essentially square bracket indexing is syntactic sugar for getindex, so you can always do something like:

julia> dict_arr = [Dict("a" => 1), Dict("a" => 2)]
2-element Array{Dict{String,Int64},1}:
 Dict("a" => 1)
 Dict("a" => 2)

julia> getindex.(dict_arr, "a") # broadcast getindex
2-element Array{Int64,1}:
 1
 2

Re MWE: a very fair point. Here’s the first few rows (minus a lot of columns not needed for this discussion):

id value year geometry
1 0.445591241121292 2015 {“type”:“Point”,“coordinates”:[-3.7361068677412166,58.39179006274112]}
2 0.441427767276764 2014 {“type”:“Point”,“coordinates”:[-3.7371190836803483,58.388378850435224]}
3 0.54601401090622 2015 {“type”:“Point”,“coordinates”:[-3.736369954703371,58.39074663309461]}
4 0.473496675491333 2016 {“type”:“Point”,“coordinates”:[-3.73638333200653,58.39090716073254]}
5 0.658025324344635 2017 {“type”:“Point”,“coordinates”:[-3.7366776326760576,58.39040774141455]}
6 0.508873164653778 2015 {“type”:“Point”,“coordinates”:[-3.736187131560178,58.390809060509355]}
7 0.403048127889633 2019 {“type”:“Point”,“coordinates”:[-3.7353265250568626,58.38924391603958]}
8 0.577482640743256 2019 {“type”:“Point”,“coordinates”:[-3.7391256791544,58.38983251737866]}
9 0.607733011245728 2015 {“type”:“Point”,“coordinates”:[-3.738871510394354,58.38889610615743]}
10 0.642166316509247 2014 {“type”:“Point”,“coordinates”:[-3.7384033047837417,58.38941336187963]}
11 0.591114699840546 2015 {“type”:“Point”,“coordinates”:[-3.738430059390062,58.389694285246]}
12 0.447087287902832 2016 {“type”:“Point”,“coordinates”:[-3.735915126395919,58.3906619101746]}

I need to get the coordinates into their own separate columns (one column for each member of the coordinate pairs).

Thanks for that, but I’m struggling to get this into a DataFrame format - you could try getting it into a copy-pasteable string representation in the way I’ve suggested in this SO answer

Well, I’ve just learnt something else :smile: and here is the data:

“id,value,year,geometry\n1,0.445591241121292,2015,"{""type"":""Point"",""coordinates"":[-3.7361068677412166,58.39179006274112]}"\n2,0.441427767276764,2014,"{""type"":""Point"",""coordinates"":[-3.7371190836803483,58.388378850435224]}"\n3,0.54601401090622,2015,"{""type"":""Point"",""coordinates"":[-3.736369954703371,58.39074663309461]}"\n4,0.473496675491333,2016,"{""type"":""Point"",""coordinates"":[-3.73638333200653,58.39090716073254]}"\n5,0.658025324344635,2017,"{""type"":""Point"",""coordinates"":[-3.7366776326760576,58.39040774141455]}"\n6,0.508873164653778,2015,"{""type"":""Point"",""coordinates"":[-3.736187131560178,58.390809060509355]}"\n7,0.403048127889633,2019,"{""type"":""Point"",""coordinates"":[-3.7353265250568626,58.38924391603958]}"\n8,0.577482640743256,2019,"{""type"":""Point"",""coordinates"":[-3.7391256791544,58.38983251737866]}"\n9,0.607733011245728,2015,"{""type"":""Point"",""coordinates"":[-3.738871510394354,58.38889610615743]}"\n10,0.642166316509247,2014,"{""type"":""Point"",""coordinates"":[-3.7384033047837417,58.38941336187963]}"\n11,0.591114699840546,2015,"{""type"":""Point"",""coordinates"":[-3.738430059390062,58.389694285246]}"\n12,0.447087287902832,2016,"{""type"":""Point"",""coordinates"":[-3.735915126395919,58.3906619101746]}"\n”

1 Like

If you do not use triple quotes blocks (i.e., ```text```) then your "" are replaced “”, and so will break the code of anyone trying to parse from a cut-and-paste.

1 Like

Great thanks, what you want is:

julia> first.(getindex.(JSON.parse.(data.geometry), "coordinates"))
12-element Array{Float64,1}:
 -3.7361068677412166
 -3.7371190836803483
 -3.736369954703371
 -3.73638333200653
 -3.7366776326760576
 -3.736187131560178
 -3.7353265250568626
 -3.7391256791544
 -3.738871510394354
 -3.7384033047837417
 -3.738430059390062
 -3.735915126395919

julia> last.(getindex.(JSON.parse.(data.geometry), "coordinates"))
12-element Array{Float64,1}:
 58.39179006274112
 58.388378850435224
 58.39074663309461
 58.39090716073254
 58.39040774141455
 58.390809060509355
 58.38924391603958
 58.38983251737866
 58.38889610615743
 58.38941336187963
 58.389694285246
 58.3906619101746

While first and last work in this case, you could generalize this by replacing it with getindex.(getindex.(JSON.parse.(data.geometry), "coordinates"), n) where n is the index you need (i.e. 1 or 2 in your case).

1 Like

Awesome! Thank you so much. So I just needed to dot more! I had assumed that you only dot the first one because the JSON.parse in this case would be just on the “current” getindex. row. I was getting errors trying that which said the column lengths need to match. Your example explains why. The inner dot makes an array of all rows parsed as a dictionary. The ‘middle’ dot makes an array of the geometry index and the outer dot then broadcasts over all the elements of that array. That is very helpful to see this as a nested example.

The ‘middle’ dot makes an array of the geometry index and the outer dot then broadcasts over all the elements of that array.

Note that no intermediate arrays are created when you use multiple dots in a single broadcasting expression. That is, both of the two expressions below avoid creating any intermediate arrays.

first.(getindex.(JSON.parse.(data.geometry), "coordinates"))
broadcast(x -> first(getindex(JSON.parse(x), "coordinates")), data.geometry)

It’s a very handy feature of the “dot” syntax.

1 Like

Aah! OK that sounds much more efficient. Thanks again!

Interestingly, in all the tutorials on Julia I have done, this is the first time I have seen broadcast as a function name. The examples of broadcasting always use the dotted syntax and never have more than one dot. I find the form using the broadcast function more readable and intuitive just because there are so many dots with two different meanings in the first example - presumably that’s just a syntactic style thing and there is no performance hit of using one over the other?

You can also use @. in front of an expression (or even in front of a line) and everything gets dotted e.g.:

x = range(0, 1, length=100)
y = @. exp(x * cos(x^2))  # equiv to exp.(x .* cos.(x.^2))

( ?@. for more information)

1 Like

That’s brilliant! I came to this forum with a question about weakrefstrings but am walking away with so much more practical info.