I’m working on a tutorial and have the following code:
dflong2 = DataFrame(dflong |> @mutate(year = tryparse(Int,String(_.variable)[end-3:end])))
obviously I’m modifying the year column to be the result of trying to parse the last 4 characters of the variable column into an integer because someone at the Census decided that a wide form table with POPESTIMATE2010, POPESTIMATE2011,… was a good idea.
EDIT: That actually works… But then when I go to strip the year off the symbols in dflong2 I get:
pop = dflong2 |> @filter(! (_.year isa Nothing))|> @map(variable = Symbol(String(_.variable)[1:end-3]))|>DataFrame
It doesn’t work… Why? Because although each column in dflong has a particular type, after converting the query pipeline to a DataFrame == dflong2, the fields have type DataValue{Any} and although the variable field is always a symbol, I can’t convert a symbol to a string when Julia doesn’t know what type it is:
MethodError: no method matching String(::DataValue{Any})
Closest candidates are:
String(!Matched::String) at boot.jl:321
String(!Matched::Array{UInt8,1}) at strings/string.jl:39
String(!Matched::Base.CodeUnits{UInt8,String}) at strings/string.jl:77
...
The complete loss of all type information seems to be a non-starter for Query! I can’t believe that’s correct, so what am I doing wrong?
Not sure I understand - are POPESTIMATE2010
column names? And you’ve already stacked the dataframe so now you have a column that holds all the former column names?
I’d probably just do
parse.(Int, replace.(string.(df.variable), "POPESTIMATE" => ""))
(I can’t really comment on the Query solution as I find base DataFrames flexible enough for my needs, but I’m sure someone will be along to solve that problem as well shortly!)
Yes, that’s correct. I had an error in my original post and made some edits, please see that… The parse that I did in the first query pipeline did work… but then when I go to modify the resulting data frame, I want to take the symbols POPESTIMATE2010 and soforth and strip off all the digits at the end, so that I have POPESTIMATE as the key, and the year is already a separate column… and it doesn’t work because I can’t convert an Any to a String and in the first pipeline it lost all type info!
Also, to give some more background. As is typical for data put out by Govt entities, they make it easy for Excel users, and HARD for the real pros. In fact, there are about 5 or 6 variables each of which has a series of 10 columns for the data 2010 through 2019. So I can’t just strip out “POPESTIMATE” because there is also “BIRTHS” and “DEATHS” and “MIGRATIONOUT” or whatever… So the first thing I did was run through the column stripping off the last 4 digits, and converting it to a year. so each row has an associated year… and then I decided I wanted to drop the year from the end of the symbols… and it was at that point that I discovered all the type info was lost.
You can retrain the first part of the information by splitting:
julia> using DataFrames
julia> df = DataFrame(variable = ["POPESTIMATE2018", "POPESTIMATE2019", "BIRTHS2018", "BIRTHS2019"])
4×1 DataFrame
│ Row │ variable │
│ │ String │
├─────┼─────────────────┤
│ 1 │ POPESTIMATE2018 │
│ 2 │ POPESTIMATE2019 │
│ 3 │ BIRTHS2018 │
│ 4 │ BIRTHS2019 │
julia> using Dates
julia> df[!, "year"] = Date.(last.(split.(df.variable, "20"))) .+ Year(2000);
julia> df[!, "variable"] = first.(split.(df.variable, "20"));
julia> df
4×2 DataFrame
│ Row │ variable │ year │
│ │ SubString… │ Date │
├─────┼─────────────┼────────────┤
│ 1 │ POPESTIMATE │ 2018-01-01 │
│ 2 │ POPESTIMATE │ 2019-01-01 │
│ 3 │ BIRTHS │ 2018-01-01 │
│ 4 │ BIRTHS │ 2019-01-01 │
That’s a nice method, and it solves the problem of how to do this particular thing, sort of… undoubtedly someone is going to do POP18-20_2010, POP18-20_2011 because they are always out to get you. The only guarantee I have is that if the last 4 characters of the string are digits, then they’re the year.
The real question here though is, why when I run the Query pipeline do I lose all my type information? Because in the long run I want to use Query pipelines, but not if they result in information that has no type and can’t be modified or computed with.
I see that the simplest case doesn’t lose type info:
df
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1 │ 1 │ foo │
│ 2 │ 2 │ bar │
│ 3 │ 3 │ baz │
julia> df |> @mutate(b=String(_.b)) |> DataFrame
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ foo │
│ 2 │ 2 │ bar │
│ 3 │ 3 │ baz │
But something destroyed the types in my more complex query. Sigh…
I only just see that you’re using String
- this is a type constructor, for conversion of a wider range of inputs into a String
object you should use the string
function (lowercase, which by convention in Julia is a function name, as compared to uppercase which is conventionally use to denote types)
1 Like
Here’s where things went wrong:
dflong.year=fill(-1,nrow(dflong))
display(first(dflong,10))
dflong2 = DataFrame(dflong |> @mutate(year = tryparse(Int,string(_.variable)[end-3:end])))
display(first(dflong2,10))
The first display shows a table with data types (copied and pasted from Jupyter, sorry if the formatting is not great)
variable value SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME
Symbol Float64 Int64 Int64 Int64 Int64 Int64 String String
1 CENSUS2010POP 4.77974e6 40 3 6 1 0 Alabama Alabama
2 CENSUS2010POP 54571.0 50 3 6 1 1 Alabama Autauga County
3 CENSUS2010POP 182265.0 50 3 6 1 3 Alabama Baldwin County
4 CENSUS2010POP 27457.0 50 3 6 1 5 Alabama Barbour County
5 CENSUS2010POP 22915.0 50 3 6 1 7 Alabama Bibb County
6 CENSUS2010POP 57322.0 50 3 6 1 9 Alabama Blount County
7 CENSUS2010POP 10914.0 50 3 6 1 11 Alabama Bullock County
8 CENSUS2010POP 20947.0 50 3 6 1 13 Alabama Butler County
9 CENSUS2010POP 118572.0 50 3 6 1 15 Alabama Calhoun County
10 CENSUS2010POP 34215.0 50 3 6 1 17 Alabama Chambers County
The second display shows a table with no datatypes in any column:
variable value SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME
Any Any Any Any Any Any Any Any Any
1 CENSUS2010POP 4.77974e6 40 3 6 1 0 Alabama Alabama
2 CENSUS2010POP 54571.0 50 3 6 1 1 Alabama Autauga County
3 CENSUS2010POP 182265.0 50 3 6 1 3 Alabama Baldwin County
4 CENSUS2010POP 27457.0 50 3 6 1 5 Alabama Barbour County
5 CENSUS2010POP 22915.0 50 3 6 1 7 Alabama Bibb County
6 CENSUS2010POP 57322.0 50 3 6 1 9 Alabama Blount County
7 CENSUS2010POP 10914.0 50 3 6 1 11 Alabama Bullock County
8 CENSUS2010POP 20947.0 50 3 6 1 13 Alabama Butler County
9 CENSUS2010POP 118572.0 50 3 6 1 15 Alabama Calhoun County
10 CENSUS2010POP 34215.0 50 3 6 1 17 Alabama Chambers County
As soon as everything is an “Any” then as you pointed out, String won’t construct a string from a Symbol stored in an Any…
As you pointed out lower case string
does do the conversion but it doesn’t do it correctly. Instead of giving the characters that make up the symbol name, it gives the string “DataValue{Any}(:SYMBOLHERE)” which then gets stripped by my conversion, incorrectly…
pop = dflong2 |> @filter(! (_.year isa Nothing))|> @mutate(variable = Symbol(string(_.variable)[1:end-3]))|>DataFrame
display(first(pop,3))
variable value SUMLEV REGION DIVISION STATE COUNTY
Symbol Any Any Any Any Any Any
1 DataValue{Any}(:CENSUS2010P 4.77974e6 40 3 6 1 0
2 DataValue{Any}(:CENSUS2010P 54571.0 50 3 6 1 1
3 DataValue{Any}(:CENSUS2010P 182265.0 50 3 6 1 3
So the data becomes fairly useless as soon as everything becomes an Any
I never got around to figuring out this whole DataValue
business in the queryverse - we might have to summon @davidanthoff to shed light on this
I have a suspicion that tryparse, which can return all kinds of data types, might be forcing a conversion to DataValue{Any}. I’ll see if I can make that type-stable and see what results.
learning … learning…
Indeed:
dflong2 = DataFrame(dflong |> @mutate(year = something(tryparse(Int,string(_.variable)[end-3:end]),0)))
display(first(dflong2,10))
gives a table with proper data types.
In this case it’s OK to encode meaningless data as 0. But I find it weird that I lose ALL type info rather than just the particular column “year”.