Queryverse queries lose type information?

dlakelan · June 1, 2020, 6:03pm

I’m working on a tutorial and have the following code:

dflong2 = DataFrame(dflong |> @mutate(year = tryparse(Int,String(_.variable)[end-3:end])))

obviously I’m modifying the year column to be the result of trying to parse the last 4 characters of the variable column into an integer because someone at the Census decided that a wide form table with POPESTIMATE2010, POPESTIMATE2011,… was a good idea.

EDIT: That actually works… But then when I go to strip the year off the symbols in dflong2 I get:

pop = dflong2 |> @filter(! (_.year isa Nothing))|> @map(variable = Symbol(String(_.variable)[1:end-3]))|>DataFrame

It doesn’t work… Why? Because although each column in dflong has a particular type, after converting the query pipeline to a DataFrame == dflong2, the fields have type DataValue{Any} and although the variable field is always a symbol, I can’t convert a symbol to a string when Julia doesn’t know what type it is:

MethodError: no method matching String(::DataValue{Any})
Closest candidates are:
  String(!Matched::String) at boot.jl:321
  String(!Matched::Array{UInt8,1}) at strings/string.jl:39
  String(!Matched::Base.CodeUnits{UInt8,String}) at strings/string.jl:77
  ...

The complete loss of all type information seems to be a non-starter for Query! I can’t believe that’s correct, so what am I doing wrong?

nilshg · June 1, 2020, 6:43pm

Not sure I understand - are POPESTIMATE2010 column names? And you’ve already stacked the dataframe so now you have a column that holds all the former column names?

I’d probably just do

parse.(Int, replace.(string.(df.variable), "POPESTIMATE" => ""))

(I can’t really comment on the Query solution as I find base DataFrames flexible enough for my needs, but I’m sure someone will be along to solve that problem as well shortly!)

dlakelan · June 1, 2020, 6:47pm

Yes, that’s correct. I had an error in my original post and made some edits, please see that… The parse that I did in the first query pipeline did work… but then when I go to modify the resulting data frame, I want to take the symbols POPESTIMATE2010 and soforth and strip off all the digits at the end, so that I have POPESTIMATE as the key, and the year is already a separate column… and it doesn’t work because I can’t convert an Any to a String and in the first pipeline it lost all type info!

Also, to give some more background. As is typical for data put out by Govt entities, they make it easy for Excel users, and HARD for the real pros. In fact, there are about 5 or 6 variables each of which has a series of 10 columns for the data 2010 through 2019. So I can’t just strip out “POPESTIMATE” because there is also “BIRTHS” and “DEATHS” and “MIGRATIONOUT” or whatever… So the first thing I did was run through the column stripping off the last 4 digits, and converting it to a year. so each row has an associated year… and then I decided I wanted to drop the year from the end of the symbols… and it was at that point that I discovered all the type info was lost.

nilshg · June 1, 2020, 7:04pm

You can retrain the first part of the information by splitting:

julia> using DataFrames 

julia> df = DataFrame(variable = ["POPESTIMATE2018", "POPESTIMATE2019", "BIRTHS2018", "BIRTHS2019"])
4×1 DataFrame
│ Row │ variable        │
│     │ String          │
├─────┼─────────────────┤
│ 1   │ POPESTIMATE2018 │
│ 2   │ POPESTIMATE2019 │
│ 3   │ BIRTHS2018      │
│ 4   │ BIRTHS2019      │

julia> using Dates

julia> df[!, "year"] = Date.(last.(split.(df.variable, "20"))) .+ Year(2000);

julia> df[!, "variable"] = first.(split.(df.variable, "20"));

julia> df
4×2 DataFrame
│ Row │ variable    │ year       │
│     │ SubString…  │ Date       │
├─────┼─────────────┼────────────┤
│ 1   │ POPESTIMATE │ 2018-01-01 │
│ 2   │ POPESTIMATE │ 2019-01-01 │
│ 3   │ BIRTHS      │ 2018-01-01 │
│ 4   │ BIRTHS      │ 2019-01-01 │

dlakelan · June 1, 2020, 7:39pm

That’s a nice method, and it solves the problem of how to do this particular thing, sort of… undoubtedly someone is going to do POP18-20_2010, POP18-20_2011 because they are always out to get you. The only guarantee I have is that if the last 4 characters of the string are digits, then they’re the year.

The real question here though is, why when I run the Query pipeline do I lose all my type information? Because in the long run I want to use Query pipelines, but not if they result in information that has no type and can’t be modified or computed with.

I see that the simplest case doesn’t lose type info:

 df
3×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ foo    │
│ 2   │ 2     │ bar    │
│ 3   │ 3     │ baz    │

julia>  df |> @mutate(b=String(_.b)) |> DataFrame
3×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ foo    │
│ 2   │ 2     │ bar    │
│ 3   │ 3     │ baz    │

But something destroyed the types in my more complex query. Sigh…

nilshg · June 1, 2020, 7:41pm

I only just see that you’re using String - this is a type constructor, for conversion of a wider range of inputs into a String object you should use the string function (lowercase, which by convention in Julia is a function name, as compared to uppercase which is conventionally use to denote types)

dlakelan · June 1, 2020, 8:50pm

Here’s where things went wrong:

dflong.year=fill(-1,nrow(dflong))
display(first(dflong,10))
dflong2 = DataFrame(dflong |> @mutate(year = tryparse(Int,string(_.variable)[end-3:end])))
display(first(dflong2,10))

The first display shows a table with data types (copied and pasted from Jupyter, sorry if the formatting is not great)

variable	value	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME
	Symbol	Float64	Int64	Int64	Int64	Int64	Int64	String	String
1	CENSUS2010POP	4.77974e6	40	3	6	1	0	Alabama	Alabama
2	CENSUS2010POP	54571.0	50	3	6	1	1	Alabama	Autauga County
3	CENSUS2010POP	182265.0	50	3	6	1	3	Alabama	Baldwin County
4	CENSUS2010POP	27457.0	50	3	6	1	5	Alabama	Barbour County
5	CENSUS2010POP	22915.0	50	3	6	1	7	Alabama	Bibb County
6	CENSUS2010POP	57322.0	50	3	6	1	9	Alabama	Blount County
7	CENSUS2010POP	10914.0	50	3	6	1	11	Alabama	Bullock County
8	CENSUS2010POP	20947.0	50	3	6	1	13	Alabama	Butler County
9	CENSUS2010POP	118572.0	50	3	6	1	15	Alabama	Calhoun County
10	CENSUS2010POP	34215.0	50	3	6	1	17	Alabama	Chambers County

The second display shows a table with no datatypes in any column:

variable	value	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME
	Any	Any	Any	Any	Any	Any	Any	Any	Any
1	CENSUS2010POP	4.77974e6	40	3	6	1	0	Alabama	Alabama
2	CENSUS2010POP	54571.0	50	3	6	1	1	Alabama	Autauga County
3	CENSUS2010POP	182265.0	50	3	6	1	3	Alabama	Baldwin County
4	CENSUS2010POP	27457.0	50	3	6	1	5	Alabama	Barbour County
5	CENSUS2010POP	22915.0	50	3	6	1	7	Alabama	Bibb County
6	CENSUS2010POP	57322.0	50	3	6	1	9	Alabama	Blount County
7	CENSUS2010POP	10914.0	50	3	6	1	11	Alabama	Bullock County
8	CENSUS2010POP	20947.0	50	3	6	1	13	Alabama	Butler County
9	CENSUS2010POP	118572.0	50	3	6	1	15	Alabama	Calhoun County
10	CENSUS2010POP	34215.0	50	3	6	1	17	Alabama	Chambers County

As soon as everything is an “Any” then as you pointed out, String won’t construct a string from a Symbol stored in an Any…

As you pointed out lower case string does do the conversion but it doesn’t do it correctly. Instead of giving the characters that make up the symbol name, it gives the string “DataValue{Any}(:SYMBOLHERE)” which then gets stripped by my conversion, incorrectly…

pop = dflong2 |> @filter(! (_.year isa Nothing))|> @mutate(variable = Symbol(string(_.variable)[1:end-3]))|>DataFrame
display(first(pop,3))

	variable	value	SUMLEV	REGION	DIVISION	STATE	COUNTY
	Symbol	Any	Any	Any	Any	Any	Any
1	DataValue{Any}(:CENSUS2010P	4.77974e6	40	3	6	1	0
2	DataValue{Any}(:CENSUS2010P	54571.0	50	3	6	1	1
3	DataValue{Any}(:CENSUS2010P	182265.0	50	3	6	1	3

So the data becomes fairly useless as soon as everything becomes an Any

nilshg · June 1, 2020, 9:12pm

I never got around to figuring out this whole DataValue business in the queryverse - we might have to summon @davidanthoff to shed light on this

dlakelan · June 1, 2020, 9:17pm

I have a suspicion that tryparse, which can return all kinds of data types, might be forcing a conversion to DataValue{Any}. I’ll see if I can make that type-stable and see what results.

learning … learning…

Indeed:

dflong2 = DataFrame(dflong |> @mutate(year = something(tryparse(Int,string(_.variable)[end-3:end]),0)))
display(first(dflong2,10))

gives a table with proper data types.

In this case it’s OK to encode meaningless data as 0. But I find it weird that I lose ALL type info rather than just the particular column “year”.

Topic		Replies	Views
DataFrames: convert column data type Data type , dataframes , convert	65	28258	May 11, 2023
Query @mutate changes other column type New to Julia query , dataframes , csv	8	1099	May 13, 2020
Bug with Union type in Query.jl? Data	2	487	May 5, 2019
How to convert String type columns of a dataframe to a DateTime type and a Float64 type, using either plain DataFrames.jl or Query.jl Data	4	1410	January 18, 2021
Query.jl @mutate command does not preserve column types General Usage	4	612	June 12, 2020

Queryverse queries lose type information?

Related topics