Query.jl @mutate command does not preserve column types

I have a dataframe that I read from a CSV file, i.e.

df = CSV.read("myfile.csv", header=1, select=(i, name) -> i < 5)

The parser automatically infers the correct type of each column:

92×5 DataFrame
│ Row │ neck    │ pfat    │ weight  │ activity │ pfat_weight │
│     │ Float64 │ Float64 │ Float64 │ Float64  │ Float64     │
├─────┼─────────┼─────────┼─────────┼──────────┼─────────────┤
│ 1   │ 0.934   │ 25.3    │ 52.1631 │ 3508.44  │ 1319.73     │
│ 2   │ 0.888   │ 29.3    │ 61.802  │ 2773.54  │ 1810.8      │
│ 3   │ 0.933   │ 37.7    │ 93.44   │ 1738.97  │ 3522.69     │
│ 4   │ 0.757   │ 32.8    │ 59.8742 │ 1665.29  │ 1963.87     │

Now suppose I mutate this data using @mutate from Query.jl. I basically want to center some of the columns (i.e. x_i - mean(x)).

df_s = df |> @mutate(pfat_s = _.pfat - mean(df.pfat), 
              weight_s = _.weight - mean(df.weight), 
              activity_s = _.activity - mean(df.activity), 
              pfat_weight_s = _.pfat_weight - mean(df.pfat_weight)) |> DataFrame

Notice the mean(df.pfat) in there. I am not sure this is an efficient way to do it. Anyway, let's look at the resulting column types:

julia> df_s
92×9 DataFrame
│ Row │ neck  │ pfat │ weight  │ activity │ pfat_weight │ pfat_s   │ weight_s │ activity_s │ pfat_weight_s │
│     │ Any   │ Any  │ Any     │ Any      │ Any         │ Any      │ Any      │ Any        │ Any           │
├─────┼───────┼──────┼─────────┼──────────┼─────────────┼──────────┼──────────┼────────────┼───────────────┤
│ 1   │ 0.934 │ 25.3 │ 52.1631 │ 3508.44  │ 1319.73     │ -3.26522 │ -1.76507 │ 946.45     │ -307.046      │
│ 2   │ 0.888 │ 29.3 │ 61.802  │ 2773.54  │ 1810.8      │ 0.734783 │ 7.87377  │ 211.55     │ 184.024       │
│ 3   │ 0.933 │ 37.7 │ 93.44   │ 1738.97  │ 3522.69     │ 9.13478  │ 39.5118  │ -823.02    │ 1895.92       │
│ 4   │ 0.757 │ 32.8 │ 59.8742 │ 1665.29  │ 1963.87     │ 4.23478  │ 5.946    │ -896.7     │ 337.1         │

Why did all the columns change to type Any? Even the original columns which were inferred correctly are now typed as Any.



I also noticed a very small, subtle change in the two dataframes. If I select a column from the original dataframe,

julia> df.pfat
92-element CSV.Column{Float64,Float64}:

the type is CSV.Column. If I select a column from the new mutated dataframe, the type is

julia> df_s.pfat
92-element Array{Any,1}:

Why this subtle difference?
Edit: I realized that CSV.read doesn't really return a DataFrame; I have to pipe the result through |> DataFrame. That explains the small difference I see in the column types.

Well, I found out that the type change happens because of the mean function, but I am not really sure why.
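
One plausible explanation, offered as a sketch rather than a confirmed diagnosis of Query.jl internals: at the REPL, df is a non-constant global, so Julia cannot infer the type of mean(df.pfat) inside an anonymous function, and anything that relies on the inferred return type would then see Any. The snippet below uses only Base and Statistics; x, f and g are illustrative names, not part of Query.jl.

```julia
using Statistics

# A non-constant global, standing in for the `df.pfat` column.
x = [25.3, 29.3, 37.7, 32.8]

# The closure reads the global `x`, whose type could change at any time,
# so inference cannot pin down the result of `mean(x)`.
f = r -> r - mean(x)

# Computing the mean first and capturing it in a `let` block gives the
# closure a concretely typed value to work with.
g = let m = mean(x)
    r -> r - m
end

# Inspect what the compiler infers for a Float64 argument.
# Run at top level (e.g. the REPL), `f` is expected to infer as Any.
inferred_f = Base.return_types(f, (Float64,))
inferred_g = Base.return_types(g, (Float64,))
println("global lookup: ", inferred_f)
println("captured mean: ", inferred_g)
```

Both closures compute the same numbers; only the inferred return type differs, which would be consistent with every column coming back as Any.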

Indeed, I noticed this as well here: Queryverse queries lose type information? - #9 by dlakelan

I managed to figure out how to work around this in my case, but that didn't address the bigger issue. WDYT @davidanthoff? I didn't see anything obvious in the documentation about how to keep the types stable.

Well, the problem I am having is that after the @mutate command, I can't even use GLM anymore. If df_s is my mutated dataframe, then I get

julia> smodel = lm(@formula(neck ~ pfat_s + weight_s + activity_s + pfat_weight_s), df_s) 
ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,2}, ::Bool)

It seems to me that adding the mean call causes some internal problem where a column is no longer a 1-dimensional array. You can kind of see this in the error from GLM

no method matching fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,2}, ::Bool)

which shows that somewhere there is an Array{Float64, 2}.

I am not at the level to start debugging this, so hopefully someone can figure this out.
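
A possible workaround sketch, assuming the only problem is the Any element type of the columns: rebuild each column with a concrete element type before handing the table to GLM. The example uses a plain vector in place of a real column, and col, narrowed and converted are illustrative names.

```julia
# An `Any`-typed vector, standing in for a column such as `df_s.neck`.
col = Any[0.934, 0.888, 0.933, 0.757]

# Broadcasting `identity` rebuilds the vector with the narrowest element
# type that fits the values actually stored in it.
narrowed = identity.(col)

# An explicit conversion does the same and errors if a value does not fit.
converted = convert(Vector{Float64}, col)

println(typeof(narrowed))
println(typeof(converted))
```

Applied to the mutated table, this would look like df_s.neck = identity.(df_s.neck) for each affected column. Whether that alone satisfies GLM here is untested.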

I suspect that the return type of mean can't be figured out inside the anonymous function that @mutate is using. However, I'd like to suggest that you do:

df_s.pfat_s = df_s.pfat .- mean(df_s.pfat)
...

This will calculate the mean once rather than for every row, and type inference should have an easier time with it.
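
For reference, the broadcasting approach can be sketched end to end as below. This assumes DataFrames.jl 0.19 or later (for the df[!, col] syntax) and uses a small hand-built table with values copied from the rows shown earlier instead of the real CSV file; the loop and the _s suffix are just one way to organize it.

```julia
using DataFrames, Statistics

# Toy stand-in for the CSV data (first four rows from the post).
df = DataFrame(neck   = [0.934, 0.888, 0.933, 0.757],
               pfat   = [25.3, 29.3, 37.7, 32.8],
               weight = [52.1631, 61.802, 93.44, 59.8742])

# Center each column. `mean` is evaluated once per column, and the
# broadcast produces a new Vector{Float64} stored under `<name>_s`.
for col in (:pfat, :weight)
    df[!, Symbol(col, :_s)] = df[!, col] .- mean(df[!, col])
end

# Check the element types of the new columns.
println(eltype(df.pfat_s))
println(eltype(df.weight_s))
```

Because the columns are ordinary typed vectors at this point, no type information has to be recovered by inference.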

Let me know if that works.