I have a dataframe that I read from a CSV file, for example:
df = CSV.read("myfile.csv", header=1, select=(i, name) -> i < 5)
The parser automatically infers the correct type of each column:
92×5 DataFrame
│ Row │ neck │ pfat │ weight │ activity │ pfat_weight │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼──────────┼─────────────┤
│ 1 │ 0.934 │ 25.3 │ 52.1631 │ 3508.44 │ 1319.73 │
│ 2 │ 0.888 │ 29.3 │ 61.802 │ 2773.54 │ 1810.8 │
│ 3 │ 0.933 │ 37.7 │ 93.44 │ 1738.97 │ 3522.69 │
│ 4 │ 0.757 │ 32.8 │ 59.8742 │ 1665.29 │ 1963.87 │
Now suppose I mutate this data using @mutate from Query.jl. I basically want to standardize some of the columns (really just center them, i.e. x_i - mean(x)).
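To be concrete about the operation I mean, here is a minimal sketch on a plain vector. The values are made up for illustration (they are not read from my file), and it only uses the Statistics standard library:

```julia
using Statistics  # standard library; provides mean

# Made-up values for illustration, not rows from my file
x = [25.3, 29.3, 37.7, 32.8]

# Center the vector: subtract its mean from every element (broadcast with .-)
x_s = x .- mean(x)

# On a plain Vector{Float64} the result keeps a concrete element type
println(eltype(x_s))
```

With Query.jl, I tried to do the same thing per column like this: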
df_s = df |> @mutate(pfat_s = _.pfat - mean(df.pfat),
weight_s = _.weight - mean(df.weight),
activity_s = _.activity - mean(df.activity),
pfat_weight_s = _.pfat_weight - mean(df.pfat_weight)) |> DataFrame
Notice the mean(df.pfat) in there. I am not sure this is efficient, since the mean may be recomputed for every row. Anyway, let's look at the resulting column types:
julia> df_s
92×9 DataFrame
│ Row │ neck │ pfat │ weight │ activity │ pfat_weight │ pfat_s │ weight_s │ activity_s │ pfat_weight_s │
│ │ Any │ Any │ Any │ Any │ Any │ Any │ Any │ Any │ Any │
├─────┼───────┼──────┼─────────┼──────────┼─────────────┼──────────┼──────────┼────────────┼───────────────┤
│ 1 │ 0.934 │ 25.3 │ 52.1631 │ 3508.44 │ 1319.73 │ -3.26522 │ -1.76507 │ 946.45 │ -307.046 │
│ 2 │ 0.888 │ 29.3 │ 61.802 │ 2773.54 │ 1810.8 │ 0.734783 │ 7.87377 │ 211.55 │ 184.024 │
│ 3 │ 0.933 │ 37.7 │ 93.44 │ 1738.97 │ 3522.69 │ 9.13478 │ 39.5118 │ -823.02 │ 1895.92 │
│ 4 │ 0.757 │ 32.8 │ 59.8742 │ 1665.29 │ 1963.87 │ 4.23478 │ 5.946 │ -896.7 │ 337.1 │
Why did all the columns change to type Any? Even the original columns, whose types were inferred correctly, are now typed as Any.
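To show what I mean by a column being typed as Any rather than Float64, here is a small sketch with plain arrays. The values are hypothetical (not my data), and I am only assuming that the mutated columns behave like the Any-typed array here. Converting back by hand seems possible, but it feels like a workaround rather than an explanation:

```julia
# Hypothetical values, not my data
typed   = [25.3, 29.3, 37.7]     # element type is inferred as Float64
untyped = Any[25.3, 29.3, 37.7]  # same values, but the container is typed Any

println(eltype(typed))       # element type of the concretely typed vector
println(eltype(untyped))     # element type of the Any-typed vector
println(typeof(untyped[1]))  # type of an individual stored value

# Manual workaround: convert each element back to Float64
narrowed = Float64.(untyped)
println(eltype(narrowed))
```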
I also noticed a small, subtle difference between the two dataframes. If I select a column from the original dataframe,
julia> df.pfat
92-element CSV.Column{Float64,Float64}:
the type is CSV.Column. If I select a column from the new, mutated dataframe, the type is
julia> df_s.pfat
92-element Array{Any,1}:
Why this subtle difference?
Edit: I realized that CSV.read doesn't really return a DataFrame; I have to do |> DataFrame. So that actually explains the small difference I see in column types.