using Markdown
using InteractiveUtils
using LinearAlgebra
using Statistics
using MLJ
using CSV
using MultivariateStats
import DataFrames: DataFrame, select, Not, describe
X = CSV.read("zoo.data"; header=[:name, :hair, :feathers, :eggs, :milk, :airborne, :aquatic, :predator, :toothed, :backbone, :breathes, :venomous, :fins, :legs, :tail, :domestic, :catsize, :type])
X_type = select(X, :type)
X = select(X, Not([:type, :name]))
M = fit(PCA, X)
but it gives the error "ERROR: LoadError: MethodError: no method matching fit(::Type{PCA}, ::DataFrame)
Closest candidates are:
fit(!Matched::Type{StatsBase.Histogram}, ::Any…; kwargs…) at …
fit(!Matched::StatsBase.StatisticalModel, ::Any…) at …
fit(!Matched::Type{D}, ::Any) where D<:Distributions.Distribution at …
…
"
I tried converting the DataFrame to an array, but that gave the same error, so I must be misunderstanding something. Any ideas?
That works! But does the input have to be floating point? Most columns are boolean, but passing a boolean matrix gives an error:
X = iszero.(select(X, Not([:type, :name])))
fit(PCA, Matrix(X))
>>
TypeError: in typeassert, expected Array{Bool,1}, got a value of type Array{Float64,1}
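For reference, here is a minimal sketch of a conversion that should avoid both errors. It assumes the problem is only the input type: `fit(PCA, ...)` has no method for a DataFrame, and with a `Bool` matrix the computed mean is a `Float64` vector, which presumably cannot be stored in a model typed on `Bool`. The tiny table below is made-up stand-in data, not the real zoo.data file, and `df`, `data`, `P` and `vars` are names chosen for illustration. Note that MultivariateStats treats each column as one observation, so the matrix is transposed after conversion.

```julia
using DataFrames
using MultivariateStats

# Hypothetical stand-in for a few boolean zoo columns (6 animals, 3 features).
df = DataFrame(
    hair     = [true, true, false, false, true, false],
    feathers = [false, false, true, true, false, false],
    aquatic  = [false, true, false, true, false, true],
)

# Convert Bool -> Float64, then transpose so that rows are features
# and columns are observations, which is the layout fit(PCA, ...) expects.
data = permutedims(Float64.(Matrix(df)))   # 3 features x 6 observations

# Fit PCA, keeping at most 2 components.
M = fit(PCA, data; maxoutdim=2)

P = projection(M)        # 3 x k matrix whose columns are the principal directions
vars = principalvars(M)  # variance captured by each retained component
```

To project the data onto the components, recent versions of MultivariateStats use `predict(M, data)`, while older ones use `transform(M, data)`; which one applies depends on the installed version.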
Are you sure you want to perform a standard PCA on boolean data?
You may want to consider something PCA-like from https://github.com/madeleineudell/LowRankModels.jl
For boolean data they suggest a LogisticLoss. They also support fitting data frames directly.
Yes, using PCA on this dataset is part of a school assignment; I will use other methods later. The link looks very useful, and I will give it a go tomorrow.
Thank you!