MultivariateStats: no method matching fit for PCA and DataFrame

Hi! I am currently learning both ML and Julia. I am looking at the Zoo Data Set from the UCI Machine Learning Repository and want to perform PCA on it.
After reading the example in the Principal Component Analysis section of the MultivariateStats 0.1.0 documentation, I expected the following to work.

using Markdown
using InteractiveUtils
using LinearAlgebra
using Statistics
using MLJ
using CSV
using MultivariateStats
import DataFrames: DataFrame, select, Not, describe

X = CSV.read("zoo.data"; header=[:name, :hair, :feathers, :eggs, :milk, :airborne, :aquatic, :predator, :toothed, :backbone, :breathes, :venomous, :fins, :legs, :tail, :domestic, :catsize, :type])
X_type = select(X, :type)
X = select(X, Not([:type, :name]))

M = fit(PCA, X)

but it gives the error:

ERROR: LoadError: MethodError: no method matching fit(::Type{PCA}, ::DataFrame)
Closest candidates are:
fit(!Matched::Type{StatsBase.Histogram}, ::Any...; kwargs...) at …
fit(!Matched::StatsBase.StatisticalModel, ::Any...) at …
fit(!Matched::Type{D}, ::Any) where D<:Distributions.Distribution at …

I tried converting the dataframe to an array, but it gave the same result, so I must clearly be misunderstanding something. Any ideas?

It should work if you convert the DataFrame to a Matrix:

julia> using DataFrames, MLJ, MultivariateStats

julia> data = DataFrame(rand(100, 5));

julia> fit(PCA, Matrix(data))
PCA(indim = 100, outdim = 4, principalratio = 1.0)

I also loaded MLJ here because I suspected that it exports its own fit function, which could have clashed with the one from MultivariateStats, but that turns out not to be the case.

That works! But do the elements have to be floats? Most columns are Boolean, and passing them as Bool gives an error.

X = iszero.(select(X, Not([:type, :name])))
fit(PCA, Matrix(X))
>>
TypeError: in typeassert, expected Array{Bool,1}, got a value of type Array{Float64,1}

Thank you!
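
For reference, a minimal sketch of one way around the TypeError above, assuming it is caused by the Bool element type: the fit computes a floating-point mean, so converting the elements to Float64 before fitting should avoid the typeassert. The random matrix B is made-up stand-in data, not the zoo data set, and its size (100 observations, 16 features) is only an illustration.

```julia
using MultivariateStats

# Hypothetical stand-in data: 100 observations (rows), 16 Boolean features (columns).
B = rand(Bool, 100, 16)

# Convert the element type from Bool to Float64.
Xf = Float64.(B)

# fit expects one observation per column, so swap the dimensions first.
M = fit(PCA, permutedims(Xf))

# The projection matrix has one row per input feature.
println(size(projection(M), 1))
```

The same conversion should apply to integer columns such as the leg count.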

Are you sure you want to perform a standard PCA on Boolean data?
You could consider something PCA-like from LowRankModels.jl (https://github.com/madeleineudell/LowRankModels.jl).
For Boolean data they suggest a LogisticLoss. They also support fitting data frames directly.

Yes, using PCA on this dataset is part of a school assignment; I will use other methods later. The link looks very useful, and I will give it a go tomorrow.
Thank you!

Check out the dimensions at the end, here.

fit(PCA, ...) treats each column of the matrix as one observation, so for a DataFrame with one observation per row you have to pass transpose(Matrix(data)).
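
To illustrate the orientation point, here is a small sketch with random data. The matrix A and its size are assumptions chosen for the example; permutedims is used instead of transpose only because it returns a plain Matrix rather than a lazy wrapper.

```julia
using MultivariateStats

# Hypothetical data: 100 observations (rows) of 20 features (columns),
# the layout that Matrix(df) produces for a typical DataFrame.
A = rand(100, 20)

# fit treats each column as one observation, so swap the dimensions first.
M = fit(PCA, permutedims(A); maxoutdim = 5)

# The projection matrix has one row per input feature
# and one column per principal component that was kept.
println(size(projection(M)))
```

Without the swap, the fit would treat the 20 columns as 20 observations of a 100-dimensional variable, which is what the indim = 100 in the earlier output reflects.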

Does transpose(Matrix(data)) work for you? I cannot get it to work.

Works for me:

julia> df = DataFrame(rand(100, 20));

julia> fit(PCA, transpose(Matrix(df)), maxoutdim = 5)
PCA(indim = 20, outdim = 5, principalratio = 0.4258331088614564)

Sorry, I should clarify. That works for me too, but trying to use it on the dataframe X in the original post does not.
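
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the random example holds Float64 values, whereas the columns of X are integers or Booleans, so the element type may be what still trips up the fit. A sketch of the full pipeline on a made-up miniature of the table (the column names follow the original post, but the values are invented):

```julia
using DataFrames, MultivariateStats

# Hypothetical miniature of the zoo table; the values are made up.
df = DataFrame(name = ["a", "b", "c", "d", "e"],
               hair = [1, 0, 1, 0, 1],
               legs = [4, 2, 0, 4, 6],
               tail = [1, 1, 0, 1, 0],
               type = [1, 2, 4, 1, 6])

# Drop the label and identifier columns, as in the original post.
features = select(df, Not([:type, :name]))

# Convert to a Float64 matrix and put one observation in each column.
Xmat = permutedims(Float64.(Matrix(features)))

M = fit(PCA, Xmat)

# One row of the projection matrix per remaining feature (hair, legs, tail).
println(size(projection(M), 1))
```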