PCA not working

Hello,
I’m using PCA from the MultivariateStats package and get the following error message (I’m working in Atom):

julia> PCA_Modell = fit(PCA, PCA_Daten)
ERROR: MethodError: no method matching fit(::Type{PCA}, ::Array{Union{Missing, Float64},2})
Closest candidates are:
  fit(::Type{Histogram}, ::Any...; kwargs...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\hist.jl:319
  fit(::StatisticalModel, ::Any...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\statmodels.jl:151
  fit(::Type{D<:Distribution}, ::Any) where D<:Distribution at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\Distributions\WHjOk\src\genericfit.jl:33
  ...
Stacktrace:
 [1] top-level scope at none:0

I’m trying to recreate an example that I did with R. In R I have to specify the expected number of components. That does not seem to be required here (I am going by the docs). The data I use comes from a DataFrame which I have converted into an array:

julia> PCA_Daten
1599×11 Array{Union{Missing, Float64},2}:
  7.4  0.7    0.0   1.9  0.076  11.0  34.0  0.9978   3.51  0.56   9.4
  7.8  0.88   0.0   2.6  0.098  25.0  67.0  0.9968   3.2   0.68   9.8
  7.8  0.76

I hope it’s not too simple, but I’m grateful for any hint!

Thanks
Günter

I think your problem might be missings - can you drop the missing values from your PCA_Daten array?

fit(PCA, rand(1_000, 10)) works fine, while

X = Array{Union{Missing, Float64}}(undef, 1_000, 10)
X[:, :] .= rand(1_000, 10)
fit(PCA, X)

produces the error you see (although X here doesn’t include missings anymore), so it seems related to the type of X.
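
To make the point about types concrete, here is a minimal sketch using only Base Julia. The function `accepts_real` is a hypothetical stand-in for a method that only accepts real-valued matrices; it is not the MultivariateStats source, and the actual signature of `fit` may differ.

```julia
# Hypothetical stand-in for a method restricted to real-valued matrices.
# This is NOT taken from MultivariateStats.
accepts_real(X::AbstractMatrix{<:Real}) = size(X)

# Same construction as above, on a smaller array.
X = Array{Union{Missing, Float64}}(undef, 3, 2)
X .= rand(3, 2)

no_missing_stored = !any(ismissing, X)          # no missing values are stored
eltype_is_real    = eltype(X) <: Real           # but the element type still allows missing
dispatch_on_X     = applicable(accepts_real, X) # dispatch looks at the type, not the values
dispatch_on_plain = applicable(accepts_real, rand(3, 2))
```

Here `dispatch_on_X` is `false` even though `no_missing_stored` is `true`, which matches the `MethodError` in the original post.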

Thanks for your quick answer. The DataFrame (as base) does not contain any missing values:

11×5 DataFrame
│ Row │ variable             │ eltype   │ nmissing │ first   │ last    │
│     │ Symbol               │ DataType │ Int64    │ Float64 │ Float64 │
├─────┼──────────────────────┼──────────┼──────────┼─────────┼─────────┤
│ 1   │ fixed_acidity        │ Float64  │ 0        │ 7.4     │ 6.0     │
│ 2   │ volatile_acidity     │ Float64  │ 0        │ 0.7     │ 0.31    │
│ 3   │ citric_acid          │ Float64  │ 0        │ 0.0     │ 0.47    │
│ 4   │ residual_sugar       │ Float64  │ 0        │ 1.9     │ 3.6     │
│ 5   │ chlorides            │ Float64  │ 0        │ 0.076   │ 0.067   │
│ 6   │ free_sulfur_dioxide  │ Float64  │ 0        │ 11.0    │ 18.0    │
│ 7   │ total_sulfur_dioxide │ Float64  │ 0        │ 34.0    │ 42.0    │
│ 8   │ density              │ Float64  │ 0        │ 0.9978  │ 0.99549 │
│ 9   │ pH                   │ Float64  │ 0        │ 3.51    │ 3.39    │
│ 10  │ sulphates            │ Float64  │ 0        │ 0.56    │ 0.66    │
│ 11  │ alcohol              │ Float64  │ 0        │ 9.4     │ 11.0    │

I see this type

Array{Union{Missing, ...

regularly when I work with DataFrames, which I always do because my data comes from Excel.

Even if I check for missing values, the check shows that there are no missing values (do I use the check function correctly?):

julia> isequal(missing, PCA_Daten)
false

Do you have another clue? Thank you!
Regards,
Günter

The documentation for describe could be improved, but it does contain this hint:

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.

I recognize that this is confusing, and have filed an issue in DataFrames here.
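
On the question of whether the check was used correctly: `isequal(missing, PCA_Daten)` compares the array as a whole with `missing`, so it returns `false` whatever the array contains. A minimal sketch of an element-wise check in Base Julia, on made-up toy data rather than the wine data:

```julia
# Toy matrix with one missing value (illustrative data only).
X = Union{Missing, Float64}[7.4 0.7; 7.8 missing]

whole_array_check = isequal(missing, X)   # false: the array itself is not `missing`
has_missing       = any(ismissing, X)     # true: at least one element is missing
n_missing         = count(ismissing, X)   # number of missing elements
allows_missing    = Missing <: eltype(X)  # true: the element type permits missing
```

`count(ismissing, X)` gives the number to quote in a report, while `Missing <: eltype(X)` only says whether the container could hold missing values.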

You can use disallowmissing!.

Thank you for this clarification! There are also no missing values, so it fits! :grinning:

Thanks for the hint! As a data analyst, I’m very interested in knowing if the record contains missing values. Then I have this info and can note in the report that “x missing values are present in the record”. For my own analysis I have to know how to deal with missing values. Therefore I am pleased about every clear and meaningful Julia message! :man_office_worker:

Maybe I don’t understand the concept of function. For example, if I want to perform factor analysis, I get the same error message:

julia> FA_Modell = fit(FactorAnalysis, FA_Daten)
ERROR: MethodError: no method matching fit(::Type{FactorAnalysis}, ::Array{Union{Missing, Float64},2})
Closest candidates are:
  fit(::Type{Histogram}, ::Any...; kwargs...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\hist.jl:319
  fit(::StatisticalModel, ::Any...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\statmodels.jl:151
  fit(::Type{D<:Distribution}, ::Any) where D<:Distribution at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\Distributions\WHjOk\src\genericfit.jl:33
  ...
Stacktrace:
 [1] top-level scope at none:0

It’s the same data and if I interpret the message correctly, the method fit won’t be found, right?

What am I doing wrong, what’s wrong? I can’t imagine being the first to use these features since the last release…

I am grateful for any hint,
Günter

You need to transform your X array to a type without missing (the error message indicates that fit cannot handle the type Array{Union{Missing, Float64},2}). This Union type only indicates that it is possible to have missing values.
If you do not have any missing value in your data you can simply use X=coalesce.(X) to get a plain Array{Float64,2} and pass it to fit.
If you have indeed missing data, you will need to decide what to do with them (drop them, replace them with other values etc…)
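
The options above can be sketched in Base Julia as follows. The data is made up, and replacing missing values with `0.0` is only a placeholder policy for illustration, not a recommendation.

```julia
# Case 1: the element type allows missing, but none is stored.
X = Union{Missing, Float64}[7.4 0.7; 7.8 0.88]

A = coalesce.(X)                  # broadcasting narrows the element type to Float64
B = convert(Matrix{Float64}, X)   # same result; throws if a missing is present

# Case 2: missing values are really there, so a policy is needed.
Y = Union{Missing, Float64}[7.4 missing; 7.8 0.88]

# Drop every row that contains at least one missing value.
keep = .!vec(any(ismissing, Y; dims = 2))
Y_dropped = convert(Matrix{Float64}, Y[keep, :])

# Or replace each missing value with a fixed number (placeholder choice).
Y_filled = coalesce.(Y, 0.0)
```

After either step the result is a `Matrix{Float64}`, which is the kind of argument that `fit(PCA, rand(1_000, 10))` was shown to accept earlier in the thread.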

Thanks for this hint, it works! The array does not contain any missing values, but the coalesce function “removes” the missing from the array.

Thank you very much!
Regards,
Günter

I still think this is all about the type of your second argument: the error message tells you that there is no method for the function fit() which takes an Array{Union{Missing, Float64},2} as an argument. The important thing to consider here is that in Julia, arrays that contain missing values are of a different type than those without missing - they are a type union of {Missing, T} (note the capitalised Missing which is the type of the value missing), where T is the type of the non-missing values.

When you read in the DataFrame, the DataFrames package automatically generates columns of the type Union{Missing, T} to accommodate potential missings in your data set. If you don’t have any, you can call disallowmissing! on your DataFrame, which will convert the column type to just T (i.e. remove the type union). The key thing is that the second argument to fit() should be of type Array{Float64,2}, rather than an array whose element type is a union including Missing.
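
A minimal sketch of that workflow, assuming the DataFrames package is installed. The two-column data frame is invented for illustration and stands in for data imported from Excel.

```julia
using DataFrames

# Columns that allow missing but do not contain any (illustrative data).
df = DataFrame(a = Union{Missing, Float64}[1.0, 2.0],
               b = Union{Missing, Float64}[3.0, 4.0])

eltype_before = eltype(df.a)   # Union{Missing, Float64}

# Tighten the column types in place; this errors if a missing value is present.
disallowmissing!(df)

eltype_after = eltype(df.a)    # Float64

# Convert to a plain matrix that can be passed on to fit.
X = Matrix(df)
```

Calling `disallowmissing!` before the conversion means the resulting matrix has element type `Float64` from the start, so no later narrowing step is needed.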

IMO that’s bad API design, the method should not rely on a strict element type.

Instead, it should work for all AbstractMatrix, and verify required conditions, eg that all the elements are <: Real. In the ideal case, eg with a Matrix{Float64}, this check is costless.

Cf the first two points here.

Just to be clear (as I’m currently grappling with a similar API decision for an econometric estimator, for which I probably don’t want to accept missing in the first instance), you are advocating for having the method fit(model::AbstractModelType, data::AbstractMatrix) and then run a suite of checks e.g. for the presence of missing in the function body, to then throw more specific errors rather than the MethodError?

Yes. I think that’s the right way of organizing the API.

In the ideal case, container types are concrete and tight, so that case should be fast, and the “check” is a no-op. But the code should work even if they aren’t, as long as the values themselves are valid.

Cf

julia> v = Any[1, 2, 3]
3-element Array{Any,1}:
 1
 2
 3

julia> sum(v)
6
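
The API style discussed above could look roughly like the following sketch. The name `validated_matrix` and the error messages are hypothetical and not part of MultivariateStats or any other package; the idea is only to accept any `AbstractMatrix`, check the values, and raise a specific error instead of a `MethodError`.

```julia
# Hypothetical helper: accept any matrix, validate the values, and return
# a Matrix{Float64} that an estimator could then work on.
function validated_matrix(X::AbstractMatrix)
    if any(ismissing, X)
        throw(ArgumentError("data contains missing values; drop or impute them first"))
    end
    if !all(x -> x isa Real, X)
        throw(ArgumentError("all elements must be real numbers"))
    end
    # For an input that already is a Matrix{Float64}, convert returns it unchanged.
    return convert(Matrix{Float64}, X)
end

loose = Any[1 2; 3 4]                  # loosely typed container, valid values
tight = validated_matrix(loose)        # 2×2 Matrix{Float64}

# A stored missing value leads to a descriptive ArgumentError.
bad = Union{Missing, Float64}[1.0 missing]
got_argument_error = try
    validated_matrix(bad)
    false
catch err
    err isa ArgumentError
end
```

This mirrors the `sum(v)` example: the loosely typed container is accepted because its values are valid, while genuinely invalid input fails with a message that names the problem.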