PCA not working

#1

Hello,
I’m using PCA from the MultivariateStats package and get the following error message (I’m working in Atom):

julia> PCA_Modell = fit(PCA, PCA_Daten)
ERROR: MethodError: no method matching fit(::Type{PCA}, ::Array{Union{Missing, Float64},2})
Closest candidates are:
  fit(::Type{Histogram}, ::Any...; kwargs...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\hist.jl:319
  fit(::StatisticalModel, ::Any...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\statmodels.jl:151
  fit(::Type{D<:Distribution}, ::Any) where D<:Distribution at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\Distributions\WHjOk\src\genericfit.jl:33
  ...
Stacktrace:
 [1] top-level scope at none:0

I’m trying to recreate an example that I did in R. In R I have to specify the expected number of components; according to the docs, that does not seem to be required here. The data I use comes from a DataFrame that I converted into an array:

julia> PCA_Daten
1599×11 Array{Union{Missing, Float64},2}:
  7.4  0.7    0.0   1.9  0.076  11.0  34.0  0.9978   3.51  0.56   9.4
  7.8  0.88   0.0   2.6  0.098  25.0  67.0  0.9968   3.2   0.68   9.8
  7.8  0.76

I hope it’s not too simple, but I’m grateful for any hint!

Thanks
Günter

#2

I think your problem might be missing values: can you drop them from your PCA_Daten array?

fit(PCA, rand(1_000, 10)) works fine, while

X = Array{Union{Missing, Float64}}(undef, 1_000, 10)
X[:,:].=rand(1_000,10)
fit(PCA, X)

produces the error you see (although X here doesn’t include missings anymore), so it seems related to the type of X.

#3

Thanks for your quick answer. The underlying DataFrame does not contain any missing values:

11×5 DataFrame
│ Row │ variable             │ eltype   │ nmissing │ first   │ last    │
│     │ Symbol               │ DataType │ Int64    │ Float64 │ Float64 │
├─────┼──────────────────────┼──────────┼──────────┼─────────┼─────────┤
│ 1   │ fixed_acidity        │ Float64  │ 0        │ 7.4     │ 6.0     │
│ 2   │ volatile_acidity     │ Float64  │ 0        │ 0.7     │ 0.31    │
│ 3   │ citric_acid          │ Float64  │ 0        │ 0.0     │ 0.47    │
│ 4   │ residual_sugar       │ Float64  │ 0        │ 1.9     │ 3.6     │
│ 5   │ chlorides            │ Float64  │ 0        │ 0.076   │ 0.067   │
│ 6   │ free_sulfur_dioxide  │ Float64  │ 0        │ 11.0    │ 18.0    │
│ 7   │ total_sulfur_dioxide │ Float64  │ 0        │ 34.0    │ 42.0    │
│ 8   │ density              │ Float64  │ 0        │ 0.9978  │ 0.99549 │
│ 9   │ pH                   │ Float64  │ 0        │ 3.51    │ 3.39    │
│ 10  │ sulphates            │ Float64  │ 0        │ 0.56    │ 0.66    │
│ 11  │ alcohol              │ Float64  │ 0        │ 9.4     │ 11.0    │

I regularly see this type

Array{Union{Missing, ...

when I work with DataFrames, which I always do, because my data source is Excel.

Even if I check for missing values, the check shows that there are no missing values (do I use the check function correctly?):

julia> isequal(missing, PCA_Daten)
false

Do you have another clue? Thank you!
Regards,
Günter

#4

The documentation for describe could be improved, but it does contain this hint:

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.

I recognize that this is confusing and have filed an issue in DataFrames here.
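
On the check used in the previous post: isequal(missing, PCA_Daten) compares the whole array with the single value missing, so it is false no matter what the array holds. A minimal sketch of an element-wise check, using only Base and a small made-up matrix in place of PCA_Daten:

```julia
# Small stand-in for PCA_Daten (values are illustrative only).
X = Union{Missing, Float64}[7.4 0.7; 7.8 0.88]

whole_array_check = isequal(missing, X)  # false: an array is never equal to `missing`
contains_missing  = any(ismissing, X)    # false: no element is missing
n_missing         = count(ismissing, X)  # 0

# The element type still *allows* missing, which is what the error message shows.
allows_missing = Missing <: eltype(X)    # true
```

So "allows missing" is a property of the array type, while "contains missing" is a property of the values, and the two can differ.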

#5

You can use disallowmissing!.
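
A minimal sketch of that, assuming the DataFrames package is installed and using a tiny made-up DataFrame (the column names a and b are illustrative). Matrix(df) is the conversion in recent DataFrames versions; older releases may need a different call.

```julia
using DataFrames  # assumption: DataFrames is installed

# Columns that allow missing but do not contain any.
df = DataFrame(a = Union{Missing, Float64}[1.0, 2.0, 3.0],
               b = Union{Missing, Float64}[4.0, 5.0, 6.0])
eltype_before = eltype(df.a)   # Union{Missing, Float64}

disallowmissing!(df)           # raises an error if a column really contains missing
eltype_after = eltype(df.a)    # Float64

X = Matrix(df)                 # plain Float64 matrix, suitable as input to fit
```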

#6

Thank you for this clarification! There are also no missing values, so it fits! :grinning:

#7

Thanks for the hint! As a data analyst, I’m very interested in knowing if the record contains missing values. Then I have this info and can note in the report that “x missing values are present in the record”. For my own analysis I have to know how to deal with missing values. Therefore I am pleased about every clear and meaningful Julia message! :man_office_worker:

#8

Maybe I don’t understand how functions work here. For example, if I want to perform factor analysis, I get the same error message:

julia> FA_Modell = fit(FactorAnalysis, FA_Daten)
ERROR: MethodError: no method matching fit(::Type{FactorAnalysis}, ::Array{Union{Missing, Float64},2})
Closest candidates are:
  fit(::Type{Histogram}, ::Any...; kwargs...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\hist.jl:319
  fit(::StatisticalModel, ::Any...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\statmodels.jl:151
  fit(::Type{D<:Distribution}, ::Any) where D<:Distribution at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\Distributions\WHjOk\src\genericfit.jl:33
  ...
Stacktrace:
 [1] top-level scope at none:0

It’s the same data and, if I interpret the message correctly, no matching fit method is found, right?

What am I doing wrong, what’s wrong? I can’t imagine being the first to use these features since the last release…

I am grateful for any hint,
Günter

#10

You need to transform your X array to a type without missing (the error message indicates that fit cannot handle the type Array{Union{Missing, Float64},2}). This Union type only indicates that it is possible to have missing values.
If you do not have any missing values in your data, you can simply use X = coalesce.(X) to get a plain Array{Float64,2} and pass it to fit.
If you do indeed have missing data, you will need to decide what to do with it (drop it, replace it with other values, etc.).
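
A short sketch of both cases, using only Base and small made-up matrices. Filling with 0.0 and dropping incomplete rows are just two illustrative choices, not a recommendation for any particular data set.

```julia
# Case 1: the type allows missing, but no value is missing.
X = Union{Missing, Float64}[1.0 2.0; 3.0 4.0]
Y = coalesce.(X)             # element type narrows to Float64

# Case 2: a value really is missing.
Z = Union{Missing, Float64}[1.0 missing; 3.0 4.0]

# Option a: replace missing with a default value (0.0 is an arbitrary choice).
Zfilled = coalesce.(Z, 0.0)

# Option b: keep only the rows without any missing value.
keep = vec(.!any(ismissing.(Z), dims=2))
Zcomplete = convert(Matrix{Float64}, Z[keep, :])
```

Note that coalesce.(Z) with no default would leave the missing in place, so the result would still have the Union element type.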

#11

Thanks for this hint, it works! The array does not contain any missing values, but the coalesce function “removes” Missing from the array’s element type.

Thank you very much!
Regards,
Günter

#12

I still think this is all about the type of your second argument: the error message tells you that there is no method of the function fit() that takes an Array{Union{Missing, Float64},2} as an argument. The important thing to consider here is that in Julia, arrays that can hold missing values have a different type than those that cannot: their element type is the type union Union{Missing, T} (note the capitalised Missing, which is the type of the value missing), where T is the type of the non-missing values.

When you read in the DataFrame, the DataFrames package automatically generates columns with element type Union{Missing, T} to accommodate potential missings in your data set. If you don’t have any, you can call disallowmissing! on your DataFrame, which will convert the element type to just T (i.e. remove the type union). The key thing is that the second argument to fit() should be of type Array{Float64, 2}, rather than an array whose element type is a union including Missing.
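
To illustrate how the element type decides whether a method matches, here is a small sketch using only Base. strict_fit is a hypothetical stand-in with a made-up signature; it is not the actual signature of fit in MultivariateStats.

```julia
# Hypothetical function that only accepts matrices of real numbers.
strict_fit(X::AbstractMatrix{<:Real}) = size(X)

A = rand(5, 3)                                   # Matrix{Float64}
B = convert(Matrix{Union{Missing, Float64}}, A)  # same values, wider element type

plain_ok = Float64 <: Real                  # true
union_ok = Union{Missing, Float64} <: Real  # false, because Missing is not a Real

matches_A = applicable(strict_fit, A)       # true
matches_B = applicable(strict_fit, B)       # false: strict_fit(B) raises a MethodError

# Converting back works here because B holds no missing values.
C = convert(Matrix{Float64}, B)
matches_C = applicable(strict_fit, C)       # true
```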

#13

IMO that’s bad API design; the method should not rely on a strict element type.

Instead, it should work for all AbstractMatrix, and verify required conditions, e.g. that all the elements are <: Real. In the ideal case, e.g. with a Matrix{Float64}, this check is costless.

Cf the first two points here.

#14

Just to be clear (as I’m currently grappling with a similar API decision for an econometric estimator, for which I probably don’t want to accept missing in the first instance), you are advocating for having the method fit(model::AbstractModelType, data::AbstractMatrix) and then running a suite of checks, e.g. for the presence of missing, in the function body, to then throw more specific errors rather than the MethodError?

#15

Yes. I think that’s the right way of organizing the API.

In the ideal case, container types are concrete and tight, so that case should be fast, and the “check” is a no-op. But the code should work even if they aren’t, as long as the values themselves are valid.

Cf

julia> v = Any[1, 2, 3]
3-element Array{Any,1}:
 1
 2
 3

julia> sum(v)
6
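
A rough sketch of the design discussed in the last few posts, using only Base. loose_fit is a hypothetical name, the error messages are made up, and the "model" is just a placeholder that returns column means.

```julia
# Hypothetical entry point: accept any matrix, validate the values, then convert.
function loose_fit(X::AbstractMatrix)
    any(ismissing, X) &&
        throw(ArgumentError("data contains missing values; drop or replace them first"))
    all(x -> x isa Real, X) ||
        throw(ArgumentError("all elements must be real numbers"))
    A = convert(Matrix{Float64}, X)           # returns X itself if it already is a Matrix{Float64}
    return vec(sum(A, dims=1)) ./ size(A, 1)  # placeholder result: column means
end

means_any   = loose_fit(Any[1 2; 3 4])                           # loosely typed container, valid values
means_union = loose_fit(Union{Missing, Float64}[1.0 2.0; 3.0 4.0]) # allows missing, contains none
```

With an array that really contains missing, the caller gets an ArgumentError that names the problem instead of a MethodError.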