Hello,
I’m working on the PCA from the MultivariateStats package and get the following error message (I’m working with ATOM):
julia> PCA_Modell = fit(PCA, PCA_Daten)
ERROR: MethodError: no method matching fit(::Type{PCA}, ::Array{Union{Missing, Float64},2})
Closest candidates are:
fit(::Type{Histogram}, ::Any...; kwargs...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\hist.jl:319
fit(::StatisticalModel, ::Any...) at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\StatsBase\yNoh6\src\statmodels.jl:151
fit(::Type{D<:Distribution}, ::Any) where D<:Distribution at C:\Users\guent\.juliapro\JuliaPro_v1.0.3.1\packages\Distributions\WHjOk\src\genericfit.jl:33
...
Stacktrace:
[1] top-level scope at none:0
I’m trying to recreate an example that I did with R. In R I have to specify the expected number of components. As far as I can tell from the docs, that is not required here. The data I use comes from a DataFrame which I have converted into an array:
The documentation for describe could be improved, but it does contain this hint:
Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.
I recognize that this is confusing, and I have filed an issue in DataFrames here.
Thanks for the hint! As a data analyst, I’m very interested in knowing if the record contains missing values. Then I have this info and can note in the report that “x missing values are present in the record”. For my own analysis I have to know how to deal with missing values. Therefore I am pleased about every clear and meaningful Julia message!
You need to convert your X array to a type without Missing (the error message indicates that fit cannot handle the type Array{Union{Missing, Float64},2}). This Union type only indicates that missing values are possible, not that any are actually present.
If you do not have any missing values in your data, you can simply use X = coalesce.(X) to get a plain Array{Float64,2} and pass it to fit.
If you have indeed missing data, you will need to decide what to do with them (drop them, replace them with other values etc…)
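The options above can be sketched with Base Julia alone. The matrices `X` and `Y` below are made-up toy data, and the fill value `0.0` and the drop-incomplete-rows policy are only illustrations of possible choices, not recommendations:

```julia
# Toy matrix whose element type allows missing but which contains none.
X = Union{Missing, Float64}[1.0 2.0; 3.0 4.0; 5.0 6.0]

# Count the missing entries before deciding how to handle them.
n_missing = count(ismissing, X)

# With no missing values present, broadcasting narrows the element type
# to what is actually stored. Otherwise, fill with an (arbitrary) value.
X_clean = n_missing == 0 ? coalesce.(X) : coalesce.(X, 0.0)

# Toy matrix that really does contain a missing value.
Y = Union{Missing, Float64}[1.0 missing; 3.0 4.0]

# One possible policy: keep only the rows without any missing entry.
keep = [!any(ismissing, Y[i, :]) for i in 1:size(Y, 1)]
Y_complete = convert(Matrix{Float64}, Y[keep, :])
```

After this, both `X_clean` and `Y_complete` have element type Float64 and can be passed to functions that do not accept Missing.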
I still think this is all about the type of your second argument: the error message tells you that there is no method for the function fit() which takes an Array{Union{Missing, Float64},2} as an argument. The important thing to consider here is that in Julia, arrays that can contain missing values are of a different type than those that cannot: their element type is Union{Missing, T} (note the capitalised Missing, which is the type of the value missing), where T is the type of the non-missing values.
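To make the dispatch issue concrete, here is a minimal sketch. `strict_fit` is a hypothetical stand-in for a method restricted to Matrix{Float64}; it is not the actual signature used by MultivariateStats:

```julia
# Hypothetical method that only accepts a Matrix{Float64}.
strict_fit(X::Matrix{Float64}) = size(X)

A = [1.0 2.0; 3.0 4.0]                          # Matrix{Float64}
B = Union{Missing, Float64}[1.0 2.0; 3.0 4.0]   # same values, wider element type

# Type parameters are invariant, so B is not a Matrix{Float64},
# even though every value stored in it is a Float64.
is_subtype = typeof(B) <: Matrix{Float64}

# Calling the strict method with B therefore fails at dispatch.
err = try
    strict_fit(B)
catch e
    e
end

# After an explicit conversion the call goes through.
result = strict_fit(convert(Matrix{Float64}, B))
```

Here `err` holds a MethodError of the same kind as the one in the original post, while the converted call succeeds.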
When you read in the DataFrame, the DataFrames package automatically generates columns with element type Union{Missing, T} to accommodate potential missings in your data set. If your data has no missing values, you can call disallowmissing! on your DataFrame, which will convert the column type to just T (i.e. remove the type union). The key thing is that the second argument to fit() should be of type Array{Float64, 2}, rather than an array whose element type is a union including Missing.
IMO that’s bad API design; the method should not rely on a strict element type.
Instead, it should work for all AbstractMatrix, and verify required conditions, e.g. that all the elements are <: Real. In the ideal case, e.g. with a Matrix{Float64}, this check is costless.
Just to be clear (as I’m currently grappling with a similar API decision for an econometric estimator, for which I probably don’t want to accept missing in the first instance), you are advocating for having the method fit(model::AbstractModelType, data::AbstractMatrix) and then running a suite of checks in the function body, e.g. for the presence of missing, so as to throw more specific errors rather than the MethodError?
Yes. I think that’s the right way of organizing the API.
In the ideal case, container types are concrete and tight, so that case should be fast, and the “check” is a no-op. But the code should work even if they aren’t, as long as the values themselves are valid.
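A rough sketch of that design, under stated assumptions: `checked_fit` is a hypothetical name, the column-mean computation is only a placeholder for a real estimator, and the error messages are invented for illustration.

```julia
# Hypothetical entry point: accept any AbstractMatrix, validate the values,
# and raise a descriptive error instead of relying on a MethodError.
function checked_fit(X::AbstractMatrix)
    any(ismissing, X) &&
        throw(ArgumentError("data contains missing values; drop or impute them first"))
    all(x -> x isa Real, X) ||
        throw(ArgumentError("all elements must be real numbers"))

    # For a Matrix{Float64} this conversion returns the input unchanged.
    Xf = convert(Matrix{Float64}, X)

    # Placeholder for the actual estimation: column means.
    return vec(sum(Xf, dims=1)) ./ size(Xf, 1)
end

# Works for a tight container and for a loosely typed one with valid values.
means_tight = checked_fit([1.0 2.0; 3.0 4.0])
means_loose = checked_fit(Union{Missing, Float64}[1.0 2.0; 3.0 4.0])
```

For a concretely typed Matrix{Float64}, the compiler can in principle elide both checks, since no element can be missing or non-real; for looser containers, the checks run once per element and the caller gets an ArgumentError that names the actual problem.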