Extracting values from UnivariateFinite

gideonsimpson · June 12, 2021, 1:28pm

So MLJ gave me, as its output, a UnivariateFinite data structure; I have:

UnivariateFinite{OrderedFactor{2}}(0=>0.937, 1=>0.0626)

Is there a way I can extract, as an array, the probabilities from this distribution (other than manually copying and pasting)?

nilshg · June 13, 2021, 6:52am

Maybe:

julia> x = UnivariateFinite([0, 1], [0.9, 0.1])
┌ Warning: No `CategoricalValue` found from which to extract a complete pool of classes. Creating a new pool (ordered=false). You can:
│  (i) specify `pool=missing` to suppress this warning; or
│  (ii) use an existing pool by specifying `pool=c` where `c` is a `CategoricalArray`, `CategoricalValue` or CategoricalPool`.
│ In case (i) specify `ordered=true` if samples are to be `OrderedFactor`. 
└ @ MLJBase ~/.julia/packages/MLJBase/AkJde/src/univariate_finite/types.jl:262
UnivariateFinite{Multiclass{2}}(0=>0.9, 1=>0.1)

julia> x.prob_given_ref
OrderedCollections.LittleDict{UInt8, Float64, Vector{UInt8}, Vector{Float64}} with 2 entries:
  0x01 => 0.9
  0x02 => 0.1

ablaom · June 14, 2021, 8:50pm

@nilshg’s suggestion will work but is not recommended as this is not part of the public API.

Accessing the probabilities is described in the Working with Categorical Data section of the manual (see also this section of “Getting Started”). Here are some more examples:

julia> y = coerce(["c", "b", "a"], Multiclass)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "c"
 "b"
 "a"

julia> d = UnivariateFinite(["a", "c"], [0.1, 0.9], pool=y)
UnivariateFinite{Multiclass{3}}(a=>0.1, c=>0.9)

julia> pdf(d, "a")
0.1

julia> pdf(d, levels(y))
3-element Vector{Float64}:
 0.1
 0.0
 0.9
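For the integer-labelled distribution constructed earlier in the thread, the same public `pdf` function can stand in for reaching into `prob_given_ref`. The following is only a minimal sketch: it assumes `using MLJ` brings `UnivariateFinite` and `pdf` into scope, and it passes `pool=missing` (as the warning above suggests) to silence the pool warning.

```julia
using MLJ  # assumed to export UnivariateFinite and pdf

# Same distribution as in the earlier reply; pool=missing suppresses the warning.
x = UnivariateFinite([0, 1], [0.9, 0.1], pool=missing)

# Query the probability of each class label through the public API.
labels = [0, 1]
probs = [pdf(x, c) for c in labels]
```

Here `probs` is an ordinary `Vector{Float64}`, ordered like `labels`, which is what the original question asked for.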

And for a vector of distributions:

julia> d_vector = UnivariateFinite(["a", "b"], [0.1 0.9; 0.4 0.6], pool=missing)
2-element MLJBase.UnivariateFiniteVector{Multiclass{2}, String, UInt8, Float64}:
 UnivariateFinite{Multiclass{2}}(a=>0.1, b=>0.9)
 UnivariateFinite{Multiclass{2}}(a=>0.4, b=>0.6)

julia> broadcast(pdf, d_vector, "a")
2-element Vector{Float64}:
 0.1
 0.4

julia> pdf(d_vector, ["a", "b"])
2×2 Matrix{Float64}:
 0.1  0.9
 0.4  0.6

julia> pdf(d_vector, ["b", "a"])
2×2 Matrix{Float64}:
 0.9  0.1
 0.6  0.4
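To make the relationship between the two forms explicit, here is a small sketch (again assuming `using MLJ` provides `UnivariateFinite` and `pdf`) that checks that each column of the matrix form matches the broadcast form for the corresponding label. The names `labels`, `P`, and `consistent` are introduced here purely for illustration.

```julia
using MLJ  # assumed to export UnivariateFinite and pdf

d_vector = UnivariateFinite(["a", "b"], [0.1 0.9; 0.4 0.6], pool=missing)

labels = ["a", "b"]
P = pdf(d_vector, labels)  # one row per distribution, one column per label

# Column j of P should agree with broadcasting pdf over the vector for labels[j].
consistent = all(P[:, j] ≈ broadcast(pdf, d_vector, labels[j]) for j in eachindex(labels))
```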

In a basic MLJ workflow you shouldn’t really need the probabilities in matrix form. For example, all probabilistic measures in MLJ (e.g., LogLoss()) expect distributions as the first argument, not numerical probabilities or parameters:

julia> y = coerce(rand(["a", "b"], 10), OrderedFactor)
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "b"
 "b"
 "b"
 "a"
 "b"
 "a"
 "a"
 "a"

julia> yhat = UnivariateFinite(["a", "b"], rand(10), augment=true, pool=y)
10-element MLJBase.UnivariateFiniteVector{OrderedFactor{2}, String, UInt32, Float64}:
 UnivariateFinite{OrderedFactor{2}}(a=>0.863, b=>0.137)
 UnivariateFinite{OrderedFactor{2}}(a=>0.995, b=>0.00547)
 UnivariateFinite{OrderedFactor{2}}(a=>0.0523, b=>0.948)
 UnivariateFinite{OrderedFactor{2}}(a=>0.859, b=>0.141)
 UnivariateFinite{OrderedFactor{2}}(a=>0.216, b=>0.784)
 UnivariateFinite{OrderedFactor{2}}(a=>0.277, b=>0.723)
 UnivariateFinite{OrderedFactor{2}}(a=>0.985, b=>0.0148)
 UnivariateFinite{OrderedFactor{2}}(a=>0.206, b=>0.794)
 UnivariateFinite{OrderedFactor{2}}(a=>0.373, b=>0.627)
 UnivariateFinite{OrderedFactor{2}}(a=>0.553, b=>0.447)

julia> LogLoss()(yhat, y)
10-element Vector{Float64}:
 0.14702211036373602
 5.20855707692713
 0.053705325134450276
 1.9585781504187798
 0.24387484194022874
 1.2835655183125487
 4.210745321783671
 1.5783092501620064
 0.9856382706229556
 0.5921152165139617

Hope this helps!

ablaom · June 14, 2021, 9:01pm

@gideonsimpson P.S. It would be good if you could add “mlj” to the tags, thanks.

nilshg · June 15, 2021, 5:19am

Thanks for clarifying, I’ve added the tag.

pat-alt · October 25, 2022, 5:57am

@ablaom for the set-valued predictions for ConformalPrediction.jl I’m thinking about going with your approach above,

d = UnivariateFinite(["no", "yes"], [0.8, 0.05], pool=missing)

where in this case missing would indicate that the corresponding labels are not part of the prediction set. My concern is that probabilities of labels that make it into the set typically don’t sum up to 1 (see below). Is this a problem?

Edit: I just saw here that there is no strict enforcement that probabilities sum up to one, so this seems to be OK? Relatedly though, conformal prediction sets may be empty in some cases, which I don’t think is supported by CategoricalDistributions.jl?

As a reminder, conformal prediction sets look like this:

[Image: conformal prediction sets. Source: Angelopoulos and Bates (2022)]

ablaom · October 25, 2022, 11:18pm

This sounds reasonable. Yes, probabilities need not sum to one.

However, I’m inclined to include the entire pool, if this can possibly work (excluded classes simply have probability zero, or maybe missing, which I don’t think is currently supported but probably could be). The problem is that ordinary vectors of UnivariateFinite instances are slow to work with for large datasets. For this reason, we have UnivariateFiniteArray{..., N} <: AbstractArray{<:UnivariateFinite{...}, N}. Normally one does not construct a UnivariateFinite array element by element but rather a UnivariateFiniteArray object all at once (the constructor UnivariateFinite is used for both elements and arrays; see the detailed docstring). In your proposal, this would not be possible, because each element would have a different pool.
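As a rough illustration of the "entire pool, zero for excluded classes" idea, the sketch below builds the whole array in one call and then reads the prediction sets back off the probability matrix. The class labels and numbers are made up for illustration, `prediction_sets` is a hypothetical name, and `using MLJ` is assumed to provide `UnivariateFinite` and `pdf`; it is not a proposal for how ConformalPrediction.jl should represent its output.

```julia
using MLJ  # assumed to export UnivariateFinite and pdf

labels = ["no", "yes", "maybe"]  # hypothetical full pool of classes

# One row per observation, one column per class; a zero marks a class
# excluded from that observation's prediction set (illustrative numbers).
probs = [0.8 0.2 0.0;
         0.0 0.6 0.4]

# Build the whole array in one call, so every element shares the same pool.
d_vector = UnivariateFinite(labels, probs, pool=missing)

# Recover each prediction set as the labels with nonzero probability.
P = pdf(d_vector, labels)
prediction_sets = [labels[row .> 0] for row in eachrow(P)]
```

Because every element shares one pool, the result stays a single array-backed object rather than a plain vector of separately constructed distributions.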

Ultimately a decision on the form of these predictions ought to be predicated on how they will be used downstream in the workflow, for example for performance estimation. Can you describe any metric that operates on these predictions?

pat-alt · October 27, 2022, 8:08am

@ablaom I’m afraid I haven’t really got to that stage yet, but in principle the tutorial I linked above discusses evaluation from page eleven onwards.

I think figuring out downstream evaluation for these set-valued predictions is definitely a bit of a bigger project, which I won’t get to any time soon (have some other PhD-related commitments in the coming weeks). When I get back to this, perhaps it would be a good idea to have another chat to talk this through. In the meantime, I’ll open a discussion on MLJ as suggested last time we spoke.

Thanks again!

math4mad · July 7, 2023, 6:51pm

Thanks for your work.
I finally found a way to plot the decision boundary as a contour:

# from Probabilistic Machine Learning, fig. 2.13
# mach, x_test, and nums are defined elsewhere (not shown)
probs = predict(mach, x_test) |> Array
probs_res = reshape(round.(pdf.(probs, "versicolor"), digits=2), nums, nums)
