Extracting values from UnivariateFinite

So MLJ gave me, as its output, a UnivariateFinite data structure; I have:

UnivariateFinite{OrderedFactor{2}}(0=>0.937, 1=>0.0626)

Is there a way I can extract, as an array, the probabilities from this distribution (other than manually copying and pasting)?

Maybe:

julia> x = UnivariateFinite([0, 1], [0.9, 0.1])
┌ Warning: No `CategoricalValue` found from which to extract a complete pool of classes. Creating a new pool (ordered=false). You can:
│  (i) specify `pool=missing` to suppress this warning; or
│  (ii) use an existing pool by specifying `pool=c` where `c` is a `CategoricalArray`, `CategoricalValue` or CategoricalPool`.
│ In case (i) specify `ordered=true` if samples are to be `OrderedFactor`. 
└ @ MLJBase ~/.julia/packages/MLJBase/AkJde/src/univariate_finite/types.jl:262
UnivariateFinite{Multiclass{2}}(0=>0.9, 1=>0.1)

julia> x.prob_given_ref
OrderedCollections.LittleDict{UInt8, Float64, Vector{UInt8}, Vector{Float64}} with 2 entries:
  0x01 => 0.9
  0x02 => 0.1

@nilshg’s suggestion will work but is not recommended as this is not part of the public API.

Accessing the probabilities is described in the Working with Categorical Data section of the manual (see also this section of “Getting Started”). Here are some more examples:

julia> y = coerce(["c", "b", "a"], Multiclass)
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "c"
 "b"
 "a"

julia> d = UnivariateFinite(["a", "c"], [0.1, 0.9], pool=y)
UnivariateFinite{Multiclass{3}}(a=>0.1, c=>0.9)

julia> pdf(d, "a")
0.1

julia> pdf(d, levels(y))
3-element Vector{Float64}:
 0.1
 0.0
 0.9

And for a vector of distributions:

julia> d_vector = UnivariateFinite(["a", "b"], [0.1 0.9; 0.4 0.6], pool=missing)
2-element MLJBase.UnivariateFiniteVector{Multiclass{2}, String, UInt8, Float64}:
 UnivariateFinite{Multiclass{2}}(a=>0.1, b=>0.9)
 UnivariateFinite{Multiclass{2}}(a=>0.4, b=>0.6)

julia> broadcast(pdf, d_vector, "a")
2-element Vector{Float64}:
 0.1
 0.4

julia> pdf(d_vector, ["a", "b"])
2×2 Matrix{Float64}:
 0.1  0.9
 0.4  0.6

julia> pdf(d_vector, ["b", "a"])
2×2 Matrix{Float64}:
 0.9  0.1
 0.6  0.4
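Applied to a two-class ordered case like the one in the original question, a minimal sketch might look as follows. The distribution here is a stand-in built by hand (the question only shows rounded probabilities), and the names `y`, `d` and `probs` are made up for illustration; with a real model, `d` would be one element of the prediction vector.

```julia
using MLJ

# Stand-in for the target: an ordered two-class pool with integer labels 0 and 1.
y = coerce([0, 1], OrderedFactor)

# Hand-built stand-in for a single prediction (values are illustrative only).
d = UnivariateFinite([0, 1], [0.937, 0.063], pool=y)

# Evaluate the pdf at each class of the pool, in the order the classes appear in `y`.
# Each `c` is a categorical element, so this relies only on the public `pdf` method.
probs = [pdf(d, c) for c in y]
```

Here `probs` is an ordinary `Vector{Float64}`, with one entry per class in `y`.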

In a basic MLJ workflow you shouldn’t really need the probabilities in matrix form. For example, all probabilistic measures in MLJ (e.g., LogLoss()) expect distributions as their first argument, not numerical probabilities or parameters:

julia> y = coerce(rand(["a", "b"], 10), OrderedFactor)
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "b"
 "b"
 "b"
 "a"
 "b"
 "a"
 "a"
 "a"

julia> yhat = UnivariateFinite(["a", "b"], rand(10), augment=true, pool=y)
10-element MLJBase.UnivariateFiniteVector{OrderedFactor{2}, String, UInt32, Float64}:
 UnivariateFinite{OrderedFactor{2}}(a=>0.863, b=>0.137)
 UnivariateFinite{OrderedFactor{2}}(a=>0.995, b=>0.00547)
 UnivariateFinite{OrderedFactor{2}}(a=>0.0523, b=>0.948)
 UnivariateFinite{OrderedFactor{2}}(a=>0.859, b=>0.141)
 UnivariateFinite{OrderedFactor{2}}(a=>0.216, b=>0.784)
 UnivariateFinite{OrderedFactor{2}}(a=>0.277, b=>0.723)
 UnivariateFinite{OrderedFactor{2}}(a=>0.985, b=>0.0148)
 UnivariateFinite{OrderedFactor{2}}(a=>0.206, b=>0.794)
 UnivariateFinite{OrderedFactor{2}}(a=>0.373, b=>0.627)
 UnivariateFinite{OrderedFactor{2}}(a=>0.553, b=>0.447)

julia> LogLoss()(yhat, y)
10-element Vector{Float64}:
 0.14702211036373602
 5.20855707692713
 0.053705325134450276
 1.9585781504187798
 0.24387484194022874
 1.2835655183125487
 4.210745321783671
 1.5783092501620064
 0.9856382706229556
 0.5921152165139617

Hope this helps!


@gideonsimpson P. S. Be good if you could add “mlj” to the tags, thanks.

Thanks for clarifying, I’ve added the tag.

@ablaom for the set-valued predictions for ConformalPrediction.jl I’m thinking about going with your approach above:

d = UnivariateFinite(["no", "yes"], [0.8, 0.05], pool=missing)

where in this case missing would indicate that the corresponding labels are not part of the prediction set. My concern is that probabilities of labels that make it into the set typically don’t sum up to 1 (see below). Is this a problem?

Edit: I just saw here that there is no strict enforcement that probabilities sum up to one, so this seems to be OK? Relatedly though, conformal prediction sets may be empty in some cases, which I don’t think is supported by CategoricalDistributions.jl?

As a reminder, conformal prediction sets look like this:

Image source: Angelopoulos and Bates (2022)

This sounds reasonable. Yes, probabilities need not sum to one.

However, I’m inclined to include the entire pool, if this can possibly work. Excluded classes would simply have probability zero, or maybe missing, which I don’t think is currently supported but probably could be. The problem is that ordinary vectors of UnivariateFinite instances are slow to work with for large datasets. For this reason, we have UnivariateFiniteArray{..., N} <: AbstractArray{<:UnivariateFinite{...}, N}. Normally one does not construct a UnivariateFinite array element by element, but constructs a UnivariateFiniteArray object all at once (the constructor UnivariateFinite is used for both elements and arrays; see the detailed docstring). In your proposal, this would not be possible, because each element would have a different pool.
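A minimal sketch of the whole-pool idea, built all at once from a matrix. The labels and scores below are made up for illustration, and the sketch assumes the array constructor is as permissive about rows not summing to one as the single-distribution constructor. A zero marks a class excluded from that observation's prediction set, so an all-zero row stands for an empty set.

```julia
using MLJ

labels = ["a", "b", "c"]   # the full pool of classes (hypothetical)

# One row per observation, one column per class; values are illustrative only.
scores = [0.7 0.2 0.0;     # prediction set {a, b}
          0.0 0.0 0.0;     # empty prediction set
          0.3 0.3 0.3]     # prediction set {a, b, c}; the row need not sum to one

# A single call builds the whole array, so every element shares the same pool.
sets = UnivariateFinite(labels, scores, pool=missing)

# Recover the scores as a matrix, with columns in the order given by `labels`.
recovered = pdf(sets, labels)

# Scores for one class across all observations.
scores_c = broadcast(pdf, sets, "c")
```

Whether zero is the right encoding for "not in the set" depends on how the predictions are consumed downstream, since a zero score is then indistinguishable from an excluded class.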

Ultimately a decision on the form of these predictions ought to be predicated on how they will be used downstream in the workflow, for example for performance estimation. Can you describe any metric that operates on these predictions?


@ablaom I’m afraid I haven’t really got to that stage yet, but in principle the tutorial I linked above discusses evaluation from page eleven onwards.

I think figuring out downstream evaluation for these set-valued predictions is definitely a bit of a bigger project, which I won’t get to any time soon (have some other PhD-related commitments in the coming weeks). When I get back to this, perhaps it would be a good idea to have another chat to talk this through. In the meantime, I’ll open a discussion on MLJ as suggested last time we spoke.

Thanks again!