How do I compute a multiclass F-score using MLJ?

I received the above question in an email.


Here is my answer:

using MLJ # or just `using MLJBase`, which is enough

julia> y1 = categorical(rand("abc", 10)) # or `coerce(rand("abc", 10), Multiclass)`
10-element CategoricalArray{Char,1,UInt32}:
 'b'
 'a'
 'a'
 'a'
 'c'
 'b'
 'c'
 'c'
 'c'
 'b'

julia> y2 = categorical(rand("abc", 10))
10-element CategoricalArray{Char,1,UInt32}:
 'b'
 'c'
 'c'
 'c'
 'a'
 'b'
 'b'
 'b'
 'a'
 'c'

julia> measures("FScore")
2-element Vector{NamedTuple{(:name, :instances, :human_name, :target_scitype, :supports_weights, :supports_class_weights, :prediction_type, :orientation, :reports_each_observation, :aggregation, :is_feature_dependent, :docstring, :distribution_type), T} where T<:Tuple}:
 (name = FScore, instances = [f1score], ...)
 (name = MulticlassFScore, instances = [macro_f1score, micro_f1score, multiclass_f1score], ...)

julia> m = MulticlassFScore()
MulticlassFScore(β = 1.0,average = MLJBase.MacroAvg(),return_type = LittleDict)

julia> m(y1, y2)
0.19047619047619047

julia> macro_f1score(y1, y2)
0.19047619047619047

Here’s the docstring:

help?> MulticlassFScore
search: MulticlassFScore multiclass_f1score MulticlassFalseDiscoveryRate

  MulticlassFScore(; β=1.0, average=macro_avg, return_type=LittleDict)

  One-parameter generalization, F_β, of the F-measure or balanced F-score for multiclass
  observations.

  MulticlassFScore()(ŷ, y)
  MulticlassFScore()(ŷ, y, class_w)

  Evaluate the default score on multiclass observations, ŷ, given ground truth values, y.
  Options for average are: no_avg, macro_avg (default) and micro_avg. Options for
  return_type, applying in the no_avg case, are: LittleDict (default) or Vector. An
  optional AbstractDict, denoted class_w above, keyed on levels(y), specifies class
  weights. It applies if average=macro_avg or average=no_avg.

  For more information, run info(MulticlassFScore).

To get the weighted F1 score instead of the micro or macro averages, do we use class_w, with class_w[i] being the proportion of the i-th class in the dataset? Or is there already a parameter for that, e.g. like micro_avg?

I believe micro averages (average=micro_avg) take the relative class proportions into account, in the sense that the score can be high even if performance on a rare class is poor. So there is no need to manually compute and pass class_w for that (only average=macro_avg and average=no_avg support user-specified class weights). Is that what you are after?
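To see this concretely, here is a sketch (the toy labels below are made up; it assumes MLJBase and the micro_f1score/macro_f1score instances listed earlier) on an imbalanced example in which the rare class 'c' is never predicted:

```julia
using MLJBase  # provides categorical, micro_f1score, macro_f1score

# 'a' dominates; the single 'c' observation is misclassified as 'b':
y = categorical(collect("aaaaaaaabc"), levels=['a', 'b', 'c'])  # ground truth
ŷ = categorical(collect("aaaaaaaabb"), levels=['a', 'b', 'c'])  # predictions

micro_f1score(ŷ, y)  # pooled counts: stays high although 'c' is missed entirely
macro_f1score(ŷ, y)  # unweighted mean over classes: pulled down by 'c'
```

Since 9 of the 10 observations are correct, the micro score stays high, while the macro score drops because class 'c' contributes an F-score of zero.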

@ven-k may want to clarify.

The implementation details are approximately here.


If you want to access the per-class weighted FScore in the example above by @ablaom:

julia> m_no_avg = MulticlassFScore(average=no_avg)
MulticlassFScore(β = 1.0,average = MLJBase.NoAvg(),return_type = LittleDict)

julia> class_w = LittleDict('a' => 0.1, 'b' => 0.4, 'c' => 0.5)
LittleDict{Char, Float64, Vector{Char}, Vector{Float64}} with 3 entries:
  'a' => 0.1
  'b' => 0.4
  'c' => 0.5

julia> f1_no_avg = m_no_avg(y1, y2, class_w)
LittleDict{String, Float64, Vector{String}, Vector{Float64}} with 3 entries:
  "a" => 0.06
  "b" => 0.0
  "c" => 0.0

julia> m(y1, y2, class_w)  # macro_avg is the default; notice this is the same as mean(values(f1_no_avg))
0.02

MLJBase computes this family of multiclass scores as follows:

  • micro_avg → the multiclass counts MTP, MTN, … (M for Multiclass) are pooled across all classes, and MRecall, MFScore, … are then computed from the pooled counts
  • macro_avg → MTP, MTN, … are computed per class, MRecall, MFScore, … are computed per class, and the mean over classes is returned
  • no_avg → MTP, MTN, … are computed per class, and the per-class MRecall, MFScore, … are returned
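The arithmetic behind these three modes can be checked by hand. The following is a plain-Julia sketch of that scheme (a hand-rolled illustration, not MLJBase's actual implementation):

```julia
# Hand-rolled sketch of no_avg / macro_avg / micro_avg F1 (β = 1).
function f1_modes(ŷ, y, classes)
    f1(tp, fp, fn) = tp == 0 ? 0.0 : 2tp / (2tp + fp + fn)
    tps = [count(i -> ŷ[i] == c && y[i] == c, eachindex(y)) for c in classes]
    fps = [count(i -> ŷ[i] == c && y[i] != c, eachindex(y)) for c in classes]
    fns = [count(i -> ŷ[i] != c && y[i] == c, eachindex(y)) for c in classes]
    per_class = f1.(tps, fps, fns)                # no_avg: one score per class
    macro_avg = sum(per_class) / length(classes)  # macro_avg: mean of per-class scores
    micro_avg = f1(sum(tps), sum(fps), sum(fns))  # micro_avg: pool counts first
    return (; per_class, macro_avg, micro_avg)
end

ŷ = collect("aaaaaaaabb")
y = collect("aaaaaaaabc")
f1_modes(ŷ, y, ['a', 'b', 'c'])
```

On these labels, class 'a' scores 1.0, 'b' scores 2/3 and 'c' scores 0.0, so the macro average is 5/9 ≈ 0.56, while pooling the counts first gives a micro average of 0.9.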

Further, weighted scores are supported for both average=macro_avg and average=no_avg, to suit the broader design of the package and remain flexible (most scores return a vector, as it is convenient to check per-class values and apply an aggregation if necessary).

As per-class information is not available with micro_avg, the measure is promoted to macro_avg whenever class weights are passed.
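For example (a sketch assuming that promotion behavior; the labels and weights below are made up), supplying class weights should make the micro- and macro-averaged measures agree:

```julia
using MLJBase

y = categorical(collect("abcabcabca"))  # ground truth
ŷ = categorical(collect("abcbcaabca"))  # predictions
w = Dict('a' => 0.1, 'b' => 0.4, 'c' => 0.5)  # any AbstractDict keyed on levels(y)

m_micro = MulticlassFScore(average=micro_avg)
m_macro = MulticlassFScore(average=macro_avg)

# Passing class weights promotes micro_avg to macro_avg, so these should agree:
m_micro(ŷ, y, w)
m_macro(ŷ, y, w)
```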
