Elegant value classification following a set of rules

I have some text labels for different ranges.

For example, “A” if x is between 0 and 1, “B” if x is between 1 and 2, and “C” otherwise.

I would like to find a neat way to store such information, and then have a function that retrieves the label for a given value of x.

I’ve tried with a dict:

d = Dict(
    "A" => 0,
    "B" => 1,
    "C" => 2,
    "D" => 3
    )

Then I managed to collect the keys and values, but something tells me this might not be the best way to do it.

Any ideas?

You need to specify the problem more precisely. Just a few questions off the top of my head:

  • Are floating point ranges allowed? Your example only has integers.
  • Can there be gaps between the ranges? Can they overlap?
  • Is there a rule for what happens at the boundaries/overlaps, or can you return an arbitrary adjacent range? For example, what should your function return given an input of exactly 1?
  • Can the boundaries be hard-coded? If not, will they be supplied sorted?

Some details:

  • Type: yes there would be floats
  • Coverage: no gaps nor overlap between ranges
  • Boundaries: What happens at boundaries is a good question. I believe there could be a parameter indicating the “direction” of evaluation. For example, let’s imagine the boundaries are 0, 0.1, 0.3 and 0.5.
    • In “ascending” order it would be: 0 <= x < 0.1; 0.1 <= x < 0.3; 0.3 <= x < 0.5 and x >= 0.5
    • In “descending” order it would be: 0 < x <= 0.1; 0.1 < x <= 0.3; 0.3 < x <= 0.5 and x > 0.5

In fact, I would like to implement a system that uses rule-of-thumb heuristics for interpreting metrics. For example, there is the infamous grid for Cohen’s d (1999), proposing that 0-0.2 = “negligible”, 0.2-0.5 = “small”, 0.5-0.8 = “medium” and >0.8 = “large”.
I would like a system to conveniently store such grids (I thought of Dict at first) and a function that takes a value and such a “rule set” and returns the correct label.

If you store labels and breakpoints as arrays then you can immediately take advantage of the searchsortedlast() function.

struct LabelledRanges{T<:Real}
    breakpoints::AbstractArray{T}
    labels::AbstractArray
    isdesc::Bool

    # Inner constructor: as an outer constructor, these checks would be bypassed,
    # because the default constructor is the more specific method.
    function LabelledRanges(breakpoints::AbstractArray{T}, labels::AbstractArray, isdesc::Bool=true) where {T<:Real}
        if length(breakpoints) != length(labels) - 1
            error("There must be exactly one more label than breakpoints.")
        elseif !issorted(breakpoints)
            error("Breakpoints must be sorted.")
        end
        new{T}(breakpoints, labels, isdesc)
    end
end

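# isdesc = true: upper bounds are inclusive; isdesc = false: lower bounds are inclusive.
getlabel(x::Real, r::LabelledRanges) =
    r.labels[r.isdesc ? searchsortedfirst(r.breakpoints, x) : 1 + searchsortedlast(r.breakpoints, x)]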

Usage:

julia> lr = LabelledRanges([0.0,0.1,0.3,0.5], ["below","A","B","C","above"], true);

julia> lr2 = LabelledRanges([0.0,0.1,0.3,0.5], ["below","A","B","C","above"], false);

julia> xx = [-1.0, 0.0, 0.01, 0.09, 0.1, 0.11, 0.29, 0.30, 0.31, 0.49, 0.50, 0.51];

julia> [xx [getlabel(x,lr) for x in xx] [getlabel(x,lr2) for x in xx]]
 -1.0   "below"  "below"
  0.0   "below"  "A"
  0.01  "A"      "A"
  0.09  "A"      "A"
  0.1   "A"      "B"
  0.11  "B"      "B"
  0.29  "B"      "B"
  0.3   "B"      "C"
  0.31  "C"      "C"
  0.49  "C"      "C"
  0.5   "C"      "above"
  0.51  "above"  "above"

You may prefer to make the “direction of evaluation” a parameter of the getlabel() function instead of a property of LabelledRanges. Both choices are defensible, I think.
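A minimal sketch of that second option, assuming the breakpoints and labels are passed around as plain vectors. The name getlabel_dir and its isdesc keyword are made up for illustration and are not part of the code above:

```julia
# Hypothetical variant: the boundary rule is a keyword argument, not a struct field.
function getlabel_dir(x::Real, breakpoints::AbstractVector{<:Real}, labels::AbstractVector;
                      isdesc::Bool=true)
    length(labels) == length(breakpoints) + 1 ||
        throw(ArgumentError("there must be exactly one more label than breakpoints"))
    # isdesc = true:  intervals are (b[i], b[i+1]], so count breakpoints strictly below x.
    # isdesc = false: intervals are [b[i], b[i+1]), so count breakpoints at or below x.
    i = isdesc ? searchsortedfirst(breakpoints, x) : searchsortedlast(breakpoints, x) + 1
    return labels[i]
end

breakpoints = [0.0, 0.1, 0.3, 0.5]
labels = ["below", "A", "B", "C", "above"]

getlabel_dir(0.1, breakpoints, labels)                # "A"
getlabel_dir(0.1, breakpoints, labels; isdesc=false)  # "B"
```

With this layout the same grid can be evaluated under either convention without building two objects, at the cost of passing the flag on every call.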

I think you want Match.jl

EDIT: I wrote an example I thought would work but it doesn’t work. But maybe this can be solved with Match.jl

This is called binning, and it is used very frequently when discretizing data. See, for example, this code for StatsBase.Histogram:

https://github.com/JuliaStats/StatsBase.jl/blob/00b5d9912aa6599f2b07c390f4c7f5a097a70dee/src/hist.jl#L175-L186

Perhaps it would be good to expose this intermediate step in StatsBase.
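
To make the connection concrete, here is a rough sketch of binning with searchsortedlast, using the Cohen’s d grid quoted earlier in the thread. This is not the StatsBase implementation; binindex, cohen_edges and cohen_labels are hypothetical names, and treating each bin as including its lower edge is an assumption:

```julia
# Breakpoints and labels from the Cohen's d grid quoted earlier in the thread.
cohen_edges  = [0.2, 0.5, 0.8]
cohen_labels = ["negligible", "small", "medium", "large"]

# Assumption: each bin includes its lower edge, e.g. [0.2, 0.5) is "small".
# searchsortedlast counts the edges that are <= x; adding 1 gives the bin index.
binindex(x::Real, edges::AbstractVector{<:Real}) = searchsortedlast(edges, x) + 1

ds = [0.05, 0.2, 0.49, 0.5, 0.79, 0.8, 1.3]

# Ref(...) keeps the edge vector from being broadcast over element by element.
binned = cohen_labels[binindex.(ds, Ref(cohen_edges))]
# ["negligible", "small", "small", "medium", "medium", "large", "large"]
```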

How about IntervalTrees.jl (BioJulia/IntervalTrees.jl on GitHub), a data structure for efficient manipulation of sets of intervals?

Thank you all for your suggestions, it is awesome to see how helpful the community is!

I think I will stick with @NiclasMattsson’s suggestion, which is simple, flexible (being tailor-made for this task), and doesn’t need additional dependencies. Moreover, I suppose it’s an efficient solution, as it is based on vectors…

Thanks 🙂

As long as you know beforehand whether you need to deal with Int/Float32/Float64, I’d suggest storing all the breakpoints as left-open, right-closed intervals, using prevfloat where necessary. That is, [L,R] = (prevfloat(L), R] and (L,R) = (L, prevfloat(R)]. For integers, subtract one instead. That way you avoid looking up isdesc.
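
A small sketch of this idea, assuming Float64 breakpoints. The names edges, edges_rc and lookup are hypothetical. Making a lower edge L inclusive while keeping the interval left-open means moving that edge one float down, which is why the sketch uses prevfloat:

```julia
# Hypothetical example: store only right-closed upper edges so one lookup rule suffices.
edges  = [0.0, 0.1, 0.3, 0.5]
labels = ["below", "A", "B", "C", "above"]

# "Ascending" rule [L, R) rewritten as (prevfloat(L), prevfloat(R)]:
# shifting every edge down by one float makes the edge value itself
# fall into the next interval.
edges_rc = prevfloat.(edges)

# Right-closed lookup: index of the first edge that is >= x.
lookup(x::Real, e::AbstractVector{<:Real}, l::AbstractVector) = l[searchsortedfirst(e, x)]

lookup(0.1, edges, labels)     # "A": 0.1 belongs to (0.0, 0.1]
lookup(0.1, edges_rc, labels)  # "B": 0.1 belongs to [0.1, 0.3)
```

The boundary convention is then baked into the stored numbers, so the lookup itself never needs to branch.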
