Elegant value classification following a set of rules

DominiqueMakowski · August 30, 2018, 7:49pm

I have some text labels for different ranges.

For example, “A” if x is between 0 and 1, “B” if x is between 1 and 2, and “C” otherwise.

I would like to find a neat way to store such information, and then have a function that retrieves the label for a given value of x.

I’ve tried with a dict:

d = Dict(
    "A" => 0,
    "B" => 1,
    "C" => 2,
    "D" => 3
    )

Then I managed to collect the keys and values, but something tells me this might not be the best way to do it.

Any ideas?

NiclasMattsson · August 30, 2018, 8:25pm

You need to specify the problem more precisely. Just a few questions off the top of my head:

Are floating point ranges allowed? Your example only has integers.
Can there be gaps between the ranges? Can they overlap?
Is there a rule for what happens at the boundaries/overlaps, or can you return an arbitrary adjacent range? For example, what should your function return given an input of exactly 1?
Can the boundaries be hard-coded? If not, will they be supplied sorted?

DominiqueMakowski · August 30, 2018, 8:48pm

Some details:

Type: yes there would be floats
Coverage: no gaps nor overlap between ranges
Boundaries: What happens at boundaries is a good question. I believe there could be a parameter indicating the “direction” of evaluation. For example, let’s imagine the boundaries are 0, 0.1, 0.3 and 0.5.
- In “ascending” order it would be: 0 <= x < 0.1; 0.1 <= x < 0.3; 0.3 <= x < 0.5 and x >= 0.5
- In “descending” order it would be: 0 < x <= 0.1; 0.1 < x <= 0.3; 0.3 < x <= 0.5 and x > 0.5

In fact, I would like to implement a system that applies rule-of-thumb heuristics to interpret metrics. For example, there is the infamous grid for Cohen’s d (1988), proposing that 0-0.2 = “negligible”, 0.2-0.5 = “small”, 0.5-0.8 = “medium” and >0.8 = “large”.
I would like a system to conveniently store such grids (I thought of Dict at first) and a function that takes a value and such a rule set and returns the correct label.
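
To make the request concrete, here is a minimal sketch of one way such a grid could be written down: sorted breakpoints plus exactly one more label than breakpoints. The names `COHEN_BREAKS`, `COHEN_LABELS` and `interpret` are hypothetical, and the sketch assumes x is a non-negative magnitude and that a value sitting exactly on a breakpoint belongs to the range above it.

```julia
# Hypothetical names; thresholds taken from the grid quoted above.
# Assumes x is a non-negative magnitude and that a breakpoint belongs
# to the range above it (the "ascending" convention).
const COHEN_BREAKS = [0.2, 0.5, 0.8]
const COHEN_LABELS = ["negligible", "small", "medium", "large"]

# Count how many breakpoints are <= x, then pick the matching label.
interpret(x::Real; breaks = COHEN_BREAKS, labels = COHEN_LABELS) =
    labels[count(b -> b <= x, breaks) + 1]

interpret(0.35)  # "small"
interpret(0.5)   # "medium"
```

The linear `count` is fine for a handful of thresholds; a binary search only starts to matter for long breakpoint vectors.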

NiclasMattsson · August 30, 2018, 10:29pm

If you store labels and breakpoints as arrays then you can immediately take advantage of the searchsortedlast() function.

struct LabelledRanges{T<:Real,L}
    breakpoints::Vector{T}
    labels::Vector{L}
    isdesc::Bool

    # Inner constructor, so the checks cannot be bypassed by the default constructors.
    function LabelledRanges(breakpoints::AbstractVector{T}, labels::AbstractVector{L},
                            isdesc::Bool=true) where {T<:Real,L}
        if length(breakpoints) != length(labels) - 1
            error("There must be exactly one more label than breakpoints.")
        elseif !issorted(breakpoints)
            error("Breakpoints must be sorted.")
        end
        new{T,L}(breakpoints, labels, isdesc)
    end
end

getlabel(x::Real, r::LabelledRanges) = r.labels[1+searchsortedlast(r.breakpoints, x, lt=(r.isdesc ? (<=) : (<)))]

Usage:

julia> lr = LabelledRanges([0.0,0.1,0.3,0.5], ["below","A","B","C","above"], true);

julia> lr2 = LabelledRanges([0.0,0.1,0.3,0.5], ["below","A","B","C","above"], false);

julia> xx = [-1.0, 0.0, 0.01, 0.09, 0.1, 0.11, 0.29, 0.30, 0.31, 0.49, 0.50, 0.51];

julia> [xx [getlabel(x,lr) for x in xx] [getlabel(x,lr2) for x in xx]]
 -1.0   "below"  "below"
  0.0   "below"  "A"
  0.01  "A"      "A"
  0.09  "A"      "A"
  0.1   "A"      "B"
  0.11  "B"      "B"
  0.29  "B"      "B"
  0.3   "B"      "C"
  0.31  "C"      "C"
  0.49  "C"      "C"
  0.5   "C"      "above"
  0.51  "above"  "above"

You may prefer to make the “direction of evaluation” a parameter of the getlabel() function instead of a property of LabelledRanges. Both choices are defensible, I think.
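
For comparison, here is a rough sketch of the other option, where the boundary rule is passed per call rather than stored with the ranges. The function name `label_for` and the keyword `closed` are hypothetical, not part of the code above. It uses `searchsortedlast` for left-closed ranges and `searchsortedfirst` for right-closed ones, which avoids passing a custom `lt`.

```julia
# Hypothetical helper: the boundary rule is chosen per call, not stored with the grid.
# closed = :left  -> ranges are [a, b)  (a breakpoint belongs to the range above it)
# closed = :right -> ranges are (a, b]  (a breakpoint belongs to the range below it)
function label_for(x::Real, breakpoints::AbstractVector{<:Real}, labels::AbstractVector;
                   closed::Symbol = :left)
    length(labels) == length(breakpoints) + 1 ||
        throw(ArgumentError("need exactly one more label than breakpoints"))
    issorted(breakpoints) || throw(ArgumentError("breakpoints must be sorted"))
    idx = if closed === :left
        searchsortedlast(breakpoints, x)        # number of breakpoints <= x
    elseif closed === :right
        searchsortedfirst(breakpoints, x) - 1   # number of breakpoints < x
    else
        throw(ArgumentError("closed must be :left or :right"))
    end
    return labels[idx + 1]
end

breaks = [0.0, 0.1, 0.3, 0.5]
labels = ["below", "A", "B", "C", "above"]

label_for(0.1, breaks, labels)                   # "B"
label_for(0.1, breaks, labels; closed = :right)  # "A"
```

Here `closed = :right` corresponds to `isdesc = true` in the table above, and `closed = :left` to `isdesc = false`.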

pdeffebach · August 30, 2018, 10:30pm

I think you want Match.jl

EDIT: I wrote an example that I thought would work, but it doesn’t. Maybe this can still be solved with Match.jl, though.

Tamas_Papp · August 31, 2018, 6:26am

This is called binning, and it is used very frequently when discretizing data. For example, see this code for StatsBase.Histogram:

https://github.com/JuliaStats/StatsBase.jl/blob/00b5d9912aa6599f2b07c390f4c7f5a097a70dee/src/hist.jl#L175-L186

Perhaps it would be good to expose this intermediate step in StatsBase.
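
To illustrate what that intermediate step amounts to, here is a small sketch using only Base. It is not the StatsBase implementation; the names `bin_index` and `bin_counts` are made up for this example, and it assumes sorted edges with left-closed bins.

```julia
# Hypothetical helpers, Base only. Bin 0 is "below the first edge",
# bin i is [edges[i], edges[i+1]), and the last bin is "at or above the last edge".
bin_index(x::Real, edges::AbstractVector{<:Real}) = searchsortedlast(edges, x)

# Count how many values land in each bin (including the two open-ended ones).
function bin_counts(data, edges::AbstractVector{<:Real})
    counts = zeros(Int, length(edges) + 1)
    for x in data
        counts[bin_index(x, edges) + 1] += 1
    end
    return counts
end

edges = [0.0, 0.1, 0.3, 0.5]
data  = [-1.0, 0.0, 0.05, 0.1, 0.2, 0.3, 0.45, 0.5, 0.7]

bin_counts(data, edges)  # [1, 2, 2, 2, 2]
```

Looking up a label is then just indexing a label vector with `bin_index(x, edges) + 1`.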

yakir12 · August 31, 2018, 6:43am

How about IntervalTrees.jl (BioJulia/IntervalTrees.jl), a data structure for efficient manipulation of sets of intervals?

DominiqueMakowski · August 31, 2018, 6:51am

Thank you all for your suggestions, it is awesome to see how helpful the community is!

I think I will stick with @NiclasMattsson’s suggestion, which is quite simple, flexible because it is tailor-made for this task, and doesn’t need additional dependencies. Moreover, I suppose it’s an efficient solution, as it is based on vectors…

Thanks

foobar_lv2 · August 31, 2018, 8:04am

As long as you know beforehand whether you need to deal with Int/Float32/Float64, I’d suggest storing the breakpoints all as left-open right-closed intervals, using prevfloat if necessary. That is, [L,R]=(prevfloat(L),R], (L,R)=(L,prevfloat(R)]. For integers you subtract one instead. That way you avoid looking up isdesc.
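
A minimal sketch of that idea, assuming Float64 breakpoints and Float64 inputs. The names `to_right_closed` and `lookup` are hypothetical. Since no float lies strictly between `prevfloat(b)` and `b`, the test `x >= b` is equivalent to `x > prevfloat(b)`, so a left-closed grid can be shifted once and then queried with a single fixed rule.

```julia
# Hypothetical helper: turn "left-closed" breakpoints (x >= b starts a new range)
# into equivalent "right-closed" ones (x > b starts a new range).
to_right_closed(breakpoints::AbstractVector{<:AbstractFloat}) = prevfloat.(breakpoints)

# With one fixed convention there is no direction flag to look up:
# searchsortedfirst returns 1 + (number of breakpoints strictly below x).
lookup(x::Real, breakpoints, labels) = labels[searchsortedfirst(breakpoints, x)]

breaks = [0.0, 0.1, 0.3, 0.5]
labels = ["below", "A", "B", "C", "above"]

lookup(0.1, breaks, labels)                   # "A": 0.1 belongs to the range below it
lookup(0.1, to_right_closed(breaks), labels)  # "B": 0.1 belongs to the range above it
```

The shift has to be done in the same floating-point type as the values being looked up; mixing Float32 breakpoints with Float64 inputs would break the equivalence.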
