Creating data bins with numeric labels with cut()

This Python code sorts values into bins and gives each bin a numeric label:

pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])

I want to do this in Julia, but it seems that Julia’s cut() function only allows for string labels:

cut(housingDF[:,:median_income], [0., 1.5, 3.0, 4.5, 6., 100.],  labels=[1, 2, 3, 4, 5])
ERROR: TypeError: in keyword argument labels, expected Union{Function, AbstractVector{var"#s54"} where var"#s54"<:AbstractString}, got a value of type Vector{Int64}

It works with string labels though:

cut(housingDF[:,:median_income], [0., 1.5, 3.0, 4.5, 6., 100.], labels=["1", "2", "3", "4", "5"])

How can I get a numerical label for a data bin in Julia?

A quick workaround is to convert the labels to String values:
labels=string.([1, 2, 3, 4, 5])

I want numerical labels, not String labels.

It’d help if you said where you got this cut function from. Julia itself does not provide a function called cut, so I’m guessing this came from a package?

CategoricalArrays?

It’s from the CategoricalArrays package. (The first snippet is from Python’s pandas.)

It doesn’t seem to be possible at the moment.

@nalimilan

IIRC I added that type assertion because inference failed without it. But maybe the compiler has improved since then and/or there are other solutions, like using ::Vector{eltype(labels)}. Feel free to try that and make a PR if it works. You’ll also have to change String to eltype(labels) on line 207 and, of course, widen the type of labels in the method signature.

That said, you can easily get numeric values using levelcode.(cut(housingDF.median_income, [0., 1.5, 3.0, 4.5, 6., 100.])). The difference is that you’ll get a plain Vector{Int} rather than a CategoricalVector{Int}, which may or may not be what you want.
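
As a rough illustration of the levelcode approach, here is a minimal sketch. The income vector is made-up sample data standing in for housingDF.median_income, and the breaks are the ones from the question. Since cut builds left-closed intervals, a value equal to a break lands in the bin that starts at that break.

```julia
using CategoricalArrays

# Hypothetical sample values standing in for housingDF.median_income
income = [0.5, 1.5, 2.9, 3.2, 5.0, 7.2]

# Default labels are strings describing each interval, e.g. "[0.0, 1.5)"
binned = cut(income, [0., 1.5, 3.0, 4.5, 6., 100.])

# levelcode gives the integer position of each value's level,
# so the result is a plain vector of integers, not a CategoricalVector
codes = levelcode.(binned)
```

With this sample data, codes should hold one integer per value, where 1.5 and 2.9 share the second bin.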

(BTW, housingDF[:,:median_income] makes an unnecessary copy of the column; you can use housingDF[!, :median_income] or housingDF.median_income instead.)
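
To make the copy versus no-copy difference concrete, here is a small sketch. The housingDF below is a tiny made-up stand-in, not the real data set.

```julia
using DataFrames

# Hypothetical stand-in for the real housing data
housingDF = DataFrame(median_income = [0.5, 1.5, 3.2])

copied = housingDF[:, :median_income]   # allocates a new vector
direct = housingDF[!, :median_income]   # returns the stored column itself

# === tests object identity: only the `!` form is the same object
# as the column you get from housingDF.median_income
same_object = direct === housingDF.median_income
is_copy     = copied !== housingDF.median_income
```

Both forms hold the same values; only the `:` form pays for an extra allocation.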

Inference seems to be fine on Julia 1.6 and 1.7 at least.

https://github.com/JuliaData/CategoricalArrays.jl/pull/393

Definitely not in 1.6.6.

Works now:

      Status `C:\Users\stephan\.julia\environments\v1.7\Project.toml`
  [324d7699] CategoricalArrays v0.10.6

julia> cut(0:10, 0:5:15, labels=[2.5, 7.5, 12.5])
11-element CategoricalArray{Float64,1,UInt32}:
 2.5
 2.5
 2.5
 2.5
 2.5
 7.5
 7.5
 7.5
 7.5
 7.5
 12.5
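
Applied to the breaks from the original question, the same call might look like the sketch below. It assumes a CategoricalArrays version that accepts non-string labels (v0.10.6 in the status output above), and income is again made-up sample data rather than the real column.

```julia
using CategoricalArrays

# Hypothetical sample values standing in for housingDF.median_income
income = [0.5, 1.5, 2.9, 3.2, 5.0, 7.2]

# Integer labels, one per interval (five intervals from six breaks)
binned = cut(income, [0., 1.5, 3.0, 4.5, 6., 100.], labels=[1, 2, 3, 4, 5])

# unwrap strips the categorical wrapper and returns the underlying labels
plain = unwrap.(binned)
```

Here binned stays a categorical vector with integer levels, while plain is an ordinary vector of the labels.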

How do you create right-closed intervals?
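
One possible workaround that does not go through cut at all is to compute the bin index with Base's searchsortedfirst. This is only a sketch: rightclosed_bin is a hypothetical helper name, not part of CategoricalArrays, and it returns plain integers rather than a CategoricalVector.

```julia
# Hypothetical helper, not part of CategoricalArrays: returns the index i of the
# right-closed interval (breaks[i], breaks[i+1]] that contains x.
# Assumes breaks is sorted in increasing order.
function rightclosed_bin(x, breaks)
    # index of the first break >= x, minus one, is the interval number
    i = searchsortedfirst(breaks, x) - 1
    1 <= i < length(breaks) ||
        throw(DomainError(x, "value lies outside (first break, last break]"))
    return i
end

breaks = [0., 1.5, 3.0, 4.5, 6., 100.]
income = [0.5, 1.5, 3.0, 3.2, 6.0, 7.2]   # hypothetical sample values

# Ref(breaks) keeps the breaks vector from being broadcast over
codes = rightclosed_bin.(income, Ref(breaks))
```

With these values, 1.5 falls in the first bin (0.0, 1.5] and 3.0 in the second, which is the opposite of what the left-closed intervals from cut would give. A value equal to the first break is rejected, since it belongs to no right-closed interval.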