Creating data bins with numeric labels with cut()

This Python code sorts data into bins and gives each bin a numerical label:

pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
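For reference, here is a minimal pure-Python sketch of what this call computes. It assumes pandas' default right=True, so each bin is right-closed, (a, b], and it assumes values outside every bin get no label. The helper name cut_right_closed and the sample incomes are made up for illustration; this is not pandas' implementation.

```python
from bisect import bisect_left
from math import inf


def cut_right_closed(values, edges, labels):
    """Label each value by the right-closed bin (edges[i], edges[i+1]] it falls in."""
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    out = []
    for x in values:
        i = bisect_left(edges, x)  # smallest i with x <= edges[i]
        # i == 0 means x <= edges[0]; i == len(edges) means x > edges[-1].
        # Both cases are outside every bin, so no label is assigned.
        out.append(labels[i - 1] if 0 < i < len(edges) else None)
    return out


# Hypothetical sample values standing in for housing["median_income"]
incomes = [0.5, 1.5, 2.0, 4.5, 8.3]
income_cat = cut_right_closed(incomes, [0.0, 1.5, 3.0, 4.5, 6.0, inf], [1, 2, 3, 4, 5])
print(income_cat)
```

The labels come back as plain integers, which is the behaviour I am trying to reproduce.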

I want to do this in Julia, but it seems that Julia’s cut() function only allows for string labels:

cut(housingDF[:,:median_income], [0., 1.5, 3.0, 4.5, 6., 100.], labels=[1, 2, 3, 4, 5])
ERROR: TypeError: in keyword argument labels, expected Union{Function, AbstractVector{var"#s54"} where var"#s54"<:AbstractString}, got a value of type Vector{Int64}

It works with string labels though:

cut(housingDF[:,:median_income], [0., 1.5, 3.0, 4.5, 6., 100.], labels=["1", "2", "3", "4", "5"])

How can I get a numerical label for a data bin in Julia?

A quick work-around is to convert to String values:
labels=string.([1, 2, 3, 4, 5])

I want numerical labels, not String labels.

It’d help if you said where you got this cut function from. Julia itself does not provide a function called cut so I’m guessing this came from a package?

CategoricalArrays?

It’s from the CategoricalArrays package. (The first snippet is from Python’s pandas.)

It seems this isn’t possible at the moment.

@nalimilan

IIRC I added that type assertion because inference failed without it. But maybe the compiler has improved since then and/or there are other solutions, like using ::Vector{eltype(labels)}. Feel free to try that and make a PR if it works. You’ll also have to change String to eltype(labels) on line 207, and of course widen the type of labels in the method signature.

That said, you can easily get numeric values using levelcode.(cut(housingDF.median_income, [0., 1.5, 3.0, 4.5, 6., 100.])). The difference is that you’ll get a plain Vector{Int} rather than a CategoricalVector{Int}, which may or may not be what you want.

(BTW, housingDF[:,:median_income] makes an unnecessary copy of the column; you can use housingDF[!, :median_income] or housingDF.median_income instead.)
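To make the levelcode suggestion concrete, here is a rough Python sketch of the integer codes it yields. It assumes cut's default left-closed intervals [a, b), 1-based codes in break order, and an error for values outside the breaks. The helper bin_codes and the sample values are hypothetical and only mimic the idea; this is not how CategoricalArrays is implemented.

```python
from bisect import bisect_right


def bin_codes(values, breaks):
    """Return 1-based bin codes for left-closed bins [breaks[i], breaks[i+1])."""
    codes = []
    for x in values:
        i = bisect_right(breaks, x)  # number of breaks that are <= x
        if i == 0 or i == len(breaks):
            # Assumption: a value outside the breaks is treated as an error
            raise ValueError(f"{x} does not fall inside the breaks")
        codes.append(i)
    return codes


# Hypothetical sample values standing in for housingDF.median_income
codes = bin_codes([0.5, 1.5, 2.0, 4.5, 8.3], [0.0, 1.5, 3.0, 4.5, 6.0, 100.0])
print(codes)
```

Values sitting exactly on a break (1.5 and 4.5 here) go to the upper bin under this convention, whereas right-closed binning such as pandas' default would put them in the lower one.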

Inference seems to be fine without the assertion on Julia 1.6 and 1.7 at least:

https://github.com/JuliaData/CategoricalArrays.jl/pull/393

Definitely not in 1.6.6.

Works now:

      Status `C:\Users\stephan\.julia\environments\v1.7\Project.toml`
  [324d7699] CategoricalArrays v0.10.6

julia> cut(0:10, 0:5:15, labels=[2.5, 7.5, 12.5])
11-element CategoricalArray{Float64,1,UInt32}:
 2.5
 2.5
 2.5
 2.5
 2.5
 7.5
 7.5
 7.5
 7.5
 7.5
 12.5

How do you create right-closed intervals?