Pleased to announce the Imbalance.jl package, a Julia toolkit offering a wide range of established oversampling and undersampling techniques for tackling class imbalance and improving classification model performance.
Features

- Supports multi-class variants of the algorithms, with methods that handle both nominal and continuous features
- Supports table input/output formats as well as matrices
- Comprehensively documented, with illustrative (visual) and practical examples for each method in the method documentation and examples sections
- Provides MLJ and TableTransforms interfaces alongside the default pure functional interface for each method
- Can wrap an arbitrary number of resampling models together with an MLJ classification model via MLJBalancing so they function as one unified model
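As a quick illustration of the pure functional interface, here is a hedged sketch; the function name `random_oversample` and the `ratios` keyword are taken from the package docs, so treat the exact signature as an assumption:

```julia
using Imbalance

X = rand(100, 3)                        # feature matrix (rows = observations)
y = [i <= 20 ? 1 : 0 for i in 1:100]    # 20 minority labels, 80 majority labels

# Oversample every class up to the majority-class count (ratios = 1.0)
Xover, yover = random_oversample(X, y; ratios = 1.0)
```

The same `(X, y) -> (X, y)` shape is shared by the other methods, so swapping in a different resampler is a one-line change.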
You can read more about the package and its features in this Julia Forem article.
Acknowledgements
Sincere thanks go to Anthony Blaom (@ablaom) for mentoring me in Google Summer of Code, where this project was proposed, and special thanks also go to Rik Huijzer (@rikh) for his friendliness and for the binary SMOTE implementation in Resample.jl.
P.S. As a new user on Discourse, I am restricted in the number of links this post can include; the Julia Forem article has more links.
Great stuff! It’s great to see the Statistical Learning ecosystem moving forward one step at a time with packages like this and StatisticalMeasures.jl.
One query, would this be able to compose with a stratified sampling scheme? A project I’m working on has two (extremely imbalanced) categories of data, and within each category there are blocks of (highly correlated) entries, and so I must employ a two-level sampling scheme where first I pick blocks, undersampling from the larger category, and then within each block randomly select an entry.
Glad to hear that you have liked it. Thank you.
> One query, would this be able to compose with a stratified sampling scheme?
It composes with anything that operates on (takes and returns) X, y data (where y can also be a column in X).
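For instance, since every method maps `(X, y)` to a resampled `(X, y)`, two steps can be chained directly. A hedged sketch, with function names and `ratios` semantics assumed from the package docs:

```julia
using Imbalance

X = rand(200, 2)
y = [i <= 20 ? "minority" : "majority" for i in 1:200]

# Step 1: oversample the minority class up to half the majority count
Xo, yo = random_oversample(X, y; ratios = 0.5)
# Step 2: undersample so all classes match the smallest class count
Xu, yu = random_undersample(Xo, yo; ratios = 1.0)
```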
> A project I’m working on …
As far as I understand, the idea you are referring to here is cluster sampling. The naive `RandomUndersampler` provided in Imbalance.jl won’t do that; it just deletes examples randomly within each class, irrespective of block, which would only give the desired effect if you have enough data. However, you can also try `ClusterUndersampler`, which is essentially cluster sampling performed on each class, where the groups within each class are decided by k-means. Otherwise, for a hacky solution: if you perform naive random undersampling with `X` holding all the data in the majority category and `y` labeling which block each data point belongs to, then you should be able to set the `ratios` hyperparameter to achieve your desired effect.
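For comparison, the two-level idea itself can be sketched in plain Julia, independent of Imbalance.jl; the block ids and entry data here are made up for illustration:

```julia
using Random

# Level 1: undersample whole blocks; level 2: pick one entry per chosen block.
function two_level_sample(blocks::Dict{Int,Vector{Int}}, nblocks::Int;
                          rng = Random.default_rng())
    chosen = shuffle(rng, collect(keys(blocks)))[1:nblocks]   # sample blocks
    return [rand(rng, blocks[b]) for b in chosen]             # one entry each
end

# 8 blocks of 5 correlated entries each (entry ids 10b+1 .. 10b+5)
blocks = Dict(b => collect(10b .+ (1:5)) for b in 1:8)
picked = two_level_sample(blocks, 3; rng = MersenneTwister(0))
```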
You had me at “well-documented.” But seriously, awesome work!