[ANN] SIRUS.jl v2: Interpretable Machine Learning via Rule Extraction


Version 2 of the SIRUS.jl package has been registered. Since version 1, some outward-facing data structures have changed to be more intuitive. More importantly, the implementation has become simpler because an unnecessary step was removed, and the performance has improved thanks to many small fixes :rocket:. Benchmarks run as part of CI. A recent run is available here. As can be seen, SIRUS.jl performs better than the original algorithm on most tasks, especially multi-class classification, which wasn’t implemented in the R version. The performance is also similar to linear models and XGBoost on many tasks, but this of course depends a lot on the dataset.

The strength of SIRUS is that the model generates a set of interpretable rules from random forests. In essence, the algorithm reduces the complexity of random forests to a set of interpretable rules. These rules can then be used for both prediction and explanation. This differs from common interpretability methods such as SHAP, where the complex model is still used for predictions while a simplified representation is used for interpretation. That mismatch between the predicting model and its explanation may hide reliability issues or biases. For more information, see our paper at JOSS (thanks to @jbytecode, @sylvaticus, and @gdalle for their work in editing and reviewing!).
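To make the idea of reducing a forest to a few rules more concrete, here is a minimal sketch in plain Python. It is not SIRUS.jl's API or implementation: the function names (`quantile_cutoffs`, `best_stump`, `extract_rules`), the toy data, and the threshold `p0` are all hypothetical, and the "trees" are depth-1 stumps. It only illustrates the general pattern of fitting many trees on bootstrap samples with splits restricted to a few quantile cutoffs, then keeping the splits that recur often.

```python
import random
from collections import Counter


def quantile_cutoffs(values, q=4):
    """Candidate split points: up to q-1 empirical quantiles of one feature."""
    s = sorted(values)
    return sorted({s[(len(s) * k) // q] for k in range(1, q)})


def best_stump(rows, labels, cutoffs):
    """Return the (feature, cutoff) split with the fewest misclassifications."""
    best, best_err = None, None
    for f, cuts in enumerate(cutoffs):
        for c in cuts:
            left = [y for x, y in zip(rows, labels) if x[f] < c]
            right = [y for x, y in zip(rows, labels) if x[f] >= c]
            if not left or not right:
                continue
            # Errors when each side predicts its own majority class.
            err = sum(len(side) - max(Counter(side).values())
                      for side in (left, right))
            if best_err is None or err < best_err:
                best, best_err = (f, c), err
    return best


def extract_rules(rows, labels, n_trees=200, p0=0.2, seed=0):
    """Keep splits that occur in more than a fraction p0 of the stumps."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Cutoffs are fixed once on the full data, so stumps can agree exactly.
    cutoffs = [quantile_cutoffs([x[f] for x in rows]) for f in range(n_features)]
    counts = Counter()
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        split = best_stump([rows[i] for i in idx],
                           [labels[i] for i in idx], cutoffs)
        if split is not None:
            counts[split] += 1
    return {rule: n / n_trees for rule, n in counts.items() if n / n_trees > p0}


# Toy data (hypothetical): feature 0 decides the label, feature 1 is noise.
rows = [(i, (i * 7) % 11) for i in range(40)]
labels = [1 if x0 >= 20 else 0 for x0, _ in rows]

rules = extract_rules(rows, labels)
for (feature, cutoff), freq in sorted(rules.items(), key=lambda kv: -kv[1]):
    print(f"if x[{feature}] < {cutoff}: frequency {freq:.2f}")
```

Restricting splits to a small, fixed set of cutoffs is what lets identical rules show up across different bootstrap samples, so that counting them becomes meaningful. The real algorithm works with deeper trees and also combines the selected rules into a predictor, which this sketch leaves out.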

What I’m personally excited about with SIRUS.jl is that the algorithm is relatively simple and produces relatively good results, especially since version 2. The model has more flexibility than linear models because the algorithm can choose cutoff values in the data, just like decision trees and random forests, which have shown “excellent performance in settings where the number of variables is much larger than the number of observations” (Biau & Scornet, 2015).

What is most exciting is that the original algorithm (which is equivalent to SIRUS.jl in terms of performance) has by now been used in multiple domains. For example, in manufacturing, black-box models were not suitable since “any decision impacting the production process has long-term and heavy consequences” (Benard, 2020). The algorithm has also been applied in lithium-ion battery research (Wang et al., 2023), irrigated watersheds (Li et al., 2022), mangrove mapping (Zhao et al., 2023), malicious traffic detection (Dong et al., 2023), acute respiratory distress syndrome prediction (Wu et al., 2022), decision support systems (Valente et al., 2021), cancer subtyping (Cavinato et al., 2022), special forces selection (Huijzer et al., 2023), and probably more.

I hope that SIRUS.jl can be useful for Julia datasets too! The random forest implementation currently performs a bit poorly on regression tasks and I haven’t figured out great ways to visualize the rules yet, but I think we can fix these issues and provide a very useful statistical model.


Thank you for your work! I’ve just recommended this package to one of the research assistants who works on these kinds of subjects at our university. It is also nice to see that your paper is published in the Journal of Open Source Software. Good luck with it.


That picture is beautiful. Thanks for the package. My students are now working on rule learning, so this is great for comparison, to see how good (or more probably poor) our approach will be.


It was a nice experience to publish there :smile: Things went extremely quickly compared to other journals and communication was clear.

That’s cool that students learn about rule-based methods! I’m curious


@rikh I made some comments in the SIRUS v1 topic that I’m not sure you looked at? I’d still be curious to know if you’ve looked at random intersection trees or iterative random forests at all, and whether SIRUS.jl might work with a custom bootstrap step?


I did read them but didn’t understand and then forgot to answer. My apologies. I’ve answered at [ANN] SIRUS.jl v1.2: Interpretable Machine Learning via Rule Extraction - #7 by rikh.
