Version 2 of the SIRUS.jl package has been registered. Since version 1, some outward-facing data structures have changed to be more intuitive. More importantly, the implementation has gotten simpler because an unnecessary step was removed, and the performance has improved due to all kinds of small fixes. Benchmarks run as part of CI. A recent run is available here. As can be seen, SIRUS.jl performs better than the original algorithm on most tasks, especially on multi-class classification, which wasn’t implemented in the R version. The performance is also similar to linear models and XGBoost on many tasks, but this of course depends a lot on the dataset.
The strength of SIRUS is that the model generates a set of interpretable rules from random forests. In essence, the algorithm reduces the complexity of random forests to a set of interpretable rules. These same rules are then used for both prediction and explanation. This differs from common interpretability methods such as SHAP, where the complex model is still used for predictions while a simplified representation is used for interpretation. That gap between the model that predicts and the representation that explains may hide reliability issues or biases. For more information, see our paper at JOSS (thanks to @jbytecode, @sylvaticus, and @gdalle for their work in editing and reviewing!).
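To make the "same rules for prediction and explanation" idea concrete, here is a minimal sketch in plain Julia. It does not use the SIRUS.jl API: the `Rule` struct, the function names, the feature names, and all numbers are made up for illustration, and combining the rules as a weighted sum is an assumption of this sketch.

```julia
# Hypothetical illustration; none of these names are part of the SIRUS.jl API.
struct Rule
    feature::Symbol      # name of the feature the rule looks at
    cutoff::Float64      # cutoff value (made up here; a real model picks it from the data)
    then_value::Float64  # output when the feature value is below the cutoff
    else_value::Float64  # output otherwise
    weight::Float64      # contribution of this rule to the final prediction
end

# Output of one rule for a single observation, given as a NamedTuple.
rule_output(r::Rule, row) = row[r.feature] < r.cutoff ? r.then_value : r.else_value

# Assumption of this sketch: the prediction is the weighted sum of all rule outputs.
predict_rules(rules, row) = sum(r.weight * rule_output(r, row) for r in rules)

rules = [
    Rule(:age, 40.0, 0.2, 0.6, 0.5),
    Rule(:nodes, 8.0, 0.3, 0.9, 0.5),
]

row = (age = 35.0, nodes = 10.0)
prediction = predict_rules(rules, row)

# The explanation is the list of rules itself, so each contribution can be printed.
for r in rules
    println("if ", r.feature, " < ", r.cutoff, " then ", r.then_value,
            " else ", r.else_value, " (weight ", r.weight, ") => ",
            r.weight * rule_output(r, row))
end
println("prediction: ", prediction)
```

In this sketch, the printed rules are exactly what produced the prediction, so there is no separate surrogate model that could disagree with the model doing the predicting.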
What I’m personally excited about with SIRUS.jl is that the algorithm is relatively simple and produces relatively good results, especially since version 2. The model has more flexibility than linear models because the algorithm can choose cutoff values in the data, just like decision trees and random forests, which have shown “excellent performance in settings where the number of variables is much larger than the number of observations” (Biau & Scornet, 2015).
What is most exciting is that the original algorithm (which is equivalent to SIRUS.jl in terms of performance) has by now been used in multiple domains. For example, in manufacturing, black box models were not suitable since “any decision impacting the production process has long-term and heavy consequences” (Benard, 2020). The algorithm has also been applied in lithium-ion battery research (Wang et al., 2023), irrigated watersheds (Li et al., 2022), mangrove mapping (Zhao et al., 2023), malicious traffic detection (Dong et al., 2023), acute respiratory distress syndrome prediction (Wu et al., 2022), decision support systems (Valente et al., 2021), cancer subtyping (Cavinato et al., 2022), special forces selection (Huijzer et al., 2023), and probably more.
I hope that SIRUS.jl can be useful for Julia datasets too! The random forest implementation currently performs a bit poorly on regression tasks and I haven’t figured out great ways to visualize the rules yet, but I think we can fix these issues and provide a very useful statistical model.