New Julia machine learning package: NovaML

Hi everyone,

I have recently published a new machine learning package: NovaML.

I’ve been using Julia for several years now, and it’s become my go-to language for most of my work. However, I’ve noticed that students and newcomers to machine learning sometimes find Julia’s ML ecosystem a bit challenging, especially if they’re coming from frameworks like Scikit-Learn.

I started developing NovaML as an effort to contribute to the Julia machine learning community. The package aims to leverage Julia’s powerful features, particularly functors and multiple dispatch, to create an intuitive and familiar interface for ML practitioners.

Here’s a quick example of how NovaML works:

using NovaML.LinearModel
using NovaML.Metrics

tree = DecisionTreeClassifier()

# Fit the model
tree(X, y);  

# Make predictions
ŷ = tree(X);  
accuracy_score(y, ŷ)

The module/method hierarchy is intentionally similar to Scikit-Learn’s to ease the transition for those familiar with it. But of course, my ultimate goal is not to make an exact copy of Scikit-Learn. For example, as seen above, a single model struct holds the model parameters, and calling it trains the model or makes predictions.

As another example, we can use pipe operations as follows, just like in base Julia:

sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression()

Xtrn |> sc |> pca |> X -> lr(X, ytrn)  # first pass: sc and pca fit & transform, lr fits
ŷtst = Xtst |> sc |> pca |> lr         # second pass: sc and pca only transform, lr predicts
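
The trick here is that transformers are dual-purpose functors: the first call fits and then transforms, while subsequent calls only transform using the already-learned parameters. A minimal sketch of the pattern (illustrative only, not the actual NovaML code):

using Statistics

mutable struct ToyScaler
    μ::Vector{Float64}
    σ::Vector{Float64}
    fitted::Bool
end
ToyScaler() = ToyScaler(Float64[], Float64[], false)

function (s::ToyScaler)(X::Matrix)
    if !s.fitted                   # first call: learn mean and std
        s.μ = vec(mean(X, dims=1))
        s.σ = vec(std(X, dims=1))
        s.fitted = true
    end
    return (X .- s.μ') ./ s.σ'     # every call: standardize
end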

NovaML is in its very early stages (currently v0.3.0). Several features are already implemented, many more are yet to come, and the existing algorithms need optimization for speed and accuracy.

However, I hope that with time and community input, it can grow into a valuable addition to Julia’s machine learning ecosystem. I’d be grateful for any contributions, feedback, or suggestions.

NovaML repo: https://github.com/ilkerarslan/NovaML.jl

Documentation page: https://ilkerarslan.github.io/NovaML.jl/stable/

28 Likes

Very interesting package! Do you think this could be an analogue to R’s tidymodels, in the sense that it is a common interface for many different packages and algorithms?

For example: a single package with several clustering algorithms sharing a similar interface would be a good thing (see RFC: ClusteringAPI.jl).

I also see that you implemented DBSCAN yourself. Do you plan on adding algorithms from other packages?

2 Likes

I think this is a great idea. Indeed, the ML approach in Julia can seem a bit different to people coming from R and Python. tidymodels and Scikit-Learn are both pretty useful and battle-tested, so any inspiration from them is probably good for the community. I will definitely play around a bit with NovaML. :slightly_smiling_face:

1 Like

This looks very cool! I’m curious if you’ve had a look at MLJ.jl. How does NovaML compare to MLJ? It seems to have similar goals.

6 Likes

There have been some efforts to come up with a common interface for machine learning models, such as LearnAPI.jl and StatsLearnModels.jl.

Do you have any thoughts on those previous efforts? I understand the desire to develop your own approach—it’s hard to create consensus on interfaces for machine learning.

Are all of the models in NovaML custom implementations, or do you reuse implementations from third-party packages?

5 Likes

Thank you very much.

If I understand your question correctly: my aim is to create a machine learning package from scratch in pure Julia, rather than an interface to other packages.

It has been one month since I published v0.0.1. My goal is to add as many algorithms as I can.

Thanks a lot. I don’t want to set expectations too high. This is my humble attempt for now, but who knows? Maybe, with the contributions of the community, we can create something nice.

Thanks a lot. Yes, I have been using MLJ.jl for a while and it is a great package. But when I use MLJ in a project, I have to add many other packages.

My aim is to provide access to as many features as possible through a simple, unified interface.

Hi,

Currently, all of them are implementations from scratch, and I want to keep it this way. There is certainly plenty of room for efficiency and optimization improvements, and I want to make them over time.

I haven’t used LearnAPI and StatsLearnModels.jl before, but I have used MLJ a lot. It’s indeed very nice. But my observation is that newcomers find it a bit difficult.

I know there are many great packages for scientific computing in Julia. I just want to contribute to increasing the usage of Julia among students and young practitioners in various industries by showing that everything they do with packages in R and Python can be done in Julia in an easier and faster way.

I am not being presumptuous; I know this is not a one-person task. I just love doing it :slight_smile:

6 Likes

Hi, thanks for those clarifications @ilkerarslan, and for the positive feedback about MLJ. I am a co-creator of MLJ and its lead developer.

I wanted to explain a little about LearnAPI.jl, a newer project I also lead (very part-time). There we are trying to design an all-purpose, basic ML API for the future, informed by missteps and other lessons learned from implementing interfaces for over 200 models in MLJ over about five years. At some point this may replace MLJModelInterface.jl as the foundation of MLJ, but that point is some distance away.

Given your feedback, one significant feature of LearnAPI.jl may appeal to you. Unlike MLJModelInterface.jl, which is largely hidden from the general user and intended mainly for the convenience of those implementing new ML algorithms, LearnAPI is also meant to be a user-friendly, public API. So, in the MLJ of the future, you might dodge the higher-level “machine” abstraction, if that is not your cup of tea. I think you’ll find that LearnAPI.jl syntax (still a WIP) is not so different from your own.

LearnAPI.jl is not yet cast in stone, and you can give your own input at this discussion thread. A rather substantial revision of the original proposal (still a WIP) is here.

9 Likes

Hi, and many thanks for your explanations. I really like MLJ. In fact, I had been writing a book on machine learning with Julia, using MLJ throughout, until the publisher canceled it due to low sales volume.

I had a quick look at LearnAPI.jl (sorry, it is my fault for not seeing it before) and it also seems to be user-friendly. I will definitely follow it.

I really want to have a go-to industry-standard machine learning framework in Julia. I sincerely believe it would increase the usage of Julia a lot.

1 Like

Since you’re reimplementing everything from scratch, I’d be a bit worried about correctness (there’s a lot of potential for bugs); it would be good to show tests against a reference implementation.

I have followed the book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka et al. (the first 10 chapters) and compared my implementations with Scikit-Learn. If I remember correctly, except for 4-5 algorithms, NovaML proved to be at least as good as Scikit-Learn (of course, that doesn’t show it is production-ready).

My code can be seen at https://github.com/ilkerarslan/MLDLJulia.

But comparing them side by side is a very good idea. I will do that ASAP. Thanks.
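
For example, a side-by-side check could look roughly like this (a sketch assuming PythonCall.jl with scikit-learn installed, and pre-split data Xtrn, ytrn, Xtst, ytst):

using NovaML.LinearModel, NovaML.Metrics
using PythonCall

# NovaML side
lr = LogisticRegression()
lr(Xtrn, ytrn)                                   # fit
acc_nova = accuracy_score(ytst, lr(Xtst))        # predict and score

# scikit-learn reference
sklm = pyimport("sklearn.linear_model")
skmodel = sklm.LogisticRegression()
skmodel.fit(Xtrn, ytrn)
acc_sk = pyconvert(Float64, skmodel.score(Xtst, ytst))

@info "Side-by-side accuracy" acc_nova acc_sk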

4 Likes

Beautiful, these “traditional” ML algorithms continue to be extremely useful, and I’ve always missed a Julia-native alternative to sklearn.

Regarding the suggestions you asked for, here are two small things that bother me about your example:

First, classifying a decision tree as a linear model seems odd to me, because its decision boundary is neither linear nor a non-linear transformation thereof. You should consider moving this algorithm to another module.

Second, I’ve always valued that Julia APIs typically use immutable data structures, because this leads to clearer code. In your example, the tree is mutable, and the output of tree depends not only on the input arguments but also on its state. This will lead to situations where the result of some code snippet cannot be told just from looking at the snippet (e.g., if tree is produced by some other function). I strongly suggest a more immutable design, e.g.

using NovaML.Tree
tree = DecisionTreeClassifier(X, y; depth=2) # fit; use keyword arguments for hyper-parameters
ŷ = tree(X)
1 Like

Thanks. In fact, DecisionTreeClassifier and DecisionTreeRegressor are already in the Tree module. The example is from an earlier implementation and doesn’t reflect the current situation. My mistake, sorry.

About your second note: the algorithms have a "fitted" parameter. For example, after tree = DecisionTreeClassifier(), we have tree.fitted == false, and after tree(X, y) the model is fitted and tree.fitted == true. If you try to use ŷ = tree(X) while tree.fitted == false, you get a warning.

I am not sure how we can make the tree struct immutable, because it has to hold the fitted parameters (coefficients, intercepts, support vectors, etc.). At least this is the case for now. In addition, I want the structs to be flexible so the user can modify the hyperparameters.
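
In simplified form, the pattern looks like this (an illustrative sketch, not the actual NovaML code):

mutable struct ToyModel
    w::Vector{Float64}    # fitted coefficients
    fitted::Bool
end
ToyModel() = ToyModel(Float64[], false)

# Calling with (X, y) fits the model in place.
function (m::ToyModel)(X::Matrix, y::Vector)
    m.w = X \ y           # least-squares fit as a stand-in
    m.fitted = true
    return m
end

# Calling with X alone predicts, warning if the model is not fitted yet.
function (m::ToyModel)(X::Matrix)
    if !m.fitted
        @warn "Model is not fitted yet. Call model(X, y) first."
        return nothing
    end
    return X * m.w
end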

But I will seriously think about your suggestion. Thanks.

1 Like

So tree modifies its first argument X, right?
I would propose having two functions:

  1. The mutating tree!(X, y), which works like the current implementation but has a ! to signal that it mutates something.
  2. tree(X, y), which essentially does tree!(copy(X), y), so it creates a copy instead of mutating.

I think this interface would be much more Julian.
You can think about using another type to represent a fitted model but I don’t think it’s necessary per se.

No, tree is a callable object; it’s actually tree itself that gets mutated, not X.

Sorry, @ilkerarslan, the bikeshedding has begun. It’s hard to avoid with ML libraries because lots of people are interested and have opinions. :slight_smile:

For those who are interested, I already opened an issue regarding the syntax for fitting a model:

I tend to agree with @mirkobunse that immutable data structures lead to clearer code. The LearnAPI.jl Discourse thread discussed possible designs extensively. The two main options (as I see it) for a functional, immutable design are:

  • Two separate immutable types, one to hold the hyperparameters and one that contains the full fitted object. (The full fitted object would most likely also include the hyperparameters.) So fitting and predicting would look like this:
    params = RandomForestRegressorParams(n_trees=100)
    model = fit(params, X, y)
    ŷ = predict(model, Xnew)  # Or `ŷ = model(Xnew)`, as in NovaML and Flux.
    
  • Have a FittedModel wrapper type that wraps the hyperparameters and the fitted parameters. StatsLearnModels.jl uses this approach (sketched below).
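
A rough sketch of that wrapper approach (names are illustrative; this is not StatsLearnModels.jl’s actual API):

using LinearAlgebra

# Immutable struct holding only the hyperparameters.
struct ToyRidge
    lambda::Float64
end

# Immutable wrapper holding the hyperparameters plus the learned parameters.
struct FittedModel{M,P}
    model::M
    params::P
end

function fit(model::ToyRidge, X, y)
    w = (X'X + model.lambda * I) \ (X'y)  # ridge regression solution
    return FittedModel(model, w)
end

predict(fm::FittedModel, Xnew) = Xnew * fm.params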
2 Likes

Thanks a lot, and I am really happy with the bikeshedding :slight_smile: I learn a lot from all of you.

I don’t claim that the implementation I chose is the best. I thought a lot about the design before starting to write, and then picked one. For now I have opted for simplicity and user-friendliness (at least I think so), and I haven’t seen any problems so far.

But that may change as the implementations get more complex.

I will sincerely consider moving the implementation toward your suggestion along the way.

4 Likes