My issues aren’t with confederated codebases; I’m fine and good with that. That structure is highly amenable to Julia in general and gives some breathing room. I’m not convinced we can get all the advantages we might want going this route 100%, but that’s a debatable nuance not worth poking at…
I’m not here to rain on MLJ. I misdirected my frustrations at it, and that wasn’t fair. I’m sorry.
But as promised, I’m writing up a list of my gripes with sklearn. Please feel free to shoot down the grammar, logical errors, and formatting; I don’t have a lot of time. Here are some for starters:
- Cross-validation quickly devolves into memorizing an API and accessing tons of weird custom kwargs in dicts. Nested cross-validations are sloppy and unintuitive, leading lay people to avoid them and improperly validate model parameters (I see it all the time). This could easily be improved in Julia; I favor the iterator approach (see the sketch below), but hell, do as you please.
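Something like this is what I have in mind by the iterator approach. To be clear, `kfolds` is a name I just made up for illustration, not anyone's API:

```julia
# Folds as a plain iterator of (train, test) index pairs.
function kfolds(n::Int, k::Int)
    bounds = round.(Int, range(0, n; length = k + 1))   # fold boundaries
    ((vcat(1:bounds[i], bounds[i+1]+1:n),               # train indices
      bounds[i]+1:bounds[i+1])                          # test indices
     for i in 1:k)
end

# Nested cross-validation is then literal nesting, no kwarg archaeology:
for (outer_train, outer_test) in kfolds(100, 5)
    for (inner_train, inner_val) in kfolds(length(outer_train), 5)
        # tune on outer_train[inner_train], validate on outer_train[inner_val]
    end
    # evaluate the chosen params on outer_test
end
```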
My worry with MLJ is that what we see is lots of macro calls that lock code/model building into inflexible, non-generic paradigms. Sure, it’s snappy, but people who know what they want look at something like that and end up trying to hack around it, or just go rogue after realizing an API won’t satisfy their needs without serious effort…
On the flip side, macros could be used for model inspection! Ever leave work on Friday to train a model over the weekend, only to have it explode 8 hours in? What if you could evaluate a chunk of code that does none (or only minimal parts) of the math, just to sort out indexing bugs and the like? A trivial use case, but there are times where this sort of thing could be helpful, and AFAIK it can’t be done in Python.
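Here’s a toy of what I mean. `@shapecheck` is invented on the spot and only handles `*`, but the point stands: Julia lets you rewrite the code itself, so shape/indexing bugs surface in seconds instead of 8 hours in:

```julia
# Rewrite matrix multiplies into dimension checks plus dummy outputs,
# so no actual math runs. A toy sketch, nothing more.
macro shapecheck(ex)
    rewrite(e) = e
    function rewrite(e::Expr)
        if e.head == :call && e.args[1] == :* && length(e.args) == 3
            a, b = rewrite(e.args[2]), rewrite(e.args[3])
            return quote
                let A = $a, B = $b
                    size(A, 2) == size(B, 1) ||
                        error("shape mismatch: ", size(A), " * ", size(B))
                    zeros(size(A, 1), size(B, 2))   # placeholder result
                end
            end
        end
        Expr(e.head, map(rewrite, e.args)...)
    end
    esc(rewrite(ex))
end

W, x = rand(3, 4), rand(5)    # incompatible on purpose
@shapecheck W * x             # errors immediately: shape mismatch
```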
- Unification of tools… I could go on for days about how they implemented things in ways that defy the elegance of the obvious pieces of theory in the field. Here’s a small example: sklearn.cluster.MiniBatchKMeans vs. sklearn.cluster.KMeans; well, why not an SGD KMeans? The difference is that one is an online learning algorithm. The notions of offline and online algorithms could be cleanly broken out, and code reuse could happen (not necessarily here, but in other cases). Say we want to do PLS regression: performing online PLS involves a subset of the PLS calculation, and algorithms for other models are surely similar. So why make a bunch of separate methods for it? Worse, what if they lived in different packages, one had an error and the other didn’t? Confusing for an end user if maintenance debacles happen.
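To sketch the k-means case (the function and keywords below are invented, not sklearn’s or anyone else’s API): one stochastic update rule in the spirit of Sculley’s mini-batch scheme, where "offline" just means passing the whole dataset each epoch. One method instead of two separately maintained classes:

```julia
using LinearAlgebra, Random

function kmeans!(centers, X; batchsize = size(X, 2), epochs = 50)
    d, n = size(X)
    k = size(centers, 2)
    counts = zeros(Int, k)                     # per-center update counts
    for _ in 1:epochs
        batch = batchsize == n ? (1:n) : rand(1:n, batchsize)
        for j in batch
            x = view(X, :, j)
            c = argmin([norm(x - view(centers, :, m)) for m in 1:k])
            counts[c] += 1
            η = 1 / counts[c]                  # decaying per-center rate
            centers[:, c] .= (1 - η) .* view(centers, :, c) .+ η .* x
        end
    end
    centers
end

X = rand(2, 500)
offline = kmeans!(X[:, 1:3], X)                  # full pass each epoch
online  = kmeans!(X[:, 1:3], X; batchsize = 32)  # mini-batch / streaming
```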
I admit this is somewhat OCD. Who cares how sloppy a codebase is, or how far it sprawls, unless you’re trying to sort out a bug or vet it for industrial/certified usage (some companies won’t use packages unless they’ve been internally vetted)? For people just hacking away, it’s no big deal if everything goes as planned. But Julia code can be beautiful, and it can elegantly link theory to practice. Python can’t really have that, and we need to tap into it, in my opinion…
A lot of models are simple compositions of transformations and other models. What you often see in sklearn is that these aren’t handled that way whatsoever. Many models could be represented beautifully as a DAG and treated that way internally. This is valuable for a lot of industries/workflows, whether they know it now or not… Many of those methods aren’t even in sklearn, last I checked; probably for this reason…
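To make that concrete, a hedged sketch (types invented for illustration, not any package’s API): each stage is a fit step plus a callable transform, and the composed pipeline is itself just a function. A DAG is the same idea with more edges:

```julia
using Statistics, LinearAlgebra

struct Center; μ::Vector{Float64}; end
fit(::Type{Center}, X) = Center(vec(mean(X, dims = 2)))
(t::Center)(X) = X .- t.μ                 # transform: subtract column means

struct LinReg; W::Matrix{Float64}; end
fit(::Type{LinReg}, X, y) = LinReg(y * pinv(X))   # least squares
(m::LinReg)(X) = m.W * X

X, y = rand(3, 50), rand(1, 50)
c = fit(Center, X)
m = fit(LinReg, c(X), y)
pipeline = m ∘ c          # the composed model is itself just a function
ŷ = pipeline(X)
```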
- 2 (or N) language problem: scikit-learn frequently calls down to C/C++ to perform operations. Who cares? I do. I care that I have to track down a .cpp file, read the code, pull open a C++ editor, and debug machine learning tweaks in C++; it’s laborious and error-prone because I don’t write C++ for a living. I also care because you end up with multiple functions doing effectively the same things, since there isn’t continuity in the codebase. I don’t have a specific example because I haven’t grokked sklearn’s code in a while, because… have you looked at it? It’s tens of thousands of lines.
Leveraging sklearn in Julia is nice, sure, but it’s a bandaid. Besides, so many of these algorithms, when written in Julia, are differentiable, meaning I can hack away in Flux to add penalties to base methods. I had a Bayes method running at my last job doing exactly this. Why? Because it worked well and was a super simple tweak that improved performance. But I had to write the algorithm entirely in Julia to get that benefit, or manually write backprop rules into someone else’s code (not gonna happen).
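A minimal sketch of that point (the penalty and numbers here are invented, not the actual method from that job): because the loss is pure Julia, Flux differentiates through the base method and the bolted-on penalty alike, no hand-written backprop rules:

```julia
using Flux

X, y = rand(5, 100), rand(1, 100)
W = rand(1, 5)

base_loss(W) = sum(abs2, W * X .- y) / size(X, 2)   # plain least squares
penalty(W)   = 0.1 * sum(abs, W)                    # e.g. a quick L1 tweak
loss(W)      = base_loss(W) + penalty(W)

for _ in 1:200
    g = Flux.gradient(loss, W)[1]   # AD through base method + penalty
    W .-= 0.01 .* g
end
```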
- Parallelism: sklearn has some methods that can leverage multiple cores, and others that cannot. That sucks, plain and simple. Julia can embed parallelism without doing shifty things! We gotta showcase this (quick sketch below).
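For instance (with `score_fold` as a made-up stand-in for any fit-and-evaluate routine), parallelizing per-fold work is one macro:

```julia
using Base.Threads

score_fold(k) = (sleep(0.1); rand())   # stand-in for fit + evaluate on fold k

scores = Vector{Float64}(undef, 10)
@threads for k in 1:10                 # run with `julia -t auto` to use cores
    scores[k] = score_fold(k)          # each fold runs on its own thread
end
```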
- Modularity: the sklearn codebase is a pile of Python glue. It’s doing too many things. The data ops should be separate from the modelling things, and the tuning ops should be separate as well, especially now that tuning is widely considered a model in its own right. Yes, that’s taste, but by enforcing it you typically get better separation of concerns and more generic code.
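A quick sketch of what I mean by tuning-as-a-model (everything here is invented for illustration): the tuner wraps candidate models and exposes the same `fit` interface, so tuning code doesn’t have to live inside the modelling code:

```julia
using LinearAlgebra

struct Ridge; λ::Float64; end
struct FittedRidge; W::Matrix{Float64}; end
fit(m::Ridge, X, y) = FittedRidge(y * X' * inv(X * X' + m.λ * I))
predict(f::FittedRidge, X) = f.W * X

struct Tuned{M}; candidates::Vector{M}; end
function fit(t::Tuned, X, y)
    # Naive "tuning": pick the candidate with the best training loss.
    # Real code would cross-validate; the shape of the API is the point.
    losses = [sum(abs2, predict(fit(m, X, y), X) .- y) for m in t.candidates]
    fit(t.candidates[argmin(losses)], X, y)
end

X, y = rand(3, 50), rand(1, 50)
fitted = fit(Tuned([Ridge(λ) for λ in (0.01, 0.1, 1.0)]), X, y)
```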
So yeah, that’s some off-the-cuff rambling about superficial issues I have with sklearn. I think anyone who has used it for anything that wasn’t a Kaggle tutorial has had to hack against its internals to get something done.