Pipeline: from raw data to fitted model?



Hello everyone!
I’m a novice in the machine learning domain. I’m interested in a basic ML pipeline: from a raw dataframe with missing values and so on, to the trained model.
Is there any unified way of pipelining such components as scalers, imputers, k-fold cross-validation, one-hot encoding, ensembles of algorithms, and grid search?
For instance, there is a Pipeline class in sklearn which does the job. Is there anything similar in Julia?



So far there is no unified process. MLJ.jl tries to solve this problem, but it is still at an early stage. If you are familiar with scikit-learn, you should have a look at ScikitLearn.jl, which provides a similar interface and supports some algorithms implemented in pure Julia as well as some implemented in Python.
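As a rough sketch of what that looks like in practice (assuming ScikitLearn.jl is installed, and that PyCall with Python's scikit-learn is available for the imported models; `X` and `y` here are placeholder toy data):

```julia
using ScikitLearn
using ScikitLearn.Pipelines: Pipeline
@sk_import preprocessing: StandardScaler
@sk_import linear_model: LogisticRegression

# Placeholder data: 100 observations, 3 features, binary labels.
X = rand(100, 3)
y = rand([0, 1], 100)

# Chain a scaler and a classifier, then fit the whole pipeline at once,
# much like sklearn.pipeline.Pipeline in Python.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
fit!(pipe, X, y)
ŷ = predict(pipe, X)
```

The step names ("scaler", "clf") are arbitrary labels, as in sklearn; swap in any estimator the package exposes.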


If you’re a novice, it may be worth noting that in any language you are unlikely to find a completely unified approach to ML pipelines that readily incorporates any sort of model you might throw at it. I’m assuming that kind of flexibility is what you’re aiming for when you mention something like ensemble learning. There are a lot of reasons for this. The first that comes to mind is that not all machine learning methods accept dataframes: some data requires large multidimensional arrays (images), or combinations of multiple types of data.

This is why Julia is a good place for ML: you really need a lot of little tools that work together without sacrificing performance. The best way to get started is to do some tutorials to familiarize yourself with the tools; I think JuliaComputing on GitHub has some.


If you could just take different packages as building blocks and seamlessly feed the output of one function into another, there would be no problem at all. But each package has its own types, even for the simplest “categorical” variables, and when you try to pass a structure smoothly from one package to another, the problems begin:
“No, I don’t work with THAT type, give me another.”
“No, I have my own categorical variable type, please encode it in this new format.”
“This is a DataFrame, but I want an array.”
“No, this is an array, and I work only with dataframes!”
And so on. You literally spend a lot of time on this reinterpretation of the data.
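A small Base-Julia illustration of the kind of friction meant here (no packages needed): the moment a column contains `missing`, its element type changes, and many numeric routines refuse it until you explicitly impute or filter:

```julia
v = [1.0, missing, 3.0]        # eltype is Union{Missing, Float64}
sum(v)                          # returns missing, not a number

# One crude fix: impute with a constant before handing the column
# to a package that expects plain Float64s.
imputed = coalesce.(v, 0.0)     # Vector{Float64}: [1.0, 0.0, 3.0]
sum(imputed)                    # 4.0

# Another: drop the missings entirely.
dropped = collect(skipmissing(v))   # [1.0, 3.0]
```

Every package boundary in a pipeline can force a decision like this, which is where the time goes.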


Yes, but it’s not hopeless. In particular, there is (some) hope of converging towards a global overarching table API (Tables.jl), with specific packages (e.g. Clustering.jl) eventually taking a generic table as input and doing the conversion internally if necessary for performance.
This is incidentally what MLJ is trying to do, though at the moment it still has to do local conversions before feeding data to specific packages.
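A sketch of what that generic-table idea looks like today (assuming Tables.jl is installed): any Tables.jl-compatible source, even a plain NamedTuple of column vectors, can be converted to whatever layout a consumer wants:

```julia
using Tables

# A named tuple of vectors is already a valid Tables.jl table.
tbl = (height = [1.6, 1.8, 1.7], weight = [60.0, 80.0, 70.0])

Tables.istable(tbl)             # true
M = Tables.matrix(tbl)          # 3×2 Matrix{Float64}, columns in table order
cols = Tables.columntable(tbl)  # back to a named tuple of columns
```

A package that accepts "any Tables.jl table" can consume DataFrames, CSV files, named tuples, and so on, without the caller doing the conversion by hand.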

What remains true, though, is that one way or another there may be a fair few conversions in the course of a pipeline, plus inefficiencies due to working with transposes: most data providers are n × p (observations as rows), while most packages want p × n and incur overhead from working with the transpose. When you chain such operations, you pay multiple times. Maybe in the future the ideal format to convert data to at the beginning of a pipeline could be inferred, but that’s likely to be tricky.
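To make the transpose point concrete in Base Julia (no packages needed): Julia arrays are column-major, so a lazy `transpose` costs nothing up front but gives cache-unfriendly access along what used to be rows, while `permutedims` pays for one copy to get the fast layout:

```julia
X = rand(1000, 10)    # n × p: observations as rows, as most data sources provide

Xl = transpose(X)     # lazy wrapper: no copy, but the columns of Xl are the
                      # rows of X, so scanning them strides across memory
Xm = permutedims(X)   # materialized 10 × 1000 copy in true column-major order

Xl == Xm              # same values; only the memory layout differs
```

Whether the one-off copy or the strided access is cheaper depends on how many times the pipeline reads the data afterwards, which is exactly why chaining conversions gets expensive.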


You could always use scikit-learn as you start off, but I quickly found limits to what I could do (in any ML framework I’ve used, really).

A convenient interface often comes at the price of restricting which tools you can use. We may eventually solve the handling of categorical data, but there will always be a new type of data to deal with. Python, R, and MATLAB have these same issues.

It’s best to just choose an approach and chug along while you gain an intuition for how these general modeling methods relate to your data, because there will always be changes to pipelines and frameworks. And when you get stuck, just come and ask people here for specifics. It’s a very supportive community.


Thanks for the replies!