AutoMLPipeline.jl makes it easy to create complex ML pipeline structures

As part of educating myself on macro programming, I tried to apply the idea of manipulating pipeline expressions so that it becomes trivial to perform feature extraction, feature selection, and pipeline structure optimization with a one-liner pipeline expression. Happy to share this newly-released package:

https://github.com/IBM/AutoMLPipeline.jl.

Let me know your thoughts. I’m mostly using scikit-learn transformers and learners, but will try to support R’s caret in the future. I’m surprised by how much a simple recursive macro function can do to make the symbolic manipulation of expressions trivial.

7 Likes

Was this to make a Julia library, or did you see some deficiencies in the Python system?

Since Julia doesn’t have rich machine learning libraries yet, I find it useful to use the scikit-learn libraries as building blocks for ML pipelines. Julia serves as a macro processor for these pipeline expressions before feeding them to the just-in-time compiler. scikit-learn’s pipeline functions do not support symbolic expressions, which makes them hard to understand once you have more complicated pipeline workflows.

1 Like

Have you considered MLJ?

We may be interested in porting some of the functionalities you’ve developed, particularly for feature engineering / pre-processing, if you’re interested in helping out, let us know!

(and if you see possible improvements to MLJ’s approach to pipelines, let us know too, feedback is always appreciated)

5 Likes

Definitely. I think there is a lot of overlap in our work ;). If we share a common abstract type hierarchy, with common fit! and transform! signatures, we can trivially merge many structures from both. I could port the @pipeline expression to your code, maybe?

Currently I have the following abstract types:

abstract type Machine end

abstract type Computer <: Machine end # does computation: learner and transformer

abstract type Workflow <: Machine end # types: Linear pipeline vs Feature Union pipeline

abstract type Learner <: Computer end # any model that expects input and output to learn

abstract type Transformer <: Computer end # any model that expects input

### multiple dispatch for fit! and transform!
function fit!(mc::Machine, input::DataFrame, output::Vector)
  error(typeof(mc), " not implemented")
end

function transform!(mc::Machine, input::DataFrame)
  error(typeof(mc), " not implemented")
end

### dynamic dispatch based on Machine subtypes
function fit_transform!(mc::Machine, input::DataFrame, output::Vector=Vector())
  fit!(mc, input, output)
  transform!(mc, input)
end
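To make the dispatch concrete, here is a minimal sketch of a transformer plugged into this interface. The Scaler type and its fields are hypothetical (not part of the package); the abstract types and fit_transform! are restated so the snippet stands alone:

```julia
using DataFrames
using Statistics: mean, std

# Restated so the snippet stands alone; mirrors the hierarchy above.
abstract type Machine end
abstract type Transformer <: Machine end

# Hypothetical transformer: a standard scaler that learns per-column
# means and standard deviations in fit! and applies them in transform!.
mutable struct Scaler <: Transformer
  means::Vector{Float64}
  stds::Vector{Float64}
  Scaler() = new(Float64[], Float64[])
end

function fit!(sc::Scaler, input::DataFrame, output::Vector=Vector())
  sc.means = [mean(c) for c in eachcol(input)]
  sc.stds  = [std(c) for c in eachcol(input)]
  return sc
end

function transform!(sc::Scaler, input::DataFrame)
  DataFrame([(c .- m) ./ s for (c, m, s) in
             zip(eachcol(input), sc.means, sc.stds)],
            names(input))
end

function fit_transform!(mc::Machine, input::DataFrame, output::Vector=Vector())
  fit!(mc, input, output)
  transform!(mc, input)
end
```

For example, fit_transform!(Scaler(), DataFrame(a=[1.0, 2.0, 3.0])) yields the standardized column [-1.0, 0.0, 1.0].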

We already have all this in some form, including machines, transformers, pipelines, and more complex DAGs of operations (see MLJBase), but I’m interested in seeing how you chain operations. We have a syntax which approaches what you have (I think), but I’d need to look at your code in more detail to see whether there are things we could take inspiration from.

Having a neat way to quickly specify a bunch of preprocessing steps without cumbersome syntax, generate new features, etc. would be great, and some of what you’re doing may go in that direction afaict.

Edit: if you have the time it would be great to have your opinion on https://alan-turing-institute.github.io/MLJTutorials/getting-started/composing-models/ and let us know whether you think there are key things missing that your code would provide

1 Like

this is the entire code for the pipeline expression:

# Walk the parsed expression tree, replacing the operator symbols
# |> and + with the pipeline constructor names.
function processexpr(args)
  for ndx in eachindex(args)
    if args[ndx] isa Expr
      processexpr(args[ndx].args)
    elseif args[ndx] == :(|>)
      args[ndx] = :Pipeline
    elseif args[ndx] == :+
      args[ndx] = :ComboPipeline
    end
  end
  return args
end

macro pipeline(expr)
  lexpr = :($(esc(expr)))
  processexpr(lexpr.args)  # rewrites the tree in place
  lexpr
end
1 Like

It’s just a hack on Julia’s parsed expression. The + has higher precedence than the |> operation, so if you replace + with the feature-union pipeline and |> with the linear pipeline inside the macro function before evaluation, it works out.
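The precedence claim is easy to check directly against the parser: because + binds tighter than |>, the |> call ends up at the top of the expression tree, with the + grouped underneath it.

```julia
# Julia's parser binds + tighter than |>, so in `a + b |> c`
# the top-level call is |> and the + sub-expression is grouped first.
ex = Meta.parse("a + b |> c")
ex.head       # :call
ex.args[1]    # :(|>)    -- the top-level operator
ex.args[2]    # :(a + b) -- the + grouped underneath
ex.args[3]    # :c
```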

@tlienart,

what about this:

function processexpr(args)
  for ndx in eachindex(args)
    if typeof(args[ndx]) == Expr
      processexpr(args[ndx].args)
    elseif args[ndx] == :+
      args[ndx] = :MyPipe
    end
  end
  return args
end

macro pipeline(expr)
  lexpr = :($(esc(expr)))
  res = processexpr(lexpr.args)
  lexpr.args = res
  lexpr
end

@macroexpand @pipeline x + hot + knn + target
# ans = :(MyPipe(x, hot, knn, target))

@ppalmes it is good to see other IBMers using Julia. I am also contributing to MLJ.jl, and it would definitely be interesting to unite efforts as opposed to creating something in parallel.

3 Likes

Yes, that could be interesting, but I should be a bit clearer: our syntax for pipelines and learning networks already allows all of this if I’m not mistaken, and is not much more complicated. Additionally (and very importantly), it gives names to operations, which is necessary for hyperparameter tuning.

When I initially looked at your package I thought about something a bit different: a syntax to allow the definition of new features based on an initial table. This is a very common workflow, and while it is already possible to do it with MLJ, I feel there could be a way (possibly a macro, but not necessarily) to do this more easily for a range of simple cases.

The process as I imagine it would be to go from X -> X' where X' has some or all of X’s columns as well as “derived” features. A trivial example: X has three columns x, y, z, and we want X' with six columns x, y, z, x^2, y^2, z^2. Of course you may want to do this with more complicated functions / combinations, but this seems to me like a very common workflow. Having a way to neatly define how to get these derived features and feed the lot to a learning network or pipeline would be great.

In a way something similar to StatsModels.jl’s @formula though probably more geared to explicitly extend the feature matrix.
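A sketch of that trivial squaring example, written as a plain DataFrames function (the name add_squares is mine, not from either package):

```julia
using DataFrames

# Extend a table with the square of each existing column,
# appended under a "_sq"-suffixed name.
function add_squares(X::DataFrame)
  X2 = copy(X)
  for c in names(X)
    X2[!, c * "_sq"] = X[!, c] .^ 2
  end
  return X2
end

X  = DataFrame(x=[1.0, 2.0], y=[3.0, 4.0], z=[5.0, 6.0])
Xp = add_squares(X)  # six columns: x, y, z, x_sq, y_sq, z_sq
```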

1 Like

In my case, I just created two transformers, a categorical filter and a numerical filter. Both implement fit! and transform!, so if you place them in the pipeline they will select the categorical or numerical columns. With a linear pipeline and a feature-union pipeline, you can then use them as filters before feeding their output to OHE or PCA, and then use the feature union to combine both outputs. As long as the pipeline, which is also a transformer, implements fit and transform, it is just a matter of iterating fit and transform over the pipeline, which in turn calls fit and transform on its elements, which are transformers too.
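The two column filters described above can be sketched in a few lines. The names NumFilter and CatFilter are hypothetical, not the package’s actual type names:

```julia
using DataFrames

# Hypothetical column filters: transform! keeps only the numeric
# (or only the non-numeric) columns, so downstream elements such as
# PCA or one-hot encoding see the right subset of the table.
struct NumFilter end
struct CatFilter end

# Filters have nothing to learn, so fit! is a no-op.
fit!(::Union{NumFilter,CatFilter}, X::DataFrame, y::Vector=[]) = nothing

transform!(::NumFilter, X::DataFrame) =
  X[:, [c for c in names(X) if eltype(X[!, c]) <: Number]]

transform!(::CatFilter, X::DataFrame) =
  X[:, [c for c in names(X) if !(eltype(X[!, c]) <: Number)]]
```

With a table holding both numeric and string columns, NumFilter keeps the numeric ones and CatFilter keeps the rest.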

Definitely. We need to grow the machine learning packages in Julia. If we can standardize the interface such that it becomes trivial to extend, we don’t need to centralize the effort. I’m doing my package to learn more about the language by starting from scratch, but I can use this experience to contribute to future Julia packaging efforts.

These interfaces are being standardized in MLJ.jl, that is what I meant. For example, the MLJModelInterface.jl package defines the interface for learning models, and we are already consuming it in other packages. Other interfaces are being defined under the MLJ.jl umbrella, and it would be nice to have more eyes there as opposed to new interfaces like the one you are suggesting. I agree that implementations should live in separate packages, all of them implementing a common interface (OBS: currently the most agreed-upon interface in Julia is MLJ.jl’s).

3 Likes

My big realization after developing CombineML.jl and TSML.jl is that the pipeline should be the core interface, not the models themselves. The pipeline is like an array container with elements such as filters, transformers, etc. I think we should design it similar to the AbstractArray approach. I looked at the MLJ design and it looks quite complex. It’s not easy to hack, but maybe I need time to familiarize myself with it. I really like the idea of having a symbolic expression to describe the pipeline; it makes it way easier to develop algorithms for optimal pipeline discovery.

1 Like

I disagree with this viewpoint. There should be an interface for models and an interface for pipelines. They may share verbs, but conceptually you can do much more with learning models than what you can do with pipelines. It is nice when they share verbs so that you operate seamlessly, but I would certainly miss specific things about learning models if the only API they had was that of pipelines. For example, there are many traits that one can ask about the kind of model (probabilistic, deterministic, supervised, unsupervised, and so on). A pipeline with a simple transformer does not possess such specific details, and these details are important to write generic code and extend the theory.

1 Like

It’s OK to disagree ;). I developed mine to learn Julia; if anyone likes to use it, I’ll be glad. Maybe in the future I’ll see your point, but right now I’m much more focused on pipeline optimization than on the other elements treated individually. I want it to be simple and understandable for my own consumption.

4 Likes

Fully agree that it is fine to disagree :slight_smile: We are constantly changing our minds with time; it is part of the learning.

1 Like

Cool package! I’m not really an AutoMLer, but one thing I often find useful for certain workflows is reversibility of pipelines.

I took a crack at this in my package. I wouldn’t call it a finished effort, but it may be interesting (or not) to any of the other packages with their own ML pipeline machinery, allowing functions and their inverses to be part of a preprocessing chain.


https://caseykneale.github.io/ChemometricsTools.jl/dev/Demos/Pipelines/

One place most people run into this is “scaling” of, say, Y values to get their original values back out. For most cases that’s trivial, but it also happens in other cases I deal with, and errors there can lead to bad conclusions (a good model looks bad, a bad model looks okay/good). That’s why I put it in there.

Also, I really appreciate your opinions on MLJ as well. I am not here to knock it, but it is overly complex to me and I cannot fathom exactly why that is. I think that’s why I am appreciating your package quite a bit: I can read the code so far, and it just makes sense.

Also, I do agree the pipeline is the essence of the tool here (especially for you). Have you looked into Transducers.jl? There’s some really inspiring stuff there that I keep wanting to leverage, but keep running away from.

1 Like

Thanks for your feedback @ckneale. Except for the learners, the transformer operations are reversible because they store their scaling factors. I haven’t implemented it because I don’t see the need yet, but it is quite trivial. I think if you aim for explainability, you may need the original values.

I’m pretty sure you will understand the package because most of the files follow similar patterns: create a structure which is a subtype of Learner or Transformer, then define fit! and transform! functions for your new structure. It can then be part of the pipeline because it implements the two interfaces needed. Pipelines, filters, transformers, and learners all implement fit! and transform!. When you call fit! on a pipeline, it iteratively calls fit! and transform! on its elements, passing the output of transform! to the next element. If an element happens to be a learner, it receives both the output from the previous element and the y-target from the pipeline argument, so that the learner can learn the mapping. It is not that different from push and pop operations for queues: just simple interfaces, each doing one thing only, similar to the KISS philosophy of Unix, which first introduced pipes in the shell. Make each element do one thing and do it well, and don’t worry about the other pieces; they will function by doing their own thing as long as they follow exactly the action they are intended to do.
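The iteration described above can be sketched in a few lines. The type names here (LinearPipeline, Square) are hypothetical stand-ins, not the package’s actual types:

```julia
using DataFrames

# Sketch of a linear pipeline: fit! each element, transform!, and hand
# the result to the next element; learners receive the y-target too.
struct LinearPipeline
  elements::Vector{Any}
end

function fit_transform!(pl::LinearPipeline, X::DataFrame, y::Vector=[])
  current = X
  for el in pl.elements
    fit!(el, current, y)               # learners use y; transformers ignore it
    current = transform!(el, current)  # output feeds the next element
  end
  return current
end

# A trivial transformer with nothing to learn, so the sketch runs end to end.
struct Square end
fit!(::Square, X::DataFrame, y::Vector=[]) = nothing
transform!(::Square, X::DataFrame) = mapcols(c -> c .^ 2, X)
```

Chaining two Square elements, fit_transform!(LinearPipeline([Square(), Square()]), DataFrame(a=[1.0, 2.0])) raises each column to the fourth power, giving a = [1.0, 16.0].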

you can look at the test directory for some sample usage.

1 Like