AutoMLPipeline.jl makes it easy to create complex ML pipeline structures.

Is there a way to automatically create a visual representation (flowchart) of the pipelines?


I’m working on it. In the meantime, you can use @pipelinex instead of @pipeline to get the expression. I know there is a package that can turn such an expression into a tree representation: try TreeView.jl if you’d like to see the tree structure of the expression.

You can try this. Install AbstractTrees and use print_tree:

using AbstractTrees
using AutoMLPipeline

print_tree(stdout,@pipelinex a |> (b |> d) + (c |> e) |> rf)
:(Pipeline(Pipeline(a, ComboPipeline(Pipeline(b, d), Pipeline(c, e))), rf))
├─ :Pipeline
├─ :(Pipeline(a, ComboPipeline(Pipeline(b, d), Pipeline(c, e))))
│  ├─ :Pipeline
│  ├─ :a
│  └─ :(ComboPipeline(Pipeline(b, d), Pipeline(c, e)))
│     ├─ :ComboPipeline
│     ├─ :(Pipeline(b, d))
│     │  ├─ :Pipeline
│     │  ├─ :b
│     │  └─ :d
│     └─ :(Pipeline(c, e))
│        ├─ :Pipeline
│        ├─ :c
│        └─ :e
└─ :rf
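
If you want an actual flowchart rather than a text tree, one option is to walk the same expression and emit Graphviz DOT. The following is only a rough sketch using Base Julia: `walk!` and `to_dot` are hypothetical helper names (not part of AutoMLPipeline or AbstractTrees), and the expression is typed in by hand from the output above rather than produced by @pipelinex.

```julia
# Sketch: convert a nested Pipeline/ComboPipeline call expression to Graphviz DOT.
# `walk!` and `to_dot` are illustrative helpers, not part of any package.

function walk!(lines::Vector{String}, ex, counter::Ref{Int})
    id = (counter[] += 1)                 # unique node id for this subtree
    if ex isa Expr && ex.head == :call
        # ex.args[1] is the constructor name (e.g. :Pipeline); the rest are children
        push!(lines, "  n$id [label=\"$(ex.args[1])\", shape=box];")
        for child in ex.args[2:end]
            cid = walk!(lines, child, counter)
            push!(lines, "  n$id -> n$cid;")
        end
    else
        # leaf node: a symbol such as :a or :rf
        push!(lines, "  n$id [label=\"$ex\"];")
    end
    return id
end

function to_dot(ex)
    lines = ["digraph pipeline {"]
    walk!(lines, ex, Ref(0))
    push!(lines, "}")
    return join(lines, "\n")
end

# Expression copied from the print_tree output above
ex = :(Pipeline(Pipeline(a, ComboPipeline(Pipeline(b, d), Pipeline(c, e))), rf))
println(to_dot(ex))
```

Saving the printed text to a file such as pipeline.dot and running `dot -Tpng pipeline.dot -o pipeline.png` (assuming Graphviz is installed) should render the diagram.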

Here’s an update on the latest feature of the @pipeline macro in AutoMLPipeline. Besides the |> and + operators, which build Linear and Combo Pipelines, you can now use * to build a Selector Pipeline that picks the best ML learner. Here’s an example:

julia> pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> (jrf * ada * sgd * tree * lsvc)
julia> crossvalidate(pcmc,X,Y,"accuracy_score",10)
(mean = 0.7276977412403225, std = 0.033181493759015454, folds = 10)

The Selector Pipeline performs internal cross-validation among the learners: jrf, ada, sgd, tree, lsvc. It then uses the prediction of the best learner as its final output.

The typical, much longer workflow to pick the best learner and use its output would be:

julia> learners = DataFrame()
julia> for learner in [jrf,ada,sgd,tree,lsvc]
         pcmc = @pipeline disc |> ((catf |> ohe) + (numf |> std)) |> learner
         println(learner.name)
         mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
         global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
       end;
julia> @show learners;
learners = 5×3 DataFrame
│ Row │ name                   │ mean     │ sd        │
│     │ String                 │ Float64  │ Float64   │
├─────┼────────────────────────┼──────────┼───────────┤
│ 1   │ rf_k2d                 │ 0.684652 │ 0.0334061 │
│ 2   │ AdaBoostClassifier_1rk │ 0.698086 │ 0.0576059 │
│ 3   │ SGDClassifier_2xI      │ 0.715688 │ 0.0452629 │
│ 4   │ prunetree_pSa          │ 0.578826 │ 0.0459255 │
│ 5   │ LinearSVC_39A          │ 0.730508 │ 0.0494756 │

Based on these results, a user would choose Linear SVC because it has the best mean accuracy (73.05%). The Selector Pipeline also used Linear SVC, achieving nearly the same accuracy (72.77%) automatically.
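
The manual selection step amounts to taking the learner with the highest mean accuracy. As a small illustration using only Base Julia, with the names and means copied from the results in this post (the variable names are my own):

```julia
# Mean cross-validation accuracies copied from the results in this post
learner_names = ["rf_k2d", "AdaBoostClassifier_1rk", "SGDClassifier_2xI",
                 "prunetree_pSa", "LinearSVC_39A"]
means = [0.684652, 0.698086, 0.715688, 0.578826, 0.730508]

best = argmax(means)   # index of the highest mean accuracy
println(learner_names[best], " => ", means[best])
```

With the DataFrame from the loop, the equivalent lookup would be something like `learners[argmax(learners.mean), :]`.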
