ANN: NaiveNASlib, NaiveNASflux, NaiveGAflux: Tools for network surgery and neural architecture search

Hi All,

I have made a trio of packages for neural architecture search, mostly because I fell in love with the language and couldn’t think of anything more useful to do which wasn’t already done (and maybe a little bit because I’m oddly amused by the thought of a computer doing crazy stuff while I sleep).

The first member is NaiveNASlib.

Despite NAS being part of the name, it does not do any architecture search. In fact, it does not even do neural networks (by itself). What it does do (and does quite well I must say) is figure out how to modify the structure of a neural network so that all matrix dimensions stay consistent with each other. It also (to the largest extent possible) ensures that output neurons from one layer stay connected to the same input neurons of the next layer as before a size change.

This perhaps does not sound very impressive when thinking about simple architectures which just stack layers on top of each other. However, things get out of hand quickly when one adds elementwise operations (e.g. as used in residual connections) and concatenations (e.g. as used in inception modules) to the mix.

Here is the list of supported operations:

- Change the input/output size of vertices
- Parameter pruning/insertion (policy excluded)
- Add vertices to the graph
- Remove vertices from the graph
- Add edges to a vertex
- Remove edges to a vertex

These operations are sometimes useful even outside of the NAS context, for example when doing transfer learning or network compression (although I guess one can argue that these are some form of lightweight NAS).

Perhaps interesting to note is that all the Python libraries I peeked at seemed to do this by recursively traversing the computation graph, propagating the size changes to each visited vertex. Maybe it is because I suck at graphs (and at reading Python code, f**n write-only-language), but (despite some pretty deep level of sophistication) I could not get this approach to survive more than a couple of generations in the hands of an aggressive genetic algorithm.

Luckily enough my new friend JuMP (and Cbc) came to the rescue and after formulating the size change and pruning task as a MILP problem I did not see any more failures. As a bonus, it allowed me to radically shrink the code base as well. I have not seen this approach being used anywhere else so I guess this might be the only novel contribution of this work.
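To give a feel for what "formulating the size change as a MILP problem" can look like, here is a minimal toy sketch using JuMP and Cbc. It is not the formulation NaiveNASlib actually uses; the three-vertex graph, the variable names and the "change sizes as little as possible" objective are assumptions made purely for illustration.

```julia
using JuMP, Cbc

# Hypothetical toy graph: the outputs of a and b are concatenated and then added
# elementwise to the output of c, so nout(a) + nout(b) must equal nout(c).
current = (a = 4, b = 6, c = 10)
target_c = 8   # requested new output size of c

model = Model(Cbc.Optimizer)
set_silent(model)

# Integer output sizes, at least one neuron per vertex.
@variable(model, na >= 1, Int)
@variable(model, nb >= 1, Int)
# Auxiliary variables for the absolute deviation from the current sizes.
@variable(model, da >= 0)
@variable(model, db >= 0)

# Size consistency imposed by concatenation followed by elementwise addition.
@constraint(model, na + nb == target_c)
# Linearized |na - current.a| and |nb - current.b|.
@constraint(model, da >= na - current.a)
@constraint(model, da >= current.a - na)
@constraint(model, db >= nb - current.b)
@constraint(model, db >= current.b - nb)

# Change the existing sizes as little as possible.
@objective(model, Min, da + db)
optimize!(model)

new_sizes = (a = round(Int, value(na)), b = round(Int, value(nb)), c = target_c)
println(new_sizes)
```

Several size combinations are equally good here, so which one the solver returns is not fixed. The point is only that every constraint is considered at the same time instead of being propagated vertex by vertex.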

The next member is NaiveNASflux.

This package is simply dressing up the layers defined in Flux to use with the mutation capabilities of NaiveNASlib. It knows things like how to remove/add inputs to each of the layers defined in Flux through what is just a manual annotation of what dimensions of the weight arrays are input and output. It also does some generic but still implementation dependent stuff like hooking into gradients to compute pruning metrics.

The last member is NaiveGAflux.

This one actually does neural architecture search (despite ironically not having NAS in the name) including the obligatory model = fit(data) gimmick. It does so through genetic optimization (a.k.a the bogosort of optimization) so it might not be the best thing one can make out of the other two packages.

I originally intended to use this package only to do usability and mutation testing (yeah I know this usually means something else) of the other two and didn’t even think I would release it. After a while it dawned on me that despite my efforts to make NaiveNASlib very easy to use (and it really is, please try it out!), there is still a significant amount of code needed to do the actual architecture search, even for something as simple as genetic optimization. As I figured people might be hesitant to try the other libs if there is not a single working example of how to actually use them for what they are intended to do, I decided to release it (with a disclaimer that it is only meant as an example).

It shall also be noted that while the first two are released as version 1.x.x, NaiveGAflux is preliminary. Most notably I have not tried to use it for anything else than image classification and that is the only thing supported by the fit method. Suggestions on example problems in other domains are more than welcome, especially PRs!

In its current state, NaiveGAflux is probably more suited to people who want to mess around with genetic algorithms rather than people who want to outsource the model building to a computer.

Oh well, that’s it. Let the tickets flow or the tumbleweed roll!!

QnA (Questions nobody Asked):

Why are the packages prefixed with the word Naive?
- Well, given that these are my first Julia packages there are probably more than a couple of newbie mistakes when it comes to the design. I wanted to keep the names NAS.jl and/or AutoFlux.jl free for when someone like Mike Innes gets around to hacking everything you need into the compiler.

What kind of accuracy can one expect from NaiveGAflux?
- I have not made any large effort to tune the (hyper-)hyper-parameters, nor have I investigated how good it is at converging. It seems to be able to get about 10% test error rate on CIFAR10 after less than 100 generations (where one generation is trained on 400 batches with 32 examples per batch) using a population of 50 models. I have not investigated how reproducible this is or if it will get a better accuracy if allowed to run longer.

Is it really feasible to use the MILP formulation?
- I was indeed worried that the method would be impractical as the solver could randomly get stuck on “hard architectures”, but after some light experiments with very “transparent” architectures of up to 10000 layers my worries were relieved. I have not seen it take more than a fraction of a second in “normal” use, which should be insignificant in comparison to the time it takes to train the model. Should problems surface, I guess as a last resort one could just set a time limit for the solver and treat a timeout in the same manner as an infeasible problem is treated.

Are you aware that many cool state of the art NAS methods (e.g. DARTS, NEAT, DNW) don’t really use the operations provided by NaiveNASlib?
- Yes. Yet another reason why I didn’t want to call it NASlib.jl I guess.

Does it do Neural ODEs or more advanced control structures?
- Not easily as of now unfortunately. There is an issue with some suggestions posted for NaiveNASlib. Feel free to submit a PR for it 🙂


Very interesting, thanks for sharing!

A potential use case came to mind: when designing convolutional autoencoders, i.e. models that reproduce their input at the output, one often faces problems finding parameters for kernel size, padding, stride and dilation that result in the output being the same size as the input. One example:

The encoder reduces the size of the input by, e.g., using strided convolution. When doing the reverse in the decoder, the result is not always of the same size as where we started due to quantization. For large autoencoders with a bottleneck, it’s often a hassle to find parameters that work together.
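As a small illustration of this quantization effect, the sketch below uses the standard output-size formulas for a convolution and a transposed convolution along one spatial dimension. The helper names `conv_out` and `convtranspose_out` are made up for this example and are not part of Flux or the NaiveNAS packages.

```julia
# Standard output-size formula for a convolution along one spatial dimension:
# n = input size, k = kernel size, s = stride, p = padding, d = dilation.
conv_out(n; k, s=1, p=0, d=1) = fld(n + 2p - d * (k - 1) - 1, s) + 1

# Corresponding formula for a transposed convolution (no extra output padding).
convtranspose_out(n; k, s=1, p=0, d=1) = (n - 1) * s - 2p + d * (k - 1) + 1

for n in (15, 16)
    encoded = conv_out(n; k=3, s=2, p=1)
    decoded = convtranspose_out(encoded; k=3, s=2, p=1)
    println("input $n -> encoded $encoded -> decoded $decoded")
end
# prints:
# input 15 -> encoded 8 -> decoded 15
# input 16 -> encoded 8 -> decoded 15
```

Both 15 and 16 are encoded to size 8 because of the floor division by the stride, so the decoder cannot recover both: with these parameters an input of size 16 comes back as 15.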

Could your packages, like NaiveNASflux.jl, help me find parameters that work and are in some sense close to a set of parameters I have specified?

Thanks for showing interest!

There is currently very little support w.r.t. sizes of the feature maps, unfortunately, as the packages try very hard to be generic towards all types of layers.

It might be warranted to add some utilities for this given that the same type of issue with convolutions appears even outside the context of autoencoders. For instance, the image classifier example contains a lot of code which is basically single-purpose solutions to ensure that different-sized feature maps are not concatenated or elementwise added. There is also a heavy reliance on “same” padding to keep feature maps aligned in size.

It would in principle be quite straightforward to add something which e.g. computes the output shape of a vertex given an input shape to some other vertex in the graph.

I think this has the fundamental disadvantage that it is still not easy to write code around this which has a reasonable chance to “understand” what is going on. For example, how would a program know whether a size change from e.g. 8x8 to 4x4 is due to subsampling or just truncation?

Another thing is that it is currently allowed to use “raw” functions as the computation performed by a vertex, and for that case it would be a bit awkward to “query” the function about what it does with a certain input size.

One can ofc just run the function with a certain input and see what comes out, but that option exists already, for example like this:

activations = Dict{AbstractVertex, Any}(v1 => inputarray)
output!(v2, activations) # activations contain the output for each vertex between v1 and v2 after this 

W.r.t. finding parameters that work and are in some sense close to a set of specified parameters: I guess that if this is a step only performed when first constructing the architecture, one could just build the model incrementally, using some heuristics to e.g. count the number of truncated elements and the “total subsampling factor” in the encoder and then just compensate for truncations with padding and subsampling with dilation in the decoder.

To enable the above in the context that one wants to evolve a (half-)trained model by mutating things like stride, padding and dilation: Yikes, I don’t really have a good answer for that now 🙂

Ugh, sorry for wall of text…

NaiveNASlib and NaiveNASflux 2.0 are now released.

The main change is that alignment of parameter sizes and neurons is now done in a single step, compared to previous versions where the procedure was to first align sizes, and after that perform a second step for selecting/inserting neurons. This leads to a higher success rate, as it was previously possible for sizes to align in a way where there was no feasible solution to the neuron alignment problem.

Another big change is that NaiveNASlib now fully supports Functors.jl. NaiveNASflux 1.x supported Functors, but it had the somewhat unfortunate side effect that it also mutated the input model when using Functors.fmap. This is now changed so that fmapping models always keeps the original model intact and should be safe for general usage.

Given that I suspect that the user base is fairly small, I decided to make a number of other cosmetic breaking changes to make the packages more idiomatic, such as consistently adding bangs to mutating functions and putting arguments which are functions first to enable do syntax.
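For readers unfamiliar with these two conventions, here is a minimal plain-Julia sketch. The function `scale_where!` is a hypothetical helper invented for this illustration and does not exist in any of the packages.

```julia
# Hypothetical helper: the bang signals that xs is mutated, and the function
# argument comes first, as in map and open from Base.
function scale_where!(pred, xs, factor)
    for i in eachindex(xs)
        if pred(xs[i])
            xs[i] *= factor
        end
    end
    return xs
end

xs = [1, -2, 3, -4]

# Because pred is the first argument, it can be supplied as a do block.
scale_where!(xs, 10) do x
    x < 0
end

println(xs)  # prints [1, -20, 3, -40]
```

The calls to `Δnout!` and `Δsize!` in the example below follow the same pattern, with the do block supplying the utility function.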

Here is a short(ish) example of how to use the packages to reduce the number of parameters of a pretrained resnet from the onnx model zoo in a prune-finetune loop. As I couldn’t be bothered to download the whole imagenet dataset it’s using imagenette provided by FastAI.

using NaiveNASflux, ONNXNaiveNASflux, Flux
using Statistics: quantile

using NaiveNASflux: defaultutility
using ONNXNaiveNASflux: create_vertex_default

function runit(maxprune=0.2)
    # Load the model and attach an activation based neuron utility measure (ActivationContribution).
    model = load("resnet18-v2-7.onnx"; vfun = (args...) -> create_vertex_default(args...; layerfun=ActivationContribution))

    # Imagenette has only 10 labels, so we need to reduce the size of the output layer
    Δnout!(model[end] => -990) do v
        utility = fill(-1, nout(v))
        # This is the subset of labels used in imagenette, let's keep them for better performance out of the box
        utility[[1, 218, 483, 492, 498, 567, 570, 572, 575, 702]] .= 10
        return utility
    end

    # The data loading code (trainiter, validiter and GpuIter) is a bit lengthy and not included here for brevity reasons
    # I needed to apply a bit of augmentation or else the model overfitted very quickly
    trainbatchsize = 128
    traindl = trainiter(batchsize=trainbatchsize) |> GpuIter
    validdl = validiter(batchsize=128) |> GpuIter
    model = gpu(model)

    # Check that we get something similar to the advertised accuracy
    @info "Accuracy before training: $(accuracy(model, validdl))"
    for e in 1:20
        @info "Begin epoch $e"
        Flux.train!((x,y) -> Flux.logitcrossentropy(model(x), y), params(model), traindl, Momentum(0.01 / 4))

        acc = accuracy(model, validdl)
        @info "  Accuracy: $acc"

        # Prune the model if accuracy is good enough
        if acc > 0.90
            # Move to CPU for pruning as CuArrays don't like to be indexed so much
            cpumodel = cpu(model)
            # First layer is a batchnorm which is tied to the input size, so we can't prune it
            # We obviously don't want to prune the output size of the last layer either
            valid_vertices = cpumodel[3:end-1]
            nparams_pre = mapreduce(length, +, params(cpumodel))
            # This is the pruning step. We ask NaiveNASlib to maximize the utility which means
            # it will prune neurons with negative utility.
            Δsize!(valid_vertices) do v
                util = defaultutility(v)
                length(util) > 1 || return util
                # Shift the utilities so that roughly a fraction maxprune of them becomes negative
                util .- quantile(util, maxprune)
            end
            nparams_post = mapreduce(length, +, params(cpumodel))
            nparams_diff = nparams_pre - nparams_post
            pruned_perc = 100 * round(nparams_diff / nparams_pre; sigdigits=2)
            @info "Pruned $pruned_perc% of parameters, went from $nparams_pre params to $nparams_post params"
            model = gpu(cpumodel)
        end
    end
    return model
end

function accuracy(model, iter)
    acc, cnt = 0, 0
    for (x,y) in iter
        correct = Flux.onecold(model(x)) .== Flux.onecold(y)
        acc += sum(correct)
        cnt += length(correct)
    end
    return acc / cnt
end

Output from runit():

[ Info: Accuracy before training: 0.7493492004462625
[ Info: Begin epoch 1
[ Info:   Accuracy: 0.9315730754927483
[ Info: Pruned 19.0% of parameters, went from 11179984 params to 9035198 params
[ Info: Begin epoch 2
[ Info:   Accuracy: 0.9126069170695426
[ Info: Pruned 18.0% of parameters, went from 9035198 params to 7451981 params
[ Info: Begin epoch 3
[ Info:   Accuracy: 0.9137225734473782
[ Info: Pruned 17.0% of parameters, went from 7451981 params to 6156741 params
[ Info: Begin epoch 4
[ Info:   Accuracy: 0.9021941242097434
[ Info: Pruned 17.0% of parameters, went from 6156741 params to 5095766 params
[ Info: Begin epoch 5
[ Info:   Accuracy: 0.8876905912978803
[ Info: Begin epoch 6
[ Info:   Accuracy: 0.9111193752324284
[ Info: Pruned 17.0% of parameters, went from 5095766 params to 4251426 params
[ Info: Begin epoch 7
[ Info:   Accuracy: 0.8687244328746746
[ Info: Begin epoch 8
[ Info:   Accuracy: 0.9092599479360357
[ Info: Pruned 16.0% of parameters, went from 4251426 params to 3578973 params
[ Info: Begin epoch 9
[ Info:   Accuracy: 0.8757902566009669
[ Info: Begin epoch 10
[ Info:   Accuracy: 0.8947564150241726
[ Info: Begin epoch 11
[ Info:   Accuracy: 0.895128300483451
[ Info: Begin epoch 12
[ Info:   Accuracy: 0.9010784678319078
[ Info: Pruned 17.0% of parameters, went from 3578973 params to 2978700 params
[ Info: Begin epoch 13
[ Info:   Accuracy: 0.8783934548159167
[ Info: Begin epoch 14
[ Info:   Accuracy: 0.8928969877277798
[ Info: Begin epoch 15
[ Info:   Accuracy: 0.8847155076236519
[ Info: Begin epoch 16
[ Info:   Accuracy: 0.885459278542209
[ Info: Begin epoch 17
[ Info:   Accuracy: 0.8862030494607661
[ Info: Begin epoch 18
[ Info:   Accuracy: 0.9014503532911863
[ Info: Pruned 15.0% of parameters, went from 2978700 params to 2545335 params
[ Info: Begin epoch 19
[ Info:   Accuracy: 0.8642618073633321
[ Info: Begin epoch 20
[ Info:   Accuracy: 0.871699516548903

Perhaps unsurprisingly, we could prune almost 80% of the parameters and still get quite high accuracy, as imagenette has just 10 labels.


Very nice work! I have recently been working on automatic network pruning, and your packages could help me a lot.

One question: when working with NaiveNASflux, if the output of a dense layer is updated, are the previous weights changed or are they maintained? It is not clear to me how the mutation affects the parameters. Thanks again for the packages.

Thanks for showing interest!

They are indeed changed (unless you explicitly state that they shall not be). This is precisely the point of the packages I would say.

If you for example decide to prune output neuron N from a dense layer (i.e. row N from the weight matrix and element N from the bias), it will remove input neuron N (i.e. column N from the weight matrix) from the next dense layer. This propagates correctly through things like activations, batchnorm, elementwise ops and concatenation. This is what I ended up needing to use JuMP for, as for arbitrary nestings of such ops (something a random neural architecture search might do) it was just not possible to do without considering everything at the same time.
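The bookkeeping for the simplest case, two stacked dense layers, can be sketched by hand in plain Julia. This is only an illustration of which rows and columns belong together, not how NaiveNASflux implements it; the weights, the `forward` helper and the local `relu` definition are made up for the example.

```julia
relu(x) = max(x, zero(x))

# Two stacked dense layers with integer parameters so results compare exactly.
W1 = [1 2; 3 -4; -5 6; 7 8]          # 4 outputs, 2 inputs
b1 = [1, -1, 2, 0]
W2 = [1 0 2 -1; 0 3 1 1; 2 2 0 1]    # 3 outputs, 4 inputs
b2 = [0, 1, -1]

forward(W1, b1, W2, b2, x) = W2 * relu.(W1 * x .+ b1) .+ b2

N = 2
keep = setdiff(1:size(W1, 1), [N])

# Prune output neuron N of layer 1 (row N of W1, element N of b1)
# and the matching input neuron of layer 2 (column N of W2).
W1p, b1p = W1[keep, :], b1[keep]
W2p = W2[:, keep]

x = [2, 1]
y_pruned = forward(W1p, b1p, W2p, b2, x)

# Reference: the original network with the activation of neuron N zeroed out.
mask = [i == N ? 0 : 1 for i in 1:size(W1, 1)]
y_ref = W2 * (mask .* relu.(W1 * x .+ b1)) .+ b2

println(y_pruned == y_ref)  # prints true
```

The pruned network gives the same output as the original with neuron N silenced, which shows that every surviving neuron is still connected to the same weights as before.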

Not only does it ensure that the model sizes are consistent (i.e. it doesn’t throw a dimension mismatch error), it also makes sure that every surviving neuron is connected to the same neuron as it previously was. I tried to make this very clear in this example and I would love some feedback on how well the example (and the preceding examples) explains it 🙂

For example, one reason why each prune step above does not remove exactly 20% of the parameters is that parameters are “connected” through the residual connections and the batchnorms, so even if layer X has negative utility for neuron N, discarding neuron N might mean you need to discard neuron N from layer Y which has a higher positive utility for the same neuron, so the solver will decide to keep it.
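A toy version of this effect in plain Julia is sketched below. It assumes that the solver simply maximizes the summed utility of the neurons it keeps, which is a simplification of what NaiveNASlib does, and the utility values are made up.

```julia
using Statistics: quantile

maxprune = 0.2

# Hypothetical per-neuron utilities for two layers whose outputs are added elementwise.
utilX = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
utilY = [2.0, 0.3, 0.6, 0.1, 0.9, 1.5, 0.2, 0.8, 0.7, 0.4]

# Same shift as in the prune step: roughly a fraction maxprune becomes negative.
shift(util) = util .- quantile(util, maxprune)
sX, sY = shift(utilX), shift(utilY)

# Treated independently, each layer would drop its negative neurons.
dropX = findall(<(0), sX)
dropY = findall(<(0), sY)

# Tied by the elementwise addition, both layers must keep the same indices,
# so neuron n is only dropped when the summed utility is negative.
drop_joint = findall(<(0), sX .+ sY)

println("independent: X drops $dropX, Y drops $dropY")
println("joint: both drop $drop_joint")
# prints:
# independent: X drops [1, 6], Y drops [4, 7]
# joint: both drop [4]
```

Each layer on its own would lose 20% of its neurons, but once they are tied together only one neuron in ten has negative total utility, so less than 20% is pruned.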