It is efficient! Case in point: these final Julia and Python versions are shorter, use only vectorized built-ins and broadcasting, and aren't too worried about allocating intermediate arrays outside of accumulating into the weights:
```julia
using LinearAlgebra, Random, Statistics
using MLDatasets

sigmoid(z) = 1 / (1 + exp(-z))

function sigmoid_prime(z)
    s = sigmoid(z)
    return s * (1 - s)
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w * a .+ b)
    end
    return a
end

function evaluate(test_x, test_y, biases, weights)
    test_results = predict(test_x, biases, weights)
    max_idxs = argmax(test_results, dims=1)
    # This is an array of 2d coordinates (CartesianIndex),
    # which means we can use it to check whether each one-hot vector matches
    return sum(test_y[max_idxs])
end

cost_prime(output, y) = output - y

function update_mini_batch(x, y, eta, biases, weights)
    batch_size = size(y, 2)
    a = x
    as = [a]
    zs = Matrix{Float64}[]
    for (b, w) in zip(biases, weights)
        z = w * a .+ b
        push!(zs, z)
        a = sigmoid.(z)
        push!(as, a)
    end

    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end])
    biases[end] .-= eta .* dropdims(mean(delta, dims=2), dims=2)
    weights[end] .-= (eta / batch_size) .* delta * as[end - 1]'

    for l in 1:length(zs) - 1
        delta = weights[end - l + 1]' * delta .* sigmoid_prime.(zs[end - l])
        biases[end - l] .-= eta .* dropdims(mean(delta, dims=2), dims=2)
        weights[end - l] .-= (eta / batch_size) .* delta * as[end - l - 1]'
    end
end

function SGD(train_x, train_y, epochs, mini_batch_size, eta, biases, weights, test_x, test_y)
    n_test = size(test_y, 2)
    n_train = size(train_y, 2)
    train_idxs = collect(1:n_train)

    for j in 1:epochs
        shuffle!(train_idxs)
        for batch_idxs in Iterators.partition(train_idxs, mini_batch_size)
            update_mini_batch(train_x[:, batch_idxs], train_y[:, batch_idxs], eta, biases, weights)
        end

        res = evaluate(test_x, test_y, biases, weights)
        println("Epoch $j: $res / $n_test")
    end

    return biases, weights
end

let
    # N.B. this would be much better off as Float32, but we're keeping it in Float64 for ease of comparison
    train_x = reshape(MNIST.traintensor(Float64, 1:50_000), 784, :)
    # I(10) allocates an identity matrix just like np.eye(10).
    # We collect it to a normal array and use fancy indexing to slice 50k one-hot vectors out of it
    train_y = collect(I(10))[:, MNIST.trainlabels(1:50_000) .+ 1]
    test_x = reshape(MNIST.testtensor(Float64), 784, :)
    test_y = collect(I(10))[:, MNIST.testlabels() .+ 1]

    layer_sizes = (784, 30, 10)
    biases = [randn(l) for l in layer_sizes[2:end]]
    weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end - 1], layer_sizes[2:end])]

    @time SGD(train_x, train_y, 10, 10, 3.0, biases, weights, test_x, test_y);
end
```
```python
import gzip
import pickle

import numpy as np


class Network:
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.biases = [np.random.randn(y) for y in sizes[1:]]
        self.weights = [np.random.randn(x, y) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(a @ w + b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data):
        test_x, test_y = test_data
        n_test = test_y.shape[0]
        train_x, train_y = training_data
        n_train = train_y.shape[0]
        train_idxs = np.arange(n_train)
        for j in range(epochs):
            np.random.shuffle(train_idxs)
            for k in range(0, n_train, mini_batch_size):
                batch_idxs = train_idxs[k : k + mini_batch_size]
                self.update_mini_batch(
                    train_x[batch_idxs, :], train_y[batch_idxs, :], eta
                )
            print(f"Epoch {j} : {self.evaluate(test_x, test_y)} / {n_test}")

    def update_mini_batch(self, x, y, eta):
        batch_size = y.shape[0]
        activation = x
        activations = [x]  # list to store all the activations, layer by layer
        zs = []  # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = activation @ w + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        self.biases[-1] -= eta * delta.mean(0)
        self.weights[-1] -= (eta / batch_size) * (activations[-2].T @ delta)
        for l in range(2, self.num_layers):
            z = zs[-l]
            delta = (delta @ self.weights[-l + 1].T) * sigmoid_prime(z)
            self.biases[-l] -= eta * delta.mean(0)
            self.weights[-l] -= (eta / batch_size) * (activations[-l - 1].T @ delta)

    def evaluate(self, test_x, test_y):
        max_idxs = np.expand_dims(self.feedforward(test_x).argmax(-1), -1)
        return np.take_along_axis(test_y, max_idxs, -1).sum()


#### Miscellaneous functions
def cost_derivative(output_activations, y):
    return output_activations - y


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    x = sigmoid(z)
    return x * (1 - x)


def load_data(data_path):
    with gzip.open(data_path, "rb") as f:
        (train_x, train_y), _, (test_x, test_y) = pickle.load(f, encoding="latin1")
    train_x = train_x.reshape(-1, 784)
    train_y = np.eye(10)[train_y, :]
    test_x = test_x.reshape(-1, 784)
    test_y = np.eye(10)[test_y, :]
    return (train_x, train_y), (test_x, test_y)


if __name__ == "__main__":
    import time

    training_data, test_data = load_data("mnist.pkl.gz")
    training_data = list(training_data)
    net = Network((784, 30, 10))
    tstart = time.perf_counter()
    net.SGD(training_data, 10, 10, 3.0, test_data=test_data)
    tstop = time.perf_counter()
    print(f"{tstop - tstart:0.4f}")
```
Benchmarking on my machine, these are within ~200ms of each other and at least 2x faster than the next fastest solution so far. The trick? We make as much use of the efficient default matmul as possible by passing in a whole batch of data instead of feeding in one sample at a time. This is exactly how most ML frameworks do things, and the original author came so, so close to doing it too (but inexplicably decided not to). I'd argue the implementation isn't any more complex or obscure than the original either. For reference, most of the actual calculations were pulled from an old homework assignment.
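To make the batching point concrete, here's a minimal standalone sketch (not part of the solutions above; the sizes and the BenchmarkTools timing harness are purely illustrative) contrasting one matrix-vector product per sample against a single matrix-matrix product over the whole batch:

```julia
# Minimal sketch: per-sample matvecs vs. one batched matmul for a 784 -> 30 layer.
# Sizes are illustrative; assumes BenchmarkTools is installed.
using LinearAlgebra, BenchmarkTools

W = randn(30, 784)
X = randn(784, 10_000)                                  # 10k samples stored as columns

per_sample(W, X) = [W * X[:, i] for i in axes(X, 2)]    # one matrix-vector product per sample
batched(W, X)    = W * X                                # a single matrix-matrix product for the batch

@btime per_sample($W, $X);
@btime batched($W, $X);
```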
So TL;DR: the default matmul is the default because it's darned fast, and it's darned fast in most languages because they all wrap similar optimized libraries (OpenBLAS, MKL, BLIS, etc.). For better or worse, you're unlikely to squeeze more juice out of this benchmark because most of the load-bearing routines have already been optimized to the nines. Where a language that can perform aggressive compiler optimizations, like Julia, really shines is on problems that don't already have an optimized library ready to go; see Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow - Stochastic Lifestyle for a great example.
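As a rough illustration of that last point (again a standalone sketch, not taken from the benchmark), the built-in `*` on dense `Float64` matrices ends up in a BLAS gemm routine either way, which is why a hand-rolled loop has little room to win here:

```julia
# Standalone sketch: a naive triple loop vs. the BLAS-backed built-in matmul.
using LinearAlgebra

A, B = randn(512, 512), randn(512, 512)

function naive_matmul(A, B)
    C = zeros(size(A, 1), size(B, 2))
    @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

C_loop = naive_matmul(A, B)
C_star = A * B                          # dispatches to the bundled BLAS (OpenBLAS by default)
C_inpl = mul!(similar(C_star), A, B)    # in-place form, same BLAS call underneath
@assert C_loop ≈ C_star && C_star ≈ C_inpl
```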