Julia Performance - Help Needed

Hello. Given the promises of performance, I’ve started my Julia work with a translation of a simple deep learning algo to Julia. My Julia code is below and the original Python code network.py is from the book by Michael Nielsen (GitHub - mnielsen/neural-networks-and-deep-learning: Code samples for my book "Neural Networks and Deep Learning"). The Julia performance is about 2.5 times slower than the Python one. Please help to understand where I am loosing my performance. I know the code is not optimized (both Julia and Python), but they are in line with one another. If I can at least match Python’s performance, that would be great, but, of course, it is desirable to outperform it. Thanks.

P.S. I’ve made changes based on several suggestions, but underperformance remains at the same level. I’ve been looking at CPU utilization and it is at about 30% with Julia code and 70% with Python code. Julia seems to be running on 1 core, while Python is not that clear, but looks like it utilizes more cores by default (does this make sense?). If that is the case, then I need to parallelize the Julia code to be compared with Python at the same CPU utilization. Any suggestions? Thanks guys.

===================================================================

using MLDatasets, Random

# Data
train_x_orig, train_y_orig = MNIST.traindata();
test_x_orig,  test_y_orig  = MNIST.testdata();

function vectorized_result(y)
    e = zeros(10)
    e[y+1] = 1
    return e
end

train_x, train_y = Float64.(reshape(train_x_orig[:,:,1:50000], (28*28, :))), [vectorized_result(y) for y in train_y_orig[1:50000]];
valid_x, valid_y = Float64.(reshape(train_x_orig[:,:,50001:end], (28*28, :))), train_y_orig[50001:end];
test_x, test_y = Float64.(reshape(test_x_orig, (28*28, :))), test_y_orig;

function sigmoid(z)
    return @. 1.0./(1.0 + exp(-z))
end

function sigmoid_prime(z)
    s = sigmoid(z);
    return s.*(1-s);
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w*a .+ b)
    end
    return vec(a)
end

function evaluate(test_x, test_y, biases, weights)
    test_results = [argmax(predict(col, biases, weights))-1 for col in eachcol(test_x)]
    return sum(test_results.==test_y)
end

function cost_prime(output, y)
    return output - y
end

function backpropagation(x, y, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases];
    nabla_w = [zeros(size(w)) for w in weights];
    a = x;
    as = [a];
    zs = Vector{Float64}[];
    for (b, w) in zip(biases, weights)
        z = w * a + b;
        push!(zs, z);
        a = sigmoid.(z);
        push!(as, vec(a));
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end]);
    nabla_b[end] = delta;
    nabla_w[end] = delta*as[end-1]';
    for l=1:length(zs)-1
        delta = weights[end-l+1]' * delta .* sigmoid_prime.(zs[end-l]);
        nabla_b[end-l] = delta;
        nabla_w[end-l] = delta * as[end-l-1]';
    end
    return (nabla_b, nabla_w);
end

function update_mini_batch(train_x_minibatch, train_y_minibatch, mini_batch_size, eta, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases];
    nabla_w = [zeros(size(w)) for w in weights];
    for i=1:mini_batch_size
        (delta_nabla_b, delta_nabla_w) = backpropagation(train_x_minibatch[:, i], train_y_minibatch[i], biases, weights);
        nabla_b = [nb + dnb for (nb, dnb) in zip(nabla_b, delta_nabla_b)]; nabla_w = [nw + dnw for (nw, dnw) in zip(nabla_w, delta_nabla_w)]; # Loop over layers
    end
    biases  = [b - nb*eta/mini_batch_size for (b, nb) in zip(biases, nabla_b)];
    weights = [w - nw*eta/mini_batch_size for (w, nw) in zip(weights, nabla_w)];
    return (biases, weights);
end

function SGD(train_x, train_y, epochs, mini_batch_size, eta, biases, weights, test_x, test_y)
    n_test = length(test_y);
    n_train = length(train_y);
    for j=1:epochs
        train_x_minibatches = [train_x[:, k:k+mini_batch_size-1] for k=1:mini_batch_size:n_train];
        train_y_minibatches = [train_y[k:k+mini_batch_size-1] for k=1:mini_batch_size:n_train];
        for mini_batch=1:length(train_y_minibatches)
            (biases, weights) = update_mini_batch(train_x_minibatches[mini_batch], train_y_minibatches[mini_batch], mini_batch_size, eta, biases, weights);
        end
        res = evaluate(test_x, test_y, biases, weights);
        print("Epoch $j: $res / $n_test\n");
    end
    return (biases, weights);
end

layer_sizes = [784, 30, 10];
biases = [randn(l) for l in layer_sizes[2:end]];
weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end-1], layer_sizes[2:end])];

@time (biases_opt, weights_opt) = SGD(train_x, train_y, 5, 10, 3.0, biases, weights, test_x, test_y);

You can edit your post adding triple backticks

```
like this
```

to get code formatting, which will make it easier to read.

Thanks!

This won’t solve the whole thing, but defining

function sigmoid_prime(z)
    s = sigmoid(z)
    return s*(1-s)
end

will be about 2x faster (for that part), since it doesn’t have to calculate sigmoid twice.

How about

function sigmoid_prime(z)
    s = sigmoid(z)
    return muladd(-s, x, x)
end

This should use a single vfnmadd (fused negative multiply add) instruction instead of a sub and mul.
Plenty of other minor possibilities, e.g.

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a .= sigmoid.(w*a .+ b);
    end
    return vec(a);
end
2 Likes

I was going for more new user targeted feedback, but that would be better.

2 Likes

I didn’t read all your code, but this

is a big red flag. It will create a Vector{Any}, which is terrible for performance. Make sure to give it a proper type, such as

zs = Float64[] 

or whatever is appropriate.

3 Likes

This isn’t so much about performance, but there’s little reason to create an nx1 matrix and then turn it into a vector:

Just make it a vector in the first place:

e = zeros(10)
e[y+1] = 1
return e

And drop the semicolons.

2 Likes

Thanks. This has sped up the code by about 5%. But so did the Python code, so the relative performance remains about 2.5x in favour of Python.

Agreed. I’ve made the changes.

Changed this as well, however it doesn’t help with the performance. I even tried pre-allocating as and zs at a parent function level and passing as arguments, but doesn’t help either. The bottleneck is somewhere else.

Thank you and I agree with the suggestion if absolute performance is needed. However, I try to keep the Julia code in sync with Python part so that the comparison is apples to apples.

You code seems reasonably well structured in functions. Do you know where is the bottleneck? Which functions take most of the time?

You can profile it: Profiling · The Julia Language

and then ask for help in optimizing the function which is slowing your code down.

At the same time, it is not reasonable to “compare apples to apples” in that sense: on the one hand, if you keep your code as the same series of vectorized operations, i. e. if the python code is already just a bunch of vectorized calls to library functions in all performant critical parts, then there is no reason for Julia to be faster. Julia will be similar in speed if these vector operations are made “correctly” from the Julia point of view as well (mainly I mean that you have to take care of using @views and fusing broadcasted operations; there are some situations where you may be getting that from the python syntax and not from the Julia one as it is).

But to really take advantage of Julia you need to use the structure of your problem and operations and liberate your self from the constraint of having to write everything as a series vectorized operations. Then, if your problem has specific characteristics that allow optimization, the Julia code can be much faster. The possibility of fusing many vector operations in a loop, avoiding a lot of intermediate vectors, is one of those possibilities.

2 Likes

I took the liberty of tweaking the code to be as close to the updated Python implementation at DeepLearningPython/network.py at master · MichalDanielDobrzanski/DeepLearningPython · GitHub as possible:

using MLDatasets, Random, BenchmarkTools

function vectorized_result(y)
    e = zeros(10)
    e[y + 1] = 1
    return e
end

process_sample(x, y) = collect(reshape(x, 28 * 28, 1)), vectorized_result(y)


sigmoid(z) = 1 / (1 + exp(-z))

function sigmoid_prime(z)
    s = sigmoid(z)
    return s * (1 - s)
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w * a .+ b)
    end
    return vec(a)
end

function evaluate(test_data, biases, weights)
    test_results = [(argmax(predict(x, biases, weights)), y) for (x, y) in test_data]
    return sum(y[pred] for (pred, y) in test_results)
end

cost_prime(output, y) = output - y

function backpropagation(x, y, biases, weights, nabla_b, nabla_w)
    nabla_b = [zeros(size(b)) for b in biases]
    nabla_w = [zeros(size(w)) for w in weights]
    a = x
    as = [a]
    zs = Matrix{Float64}[]
    for (b, w) in zip(biases, weights)
        z = w * a + b
        push!(zs, z)
        a = sigmoid.(z)
        push!(as, a)
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end])
    nabla_b[end] = delta
    nabla_w[end] = delta * as[end - 1]'
    for l in 1:length(zs) - 1
        delta = weights[end - l + 1]' * delta .* sigmoid_prime.(zs[end - l])
        nabla_b[end - l] = delta
        nabla_w[end - l] = delta * as[end - l - 1]'
    end
    return nabla_b, nabla_w
end

function update_mini_batch(minibatch, eta, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases]
    nabla_w = [zeros(size(w)) for w in weights]
    for (x, y) in minibatch
        delta_nabla_b, delta_nabla_w = backpropagation(x, y, biases, weights, nabla_b, nabla_w)
        nabla_b = [nb + dnb for (nb, dnb) in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for (nw, dnw) in zip(nabla_w, delta_nabla_w)]
    end
    biases  = [b - (eta / length(minibatch)) * nb for (b, nb) in zip(biases, nabla_b)]
    weights = [w - (eta / length(minibatch)) * nw for (w, nw) in zip(weights, nabla_w)]
    return biases, weights
end

function SGD(train_data, epochs, mini_batch_size, eta, biases, weights, test_data)
    n_train = length(train_data)
    for j in 1:epochs
        shuffle!(train_data)
        train_minibatches = [train_data[k:k + mini_batch_size - 1] for k in 1:mini_batch_size:n_train]
        for minibatch in train_minibatches
            (biases, weights) = update_mini_batch(minibatch, eta, biases, weights)
        end
        res = evaluate(test_data, biases, weights)
        print("Epoch $j: $res / $(length(test_data))\n")
    end
    return biases, weights
end

let
    train_x, train_y = MNIST.traindata(Float64, 1:50_000)
    train_data = process_sample.(eachslice(train_x, dims=3), train_y)

    valid_x, valid_y = MNIST.traindata(Float64, 1:50_000)
    valid_data = process_sample.(eachslice(valid_x, dims=3), valid_y)

    test_x, test_y = MNIST.testdata(Float64)
    test_data = process_sample.(eachslice(test_x, dims=3), test_y)
    layer_sizes = [784, 30, 10]
    biases = [randn(l, 1) for l in layer_sizes[2:end]]
    weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end - 1], layer_sizes[2:end])]

    @time SGD(train_data, 5, 10, 3.0, biases, weights, test_data);
end

After warmup, this runs in ~20.5s (-1 or 2s if you warmup) vs ~14.5s for the Python version. As expected, it also uses all cores on my machine. So still a large gap, but nowhere near 2.5x.

Now as @leandromartinez98 mentioned, profiling is a good way to find bottlenecks with the code. If you toss this routine under a profiler (VSCode has a nice visual one):

# Warmup for a cleaner profile
SGD(train_data, 1, 10, 3.0, biases, weights, test_data);
@profview SGD(train_data, 5, 10, 3.0, biases, weights, test_data);

The flamegraph shows mostly what you’d expect: matrix multiplication (mul! and gemm) and pointwise operations (+ and *) that use broadcasting under the hood. What does stand out is the leftmost group: ~4s of computation is spent on zeros -> fill! -> setindex!. Seems weird that zeroing out an array would be on the hot path, but let’s try not doing it and seeing what happens:

# nabla_b = [zeros(size(b)) for b in biases]
# nabla_w = [zeros(size(w)) for w in weights]
nabla_b = [similar(b) for b in biases]
nabla_w = [similar(w) for w in weights]
# nabla_b = [np.zeros(b.shape) for b in self.biases]
# nabla_w = [np.zeros(w.shape) for w in self.weights]
nabla_b = [np.empty(b.shape) for b in self.biases]
nabla_w = [np.empty(w.shape) for w in self.weights]

Now the Julia version is ~15s after warmup. Paradoxically, the Python version is ~1.5s slower! Armed with this information, we can run a direct comparison of both zeros functions:

In [1]: %timeit x = np.zeros((1024, 1024), dtype=np.float64)
285 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

julia> @benchmark x = zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 7380 samples with 1 evaluation.
 Range (min … max):  567.270 μs …   1.527 ms  ┊ GC (min … max):  0.00% … 58.69%
 Time  (median):     588.213 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   674.429 μs ± 182.927 μs  ┊ GC (mean ± σ):  11.89% ± 16.61%

  █                                                              
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  608 μs           Histogram: frequency by time          592 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Given how often both the backprop and batch update functions are called, this adds up quickly!

Looking through both implementations, I think the difference is that Numpy uses calloc (which can be lazy) for its zeros implementation, whereas Julia eagerly fills with zeros. There is an issue open for Julia to use calloc for zeros as well, but AFAICT that hasn’t be implemented.


With the objective results out of the way, please allow me a bit of a rant on this. The original Python tutorial implementation is, for lack of a better term, horrendously unoptimized. I understand the need to keep things clear for pedagogical value, but most of the choices made don’t help there or even obfuscate the underlying behaviour. Just a couple examples to hammer home this point:

  1. Instead of creating dense minibatches of shape Nx784, it gloms together N 784x1 samples into one list and runs them through the network one by one. This takes away a lot of the efficiency gains from using BLAS, and it’s not even any clearer because there’s that random unexplained batch dimension in every input in addition to the batch dimension created by the vector length. Had they just used a proper 2d array, this would be a non-issue.

  2. A complete allergy to in-place operations. This:

self.weights = [w-(eta/len(mini_batch))*nw
                for w, nw in zip(self.weights, nabla_w)]

Is not strictly more legible or comprehensible than this:

for w, nw in zip(self.weights, nabla_w):
    w -= eta/len(mini_batch))*nw

Which has the benefit of allocating less. I’d argue it’s more clear for nabla_[bw], as it clearly signals the values are being updated/accumulated into and instead of just redefined.

  1. Useless pre-allocation. This is really the crux of the matter, because it accounts for almost the entire discrepancy between Python and Julia **. For lack of a better term, these two lines are useless because the indices are completely overwritten afterwards. What ought to happen is that update_mini_batch passes the nablas into backprop, which accumulates into them in-place. Not only does that establish a clear relationship between both sets of variables, but it dramatically reduces the amount of garbage generated (for reference, @time reports a staggering 150 GB).

Now complaining about a free thing someone poured their time into on the internet is old hat, but I find it worrying that a tutorial like this still has reach despite many better implementations existing on the internet (the number of search results for “numpy neural network MNIST” is overwhelming) and being stuck on an out-of-date language (Python 2).
So thank you for taking the time to port over this code and post about it. I think the 2 biggest takeaways are a) someone should revisit calloc for zeros, and b) I’m making that github repo/book an explicit anti-recommendation when anybody asks about ML tutorial resources :smiley:

** because almost all of the performance comes from optimized BLAS and pointwise operation routines, you’d be unlikely to see a significant difference between Python, Julia, or any other language that can make use of the above. Conversely, a language with good native codegen shines when such routines do not exist and you need to write them yourself: doing something like OptimalTransport.jl/quadratic_newton.jl at master · JuliaOptimalTransport/OptimalTransport.jl · GitHub with Numpy efficiently would be quite “fun”.

15 Likes

Thanks for the high-level advice. Appreciate the feedback and that’s something will help me learn Julia.
Just to clarify my reference to “apples to apples”, I meant to be comparing similar structure/logic of the code to prevent over-optimizing one or the other, and to keep the implementation effort of about the same. At the same time, as a new user of Julia, all your points are very valuable. Thank you.

Thanks a lot for taking the time to look at the code and suggestions around particular points. I have implemented you version (with minor changes) and I still see 2.1x underperformance. Not sure what is going on. I will post the full versions of the code below, but here are the main points:

  1. I’ve changed the Julia parts on nabla_b, nabla_w – those that are used in backprop (agreed, that they are overwritten and can be initialized by any values for speed) are different from the ones in the update_minibatch (these need to be initialized as 0s since they are being accumulated and they cannot be passed to backrop to avoid being modified at the parent level, otherwise the results will be wrong, e.g. the accuracy on the test results should be at least 9300/10000 after about 4-5 epochs). My way around the problem (see the code below) was to eliminate nablas in the parent function and add accumulated delta_nablas directly to weights. While this works in this particular case, it may not be optimal in more complex deep learning algos, but to eliminate the problem with zeros() that you found, I’ve implemented it that way. Julia version is 1.6.2. and is being run from VSCode. Runtime is 170s for 10 epochs.
  2. I am using Python 3 for my version of the Python code (3.9.6), which is being run from Spyder 5.0.5. Runtime is 80s for 10 epochs
  3. CPU utilization (fluctuates around 25% for Julia and 70% for Python)
    JuliaCPU

Julia code

using MLDatasets, Random, BenchmarkTools

function vectorized_result(y)
    e = zeros(10)
    e[y + 1] = 1
    return e
end

process_sample(x, y) = collect(reshape(x, 28 * 28, 1)), vectorized_result(y)

sigmoid(z) = 1 / (1 + exp(-z))

function sigmoid_prime(z)
    s = sigmoid(z)
    return s * (1 - s)
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w * a .+ b)
    end
    return vec(a)
end

function evaluate(test_data, biases, weights)
    test_results = [(argmax(predict(x, biases, weights)), y) for (x, y) in test_data]
    return sum(y[pred] for (pred, y) in test_results)
end

cost_prime(output, y) = output - y

function backpropagation(x, y, biases, weights)
    delta_nabla_b = [similar(b) for b in biases]
    delta_nabla_w = [similar(w) for w in weights]
    a = x
    as = [a]
    zs = Matrix{Float64}[]
    for (b, w) in zip(biases, weights)
        z = w * a + b
        push!(zs, z)
        a = sigmoid.(z)
        push!(as, a)
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end])
    delta_nabla_b[end] = delta
    delta_nabla_w[end] = delta * as[end - 1]'
    for l in 1:length(zs) - 1
        delta = weights[end - l + 1]' * delta .* sigmoid_prime.(zs[end - l])
        delta_nabla_b[end - l] = delta
        delta_nabla_w[end - l] = delta * as[end - l - 1]'
    end
    return delta_nabla_b, delta_nabla_w
end

function update_mini_batch(minibatch, eta, biases, weights)
    for (x, y) in minibatch
        delta_nabla_b, delta_nabla_w = backpropagation(x, y, biases, weights)
        biases  = [b - (eta / length(minibatch)) * nb for (b, nb) in zip(biases, delta_nabla_b)]
        weights = [w - (eta / length(minibatch)) * nw for (w, nw) in zip(weights, delta_nabla_w)]        
    end
    return biases, weights
end

function SGD(train_data, epochs, mini_batch_size, eta, biases, weights, test_data)
    n_train = length(train_data)
    for j in 1:epochs
        shuffle!(train_data)
        train_minibatches = [train_data[k:k + mini_batch_size - 1] for k in 1:mini_batch_size:n_train]
        for minibatch in train_minibatches
            (biases, weights) = update_mini_batch(minibatch, eta, biases, weights)
        end
        res = evaluate(test_data, biases, weights)
        print("Epoch $j: $res / $(length(test_data))\n")
    end
    return biases, weights
end


let
    train_x, train_y = MNIST.traindata(Float64, 1:50_000);
    train_data = process_sample.(eachslice(train_x, dims=3), train_y);

    valid_x, valid_y = MNIST.traindata(Float64, 50_001:60_000);
    valid_data = process_sample.(eachslice(valid_x, dims=3), valid_y);

    test_x, test_y = MNIST.testdata(Float64);
    test_data = process_sample.(eachslice(test_x, dims=3), test_y);

    layer_sizes = [784, 30, 10];
    biases = [randn(l, 1) for l in layer_sizes[2:end]];
    weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end - 1], layer_sizes[2:end])];

    @time SGD(train_data, 10, 10, 3.0, biases, weights, test_data);
    #@profview SGD(train_data, 1, 10, 3.0, biases, weights, test_data);
end

Python code

import random
import numpy as np

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases  = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.empty(b.shape) for b in self.biases]
        nabla_w = [np.empty(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sigmoid_prime(z)
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s*(1-s)





"""
mnist_loader
"""

import skimage
import skimage.transform
import numpy as np
from tensorflow.keras.datasets import mnist

def load_mnist(target_height=64, target_width=64):
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    def preprocess_mnist(df, target_height, target_width):
        df = [skimage.transform.resize(x, (target_height, target_width)).astype(np.float64) for x in df]
        return df

    x_train = preprocess_mnist(x_train, target_height, target_width)
    x_test = preprocess_mnist(x_test, target_height, target_width)
    return x_train, y_train, x_test, y_test

def load_data():
    x_train, y_train, x_test, y_test = load_mnist(target_height=28, target_width=28)
    training_data = (x_train[:50000], y_train[:50000])
    validation_data = (x_train[50000:], y_train[50000:])
    test_data = (x_test, y_test)
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = list(zip(validation_inputs, va_d[1]))
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = list(zip(test_inputs, te_d[1]))
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

import time

training_data, validation_data, test_data = load_data_wrapper()

net = Network([784, 30, 10])
tic = time.perf_counter()
net.SGD(training_data, 10, 10, 3.0, test_data=test_data)
toc = time.perf_counter(); print(toc-tic)

And for Python
PythonCPU

At least keep in mind that if ‘similar structure’ means ‘heavy reliance on vectorization’, then you might be giving up on what is the main benefit of Julia, which is that you don’t need to contort your problem to fit a vectorized workflow.

Without looking thoroughly at your code, it is conceivable (though not certain) that less vectorized code would be both shorter, simpler and faster. For that reason I’m a bit skeptical about keeping things ‘apples to apples’, since a style that is optimal in one language is suboptimal in another. It is better to strive for ‘similar effort’ as well as ‘simple and idiomatic’.

2 Likes

Yeah, that I can agree with.
And I’d love see a less vectorized code that is shorter and faster!

Wait, are you running this using the VSCode debugger (i.e. F5)? Please run from the REPL or CLI instead. The debugger will not compile user code by default so that you can set breakpoints, but it should not be used for benchmarking (just like you wouldn’t use the Python debugger for benchmarking).

Also, please post the full output of @time and not just the execution time. This is because we can see how much of a factor GC might be in the runtime **. Make sure you’ve run every function in the callstack at least once before timing too, calling SGD for one epoch should be enough. That eliminates any up-front compilation overhead and will give the best picture of runtime performance. ***

** I ran on a machine with 32GB of RAM. Yours only has 12, so there may be more pressure on the GC.

*** You can still time the full compile/runtime separately, but that won’t be a proper apples-to-apples comparison with Python runtime perf (especially considering Numpy etc. are all precompiled).

1 Like