Julia Performance - Help Needed

way2toronto · September 10, 2021, 7:35pm

Hello. Given the promises of performance, I’ve started my Julia work with a translation of a simple deep learning algo to Julia. My Julia code is below and the original Python code network.py is from the book by Michael Nielsen (https://github.com/mnielsen/neural-networks-and-deep-learning). The Julia performance is about 2.5 times slower than the Python one. Please help to understand where I am loosing my performance. I know the code is not optimized (both Julia and Python), but they are in line with one another. If I can at least match Python’s performance, that would be great, but, of course, it is desirable to outperform it. Thanks.

P.S. I’ve made changes based on several suggestions, but underperformance remains at the same level. I’ve been looking at CPU utilization and it is at about 30% with Julia code and 70% with Python code. Julia seems to be running on 1 core, while Python is not that clear, but looks like it utilizes more cores by default (does this make sense?). If that is the case, then I need to parallelize the Julia code to be compared with Python at the same CPU utilization. Any suggestions? Thanks guys.

===================================================================

using MLDatasets, Random

# Data
train_x_orig, train_y_orig = MNIST.traindata();
test_x_orig,  test_y_orig  = MNIST.testdata();

function vectorized_result(y)
    e = zeros(10)
    e[y+1] = 1
    return e
end

train_x, train_y = Float64.(reshape(train_x_orig[:,:,1:50000], (28*28, :))), [vectorized_result(y) for y in train_y_orig[1:50000]];
valid_x, valid_y = Float64.(reshape(train_x_orig[:,:,50001:end], (28*28, :))), train_y_orig[50001:end];
test_x, test_y = Float64.(reshape(test_x_orig, (28*28, :))), test_y_orig;

function sigmoid(z)
    return @. 1.0./(1.0 + exp(-z))
end

function sigmoid_prime(z)
    s = sigmoid(z);
    return s.*(1-s);
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w*a .+ b)
    end
    return vec(a)
end

function evaluate(test_x, test_y, biases, weights)
    test_results = [argmax(predict(col, biases, weights))-1 for col in eachcol(test_x)]
    return sum(test_results.==test_y)
end

function cost_prime(output, y)
    return output - y
end

function backpropagation(x, y, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases];
    nabla_w = [zeros(size(w)) for w in weights];
    a = x;
    as = [a];
    zs = Vector{Float64}[];
    for (b, w) in zip(biases, weights)
        z = w * a + b;
        push!(zs, z);
        a = sigmoid.(z);
        push!(as, vec(a));
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end]);
    nabla_b[end] = delta;
    nabla_w[end] = delta*as[end-1]';
    for l=1:length(zs)-1
        delta = weights[end-l+1]' * delta .* sigmoid_prime.(zs[end-l]);
        nabla_b[end-l] = delta;
        nabla_w[end-l] = delta * as[end-l-1]';
    end
    return (nabla_b, nabla_w);
end

function update_mini_batch(train_x_minibatch, train_y_minibatch, mini_batch_size, eta, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases];
    nabla_w = [zeros(size(w)) for w in weights];
    for i=1:mini_batch_size
        (delta_nabla_b, delta_nabla_w) = backpropagation(train_x_minibatch[:, i], train_y_minibatch[i], biases, weights);
        nabla_b = [nb + dnb for (nb, dnb) in zip(nabla_b, delta_nabla_b)]; nabla_w = [nw + dnw for (nw, dnw) in zip(nabla_w, delta_nabla_w)]; # Loop over layers
    end
    biases  = [b - nb*eta/mini_batch_size for (b, nb) in zip(biases, nabla_b)];
    weights = [w - nw*eta/mini_batch_size for (w, nw) in zip(weights, nabla_w)];
    return (biases, weights);
end

function SGD(train_x, train_y, epochs, mini_batch_size, eta, biases, weights, test_x, test_y)
    n_test = length(test_y);
    n_train = length(train_y);
    for j=1:epochs
        train_x_minibatches = [train_x[:, k:k+mini_batch_size-1] for k=1:mini_batch_size:n_train];
        train_y_minibatches = [train_y[k:k+mini_batch_size-1] for k=1:mini_batch_size:n_train];
        for mini_batch=1:length(train_y_minibatches)
            (biases, weights) = update_mini_batch(train_x_minibatches[mini_batch], train_y_minibatches[mini_batch], mini_batch_size, eta, biases, weights);
        end
        res = evaluate(test_x, test_y, biases, weights);
        print("Epoch $j: $res / $n_test\n");
    end
    return (biases, weights);
end

layer_sizes = [784, 30, 10];
biases = [randn(l) for l in layer_sizes[2:end]];
weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end-1], layer_sizes[2:end])];

@time (biases_opt, weights_opt) = SGD(train_x, train_y, 5, 10, 3.0, biases, weights, test_x, test_y);

jzr · September 10, 2021, 7:57pm

You can edit your post adding triple backticks

```
like this
```

to get code formatting, which will make it easier to read.

way2toronto · September 10, 2021, 8:06pm

Thanks!

Oscar_Smith · September 10, 2021, 8:23pm

This won’t solve the whole thing, but defining

function sigmoid_prime(z)
    s = sigmoid(z)
    return s*(1-s)
end

will be about 2x faster (for that part), since it doesn’t have to calculate sigmoid twice.

Elrod · September 10, 2021, 10:47pm

How about

function sigmoid_prime(z)
    s = sigmoid(z)
    return muladd(-s, x, x)
end

This should use a single vfnmadd (fused negative multiply add) instruction instead of a sub and mul.
Plenty of other minor possibilities, e.g.

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a .= sigmoid.(w*a .+ b);
    end
    return vec(a);
end

Oscar_Smith · September 10, 2021, 10:49pm

I was going for more new user targeted feedback, but that would be better.

DNF · September 10, 2021, 11:15pm

I didn’t read all your code, but this

is a big red flag. It will create a Vector{Any}, which is terrible for performance. Make sure to give it a proper type, such as

zs = Float64[]

or whatever is appropriate.

DNF · September 10, 2021, 11:22pm

This isn’t so much about performance, but there’s little reason to create an nx1 matrix and then turn it into a vector:

Just make it a vector in the first place:

e = zeros(10)
e[y+1] = 1
return e

And drop the semicolons.

way2toronto · September 11, 2021, 6:59pm

Thanks. This has sped up the code by about 5%. But so did the Python code, so the relative performance remains about 2.5x in favour of Python.

way2toronto · September 11, 2021, 7:01pm

Agreed. I’ve made the changes.

way2toronto · September 11, 2021, 7:02pm

Changed this as well, however it doesn’t help with the performance. I even tried pre-allocating as and zs at a parent function level and passing as arguments, but doesn’t help either. The bottleneck is somewhere else.

way2toronto · September 11, 2021, 7:06pm

Thank you and I agree with the suggestion if absolute performance is needed. However, I try to keep the Julia code in sync with Python part so that the comparison is apples to apples.

lmiq · September 11, 2021, 7:48pm

You code seems reasonably well structured in functions. Do you know where is the bottleneck? Which functions take most of the time?

You can profile it: Profiling · The Julia Language

and then ask for help in optimizing the function which is slowing your code down.

At the same time, it is not reasonable to “compare apples to apples” in that sense: on the one hand, if you keep your code as the same series of vectorized operations, i. e. if the python code is already just a bunch of vectorized calls to library functions in all performant critical parts, then there is no reason for Julia to be faster. Julia will be similar in speed if these vector operations are made “correctly” from the Julia point of view as well (mainly I mean that you have to take care of using @views and fusing broadcasted operations; there are some situations where you may be getting that from the python syntax and not from the Julia one as it is).

But to really take advantage of Julia you need to use the structure of your problem and operations and liberate your self from the constraint of having to write everything as a series vectorized operations. Then, if your problem has specific characteristics that allow optimization, the Julia code can be much faster. The possibility of fusing many vector operations in a loop, avoiding a lot of intermediate vectors, is one of those possibilities.

ToucheSir · September 12, 2021, 6:23pm

I took the liberty of tweaking the code to be as close to the updated Python implementation at https://github.com/MichalDanielDobrzanski/DeepLearningPython/blob/master/network.py as possible:

using MLDatasets, Random, BenchmarkTools

function vectorized_result(y)
    e = zeros(10)
    e[y + 1] = 1
    return e
end

process_sample(x, y) = collect(reshape(x, 28 * 28, 1)), vectorized_result(y)


sigmoid(z) = 1 / (1 + exp(-z))

function sigmoid_prime(z)
    s = sigmoid(z)
    return s * (1 - s)
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w * a .+ b)
    end
    return vec(a)
end

function evaluate(test_data, biases, weights)
    test_results = [(argmax(predict(x, biases, weights)), y) for (x, y) in test_data]
    return sum(y[pred] for (pred, y) in test_results)
end

cost_prime(output, y) = output - y

function backpropagation(x, y, biases, weights, nabla_b, nabla_w)
    nabla_b = [zeros(size(b)) for b in biases]
    nabla_w = [zeros(size(w)) for w in weights]
    a = x
    as = [a]
    zs = Matrix{Float64}[]
    for (b, w) in zip(biases, weights)
        z = w * a + b
        push!(zs, z)
        a = sigmoid.(z)
        push!(as, a)
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end])
    nabla_b[end] = delta
    nabla_w[end] = delta * as[end - 1]'
    for l in 1:length(zs) - 1
        delta = weights[end - l + 1]' * delta .* sigmoid_prime.(zs[end - l])
        nabla_b[end - l] = delta
        nabla_w[end - l] = delta * as[end - l - 1]'
    end
    return nabla_b, nabla_w
end

function update_mini_batch(minibatch, eta, biases, weights)
    nabla_b = [zeros(size(b)) for b in biases]
    nabla_w = [zeros(size(w)) for w in weights]
    for (x, y) in minibatch
        delta_nabla_b, delta_nabla_w = backpropagation(x, y, biases, weights, nabla_b, nabla_w)
        nabla_b = [nb + dnb for (nb, dnb) in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for (nw, dnw) in zip(nabla_w, delta_nabla_w)]
    end
    biases  = [b - (eta / length(minibatch)) * nb for (b, nb) in zip(biases, nabla_b)]
    weights = [w - (eta / length(minibatch)) * nw for (w, nw) in zip(weights, nabla_w)]
    return biases, weights
end

function SGD(train_data, epochs, mini_batch_size, eta, biases, weights, test_data)
    n_train = length(train_data)
    for j in 1:epochs
        shuffle!(train_data)
        train_minibatches = [train_data[k:k + mini_batch_size - 1] for k in 1:mini_batch_size:n_train]
        for minibatch in train_minibatches
            (biases, weights) = update_mini_batch(minibatch, eta, biases, weights)
        end
        res = evaluate(test_data, biases, weights)
        print("Epoch $j: $res / $(length(test_data))\n")
    end
    return biases, weights
end

let
    train_x, train_y = MNIST.traindata(Float64, 1:50_000)
    train_data = process_sample.(eachslice(train_x, dims=3), train_y)

    valid_x, valid_y = MNIST.traindata(Float64, 1:50_000)
    valid_data = process_sample.(eachslice(valid_x, dims=3), valid_y)

    test_x, test_y = MNIST.testdata(Float64)
    test_data = process_sample.(eachslice(test_x, dims=3), test_y)
    layer_sizes = [784, 30, 10]
    biases = [randn(l, 1) for l in layer_sizes[2:end]]
    weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end - 1], layer_sizes[2:end])]

    @time SGD(train_data, 5, 10, 3.0, biases, weights, test_data);
end

After warmup, this runs in ~20.5s (-1 or 2s if you warmup) vs ~14.5s for the Python version. As expected, it also uses all cores on my machine. So still a large gap, but nowhere near 2.5x.

Now as @lmiq mentioned, profiling is a good way to find bottlenecks with the code. If you toss this routine under a profiler (VSCode has a nice visual one):

# Warmup for a cleaner profile
SGD(train_data, 1, 10, 3.0, biases, weights, test_data);
@profview SGD(train_data, 5, 10, 3.0, biases, weights, test_data);

The flamegraph shows mostly what you’d expect: matrix multiplication (mul! and gemm) and pointwise operations (+ and *) that use broadcasting under the hood. What does stand out is the leftmost group: ~4s of computation is spent on zeros -> fill! -> setindex!. Seems weird that zeroing out an array would be on the hot path, but let’s try not doing it and seeing what happens:

# nabla_b = [zeros(size(b)) for b in biases]
# nabla_w = [zeros(size(w)) for w in weights]
nabla_b = [similar(b) for b in biases]
nabla_w = [similar(w) for w in weights]

# nabla_b = [np.zeros(b.shape) for b in self.biases]
# nabla_w = [np.zeros(w.shape) for w in self.weights]
nabla_b = [np.empty(b.shape) for b in self.biases]
nabla_w = [np.empty(w.shape) for w in self.weights]

Now the Julia version is ~15s after warmup. Paradoxically, the Python version is ~1.5s slower! Armed with this information, we can run a direct comparison of both zeros functions:

In [1]: %timeit x = np.zeros((1024, 1024), dtype=np.float64)
285 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

julia> @benchmark x = zeros(Float64, 1024, 1024)
BenchmarkTools.Trial: 7380 samples with 1 evaluation.
 Range (min … max):  567.270 μs …   1.527 ms  ┊ GC (min … max):  0.00% … 58.69%
 Time  (median):     588.213 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   674.429 μs ± 182.927 μs  ┊ GC (mean ± σ):  11.89% ± 16.61%

  █                                                              
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  608 μs           Histogram: frequency by time          592 μs <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

Given how often both the backprop and batch update functions are called, this adds up quickly!

Looking through both implementations, I think the difference is that Numpy uses calloc (which can be lazy) for its zeros implementation, whereas Julia eagerly fills with zeros. There is an issue open for Julia to use calloc for zeros as well, but AFAICT that hasn’t be implemented.

With the objective results out of the way, please allow me a bit of a rant on this. The original Python tutorial implementation is, for lack of a better term, horrendously unoptimized. I understand the need to keep things clear for pedagogical value, but most of the choices made don’t help there or even obfuscate the underlying behaviour. Just a couple examples to hammer home this point:

Instead of creating dense minibatches of shape Nx784, it gloms together N 784x1 samples into one list and runs them through the network one by one. This takes away a lot of the efficiency gains from using BLAS, and it’s not even any clearer because there’s that random unexplained batch dimension in every input in addition to the batch dimension created by the vector length. Had they just used a proper 2d array, this would be a non-issue.
A complete allergy to in-place operations. This:

self.weights = [w-(eta/len(mini_batch))*nw
                for w, nw in zip(self.weights, nabla_w)]

Is not strictly more legible or comprehensible than this:

for w, nw in zip(self.weights, nabla_w):
    w -= eta/len(mini_batch))*nw

Which has the benefit of allocating less. I’d argue it’s more clear for nabla_[bw], as it clearly signals the values are being updated/accumulated into and instead of just redefined.

Useless pre-allocation. This is really the crux of the matter, because it accounts for almost the entire discrepancy between Python and Julia **. For lack of a better term, these two lines are useless because the indices are completely overwritten afterwards. What ought to happen is that update_mini_batch passes the nablas into backprop, which accumulates into them in-place. Not only does that establish a clear relationship between both sets of variables, but it dramatically reduces the amount of garbage generated (for reference, @time reports a staggering 150 GB).

Now complaining about a free thing someone poured their time into on the internet is old hat, but I find it worrying that a tutorial like this still has reach despite many better implementations existing on the internet (the number of search results for “numpy neural network MNIST” is overwhelming) and being stuck on an out-of-date language (Python 2).
So thank you for taking the time to port over this code and post about it. I think the 2 biggest takeaways are a) someone should revisit calloc for zeros, and b) I’m making that github repo/book an explicit anti-recommendation when anybody asks about ML tutorial resources

** because almost all of the performance comes from optimized BLAS and pointwise operation routines, you’d be unlikely to see a significant difference between Python, Julia, or any other language that can make use of the above. Conversely, a language with good native codegen shines when such routines do not exist and you need to write them yourself: doing something like https://github.com/JuliaOptimalTransport/OptimalTransport.jl/blob/master/src/quadratic_newton.jl with Numpy efficiently would be quite “fun”.

way2toronto · September 13, 2021, 2:49pm

Thanks for the high-level advice. Appreciate the feedback and that’s something will help me learn Julia.
Just to clarify my reference to “apples to apples”, I meant to be comparing similar structure/logic of the code to prevent over-optimizing one or the other, and to keep the implementation effort of about the same. At the same time, as a new user of Julia, all your points are very valuable. Thank you.

way2toronto · September 13, 2021, 3:09pm

Thanks a lot for taking the time to look at the code and suggestions around particular points. I have implemented you version (with minor changes) and I still see 2.1x underperformance. Not sure what is going on. I will post the full versions of the code below, but here are the main points:

I’ve changed the Julia parts on nabla_b, nabla_w – those that are used in backprop (agreed, that they are overwritten and can be initialized by any values for speed) are different from the ones in the update_minibatch (these need to be initialized as 0s since they are being accumulated and they cannot be passed to backrop to avoid being modified at the parent level, otherwise the results will be wrong, e.g. the accuracy on the test results should be at least 9300/10000 after about 4-5 epochs). My way around the problem (see the code below) was to eliminate nablas in the parent function and add accumulated delta_nablas directly to weights. While this works in this particular case, it may not be optimal in more complex deep learning algos, but to eliminate the problem with zeros() that you found, I’ve implemented it that way. Julia version is 1.6.2. and is being run from VSCode. Runtime is 170s for 10 epochs.
I am using Python 3 for my version of the Python code (3.9.6), which is being run from Spyder 5.0.5. Runtime is 80s for 10 epochs
CPU utilization (fluctuates around 25% for Julia and 70% for Python)

Julia code

using MLDatasets, Random, BenchmarkTools

function vectorized_result(y)
    e = zeros(10)
    e[y + 1] = 1
    return e
end

process_sample(x, y) = collect(reshape(x, 28 * 28, 1)), vectorized_result(y)

sigmoid(z) = 1 / (1 + exp(-z))

function sigmoid_prime(z)
    s = sigmoid(z)
    return s * (1 - s)
end

function predict(a, biases, weights)
    for (b, w) in zip(biases, weights)
        a = sigmoid.(w * a .+ b)
    end
    return vec(a)
end

function evaluate(test_data, biases, weights)
    test_results = [(argmax(predict(x, biases, weights)), y) for (x, y) in test_data]
    return sum(y[pred] for (pred, y) in test_results)
end

cost_prime(output, y) = output - y

function backpropagation(x, y, biases, weights)
    delta_nabla_b = [similar(b) for b in biases]
    delta_nabla_w = [similar(w) for w in weights]
    a = x
    as = [a]
    zs = Matrix{Float64}[]
    for (b, w) in zip(biases, weights)
        z = w * a + b
        push!(zs, z)
        a = sigmoid.(z)
        push!(as, a)
    end
    delta = cost_prime.(as[end], y) .* sigmoid_prime.(zs[end])
    delta_nabla_b[end] = delta
    delta_nabla_w[end] = delta * as[end - 1]'
    for l in 1:length(zs) - 1
        delta = weights[end - l + 1]' * delta .* sigmoid_prime.(zs[end - l])
        delta_nabla_b[end - l] = delta
        delta_nabla_w[end - l] = delta * as[end - l - 1]'
    end
    return delta_nabla_b, delta_nabla_w
end

function update_mini_batch(minibatch, eta, biases, weights)
    for (x, y) in minibatch
        delta_nabla_b, delta_nabla_w = backpropagation(x, y, biases, weights)
        biases  = [b - (eta / length(minibatch)) * nb for (b, nb) in zip(biases, delta_nabla_b)]
        weights = [w - (eta / length(minibatch)) * nw for (w, nw) in zip(weights, delta_nabla_w)]        
    end
    return biases, weights
end

function SGD(train_data, epochs, mini_batch_size, eta, biases, weights, test_data)
    n_train = length(train_data)
    for j in 1:epochs
        shuffle!(train_data)
        train_minibatches = [train_data[k:k + mini_batch_size - 1] for k in 1:mini_batch_size:n_train]
        for minibatch in train_minibatches
            (biases, weights) = update_mini_batch(minibatch, eta, biases, weights)
        end
        res = evaluate(test_data, biases, weights)
        print("Epoch $j: $res / $(length(test_data))\n")
    end
    return biases, weights
end


let
    train_x, train_y = MNIST.traindata(Float64, 1:50_000);
    train_data = process_sample.(eachslice(train_x, dims=3), train_y);

    valid_x, valid_y = MNIST.traindata(Float64, 50_001:60_000);
    valid_data = process_sample.(eachslice(valid_x, dims=3), valid_y);

    test_x, test_y = MNIST.testdata(Float64);
    test_data = process_sample.(eachslice(test_x, dims=3), test_y);

    layer_sizes = [784, 30, 10];
    biases = [randn(l, 1) for l in layer_sizes[2:end]];
    weights = [randn(ll, l) for (l, ll) in zip(layer_sizes[1:end - 1], layer_sizes[2:end])];

    @time SGD(train_data, 10, 10, 3.0, biases, weights, test_data);
    #@profview SGD(train_data, 1, 10, 3.0, biases, weights, test_data);
end

Python code

import random
import numpy as np

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases  = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.empty(b.shape) for b in self.biases]
        nabla_w = [np.empty(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sigmoid_prime(z)
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s*(1-s)





"""
mnist_loader
"""

import skimage
import skimage.transform
import numpy as np
from tensorflow.keras.datasets import mnist

def load_mnist(target_height=64, target_width=64):
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    def preprocess_mnist(df, target_height, target_width):
        df = [skimage.transform.resize(x, (target_height, target_width)).astype(np.float64) for x in df]
        return df

    x_train = preprocess_mnist(x_train, target_height, target_width)
    x_test = preprocess_mnist(x_test, target_height, target_width)
    return x_train, y_train, x_test, y_test

def load_data():
    x_train, y_train, x_test, y_test = load_mnist(target_height=28, target_width=28)
    training_data = (x_train[:50000], y_train[:50000])
    validation_data = (x_train[50000:], y_train[50000:])
    test_data = (x_test, y_test)
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = list(zip(validation_inputs, va_d[1]))
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = list(zip(test_inputs, te_d[1]))
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

import time

training_data, validation_data, test_data = load_data_wrapper()

net = Network([784, 30, 10])
tic = time.perf_counter()
net.SGD(training_data, 10, 10, 3.0, test_data=test_data)
toc = time.perf_counter(); print(toc-tic)

way2toronto · September 13, 2021, 3:10pm

And for Python
PythonCPU

DNF · September 13, 2021, 3:23pm

At least keep in mind that if ‘similar structure’ means ‘heavy reliance on vectorization’, then you might be giving up on what is the main benefit of Julia, which is that you don’t need to contort your problem to fit a vectorized workflow.

Without looking thoroughly at your code, it is conceivable (though not certain) that less vectorized code would be both shorter, simpler and faster. For that reason I’m a bit skeptical about keeping things ‘apples to apples’, since a style that is optimal in one language is suboptimal in another. It is better to strive for ‘similar effort’ as well as ‘simple and idiomatic’.

way2toronto · September 13, 2021, 3:29pm

Yeah, that I can agree with.
And I’d love see a less vectorized code that is shorter and faster!

ToucheSir · September 13, 2021, 4:22pm

Wait, are you running this using the VSCode debugger (i.e. F5)? Please run from the REPL or CLI instead. The debugger will not compile user code by default so that you can set breakpoints, but it should not be used for benchmarking (just like you wouldn’t use the Python debugger for benchmarking).

Also, please post the full output of @time and not just the execution time. This is because we can see how much of a factor GC might be in the runtime **. Make sure you’ve run every function in the callstack at least once before timing too, calling SGD for one epoch should be enough. That eliminates any up-front compilation overhead and will give the best picture of runtime performance. ***

** I ran on a machine with 32GB of RAM. Yours only has 12, so there may be more pressure on the GC.

*** You can still time the full compile/runtime separately, but that won’t be a proper apples-to-apples comparison with Python runtime perf (especially considering Numpy etc. are all precompiled).

Topic		Replies	Views
Julia or Python: Which is faster for Deep Learning Performance	3	1629	May 7, 2020
General questions from Python user Performance	59	4245	March 8, 2021
Julia vs. Python Performance on package catch22: Why is Julia slower in this case and how can I improve it? Performance question	11	919	December 16, 2024
Performance input on my code Performance	8	837	October 15, 2018
Python vs. Julia ODE Solver Performance python	30	4364	March 3, 2021

Julia Performance - Help Needed

Related topics