Speed comparison: Python vs Julia for custom layers

New to Julia and looking at comparing forward-pass speed with PyTorch. I want to optimise the performance of training a neural network that will contain custom layers, and I have provided some example code below. I want to know how much I can speed up my Julia code. Currently the two implementations clock in at Python: 146 μs and Julia: 34.363 μs, which suggests that Julia is much faster. Can someone help me compare the two languages fairly for speed? Is it fairer to compare the backward pass as well? I am primarily interested in inference speed and secondly in memory overhead, which is why I didn’t include any optimisation of the network in the examples. Also, would these results change as the layers become sparser or as the size of the layers scales?

The Julia implementation:

using Flux
using Statistics
using LinearAlgebra
using GraphNeuralNetworks
using BenchmarkTools  # for @btime

struct CustomLayer <: GNNLayer
    Wq::AbstractMatrix
    Wk::AbstractMatrix
    Wv::AbstractMatrix
    dk::Float64
    σ::typeof(softmax)

end

Flux.@functor CustomLayer

function CustomLayer(n_in::Int, n_out::Int, n_neighbors::Int)
    Wq = Flux.glorot_uniform(n_in, n_neighbors)
    Wk = Flux.glorot_uniform(n_in, n_neighbors)
    Wv = Flux.glorot_uniform(n_in, n_out)
    dk = √n_in
    σ = softmax
    CustomLayer(Wq, Wk, Wv, dk, σ)
end

function (m::CustomLayer)(xi, xj)
    return transpose(m.σ(((xi * m.Wq) * transpose(xj * m.Wk)) ./ m.dk, dims=2) * (xj * m.Wv))
    
end


myLayer = CustomLayer(1000, 10, 20)

function test_function()
    xi = rand(Float32, 1, 1000)
    xj = rand(Float32, 10, 1000)
    out = myLayer(xi, xj)
    return
end

@btime test_function()

And the PyTorch implementation:
import torch
import torch.nn as nn
import timeit
import numpy as np

class CustomLayer(nn.Module):
    def __init__(self, n_in, n_out, n_neighbors):
        super().__init__()
        self.Wq = nn.Linear(n_in, n_neighbors)
        self.Wk = nn.Linear(n_in, n_neighbors) 
        self.Wv = nn.Linear(n_in, n_out)
        self.dk = np.sqrt(n_in)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, xi, xj):
        attention = self.softmax(self.Wq(xi) @ self.Wk(xj).T / self.dk)
        return attention @ self.Wv(xj)
    

myLayer = CustomLayer(1000, 10, 20)

def test_function():
    xi = torch.rand(1, 1000)
    xj = torch.rand(10, 1000)

    out = myLayer(xi, xj)
    return

number_exc = 10000
print(timeit.timeit("test_function()", setup="from __main__ import test_function", number=number_exc) / number_exc)


Just some general remarks:

  • You should avoid abstractly typed fields in structs, i.e., use type parameters instead:

     struct CustomLayer{Q,K,V} <: GNNLayer
         # If you know/want that all fields have the same type, use a single type parameter
         Wq::Q
         Wk::K
         Wv::V
         # the remaining fields are concretely typed already
         dk::Float64
         σ::typeof(softmax)  # with this type, the field can only ever hold the softmax function anyway
     end
    
  • The closest analog to a linear layer, i.e., nn.Linear, would be Dense(n_in => n_out, identity)

  • Torch has row-major arrays, whereas Julia uses column-major. Accordingly, Torch batches along the first dimension and Flux along the last. It’s often easiest (and fastest) to translate code from Python to Julia by simply reversing all tensor dimensions, i.e. xi = rand(Float32, 1000, 1) and m.Wk(xj)' * m.Wq(xi) etc. (see the sketch after this list).
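    A minimal sketch of that convention (my own illustrative code; it assumes bias-free Dense layers with the same shapes as the constructor above):

     using Flux  # Dense and softmax

     n_in, n_out, n_neighbors = 1000, 10, 20
     Wq = Dense(n_in => n_neighbors; bias=false)
     Wk = Dense(n_in => n_neighbors; bias=false)
     Wv = Dense(n_in => n_out; bias=false)
     dk = sqrt(n_in)

     # features along dim 1, batch/neighbours along dim 2
     xi = rand(Float32, n_in, 1)
     xj = rand(Float32, n_in, n_neighbors)

     α   = softmax(Wk(xj)' * Wq(xi) ./ dk; dims=1)  # (n_neighbors, 1) attention weights
     out = Wv(xj) * α                               # (n_out, 1)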


I think it would be fairer to measure without the random number generation, although it is probably negligible.

In this use case I think you’re okay with just the forward pass, at least if you work with a fully trained network from the start.

If you want to get peak performance from your Julia code, you may want to think about allocations. Maybe there is a way to reuse some storage in this computation? (See the sketch below.)
Of course this will also mean that the backward pass with Zygote will fail due to mutation, but if you don’t need it, who cares.
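For instance, here is a rough inference-only sketch (my own code, not from the original post) that preallocates every intermediate and reuses it with mul!; Zygote cannot differentiate through this kind of mutation:

using Flux                   # glorot_uniform
using LinearAlgebra: mul!

n_in, n_out, n_nb = 1000, 10, 20
Wq, Wk, Wv = Flux.glorot_uniform(n_in, n_nb), Flux.glorot_uniform(n_in, n_nb), Flux.glorot_uniform(n_in, n_out)
dk = Float32(sqrt(n_in))

q   = zeros(Float32, 1, n_nb)       # buffer for xi * Wq
k   = zeros(Float32, n_nb, n_nb)    # buffer for xj * Wk
v   = zeros(Float32, n_nb, n_out)   # buffer for xj * Wv
α   = zeros(Float32, 1, n_nb)       # buffer for the attention weights
out = zeros(Float32, 1, n_out)      # note: a (1, n_out) row; the original layer returns its transpose

function forward!(out, α, q, k, v, Wq, Wk, Wv, dk, xi, xj)
    mul!(q, xi, Wq); mul!(k, xj, Wk); mul!(v, xj, Wv)
    mul!(α, q, k')                          # raw attention scores
    α .= exp.((α .- maximum(α)) ./ dk)      # softmax(scores ./ dk), done in place
    α ./= sum(α)                            # (softmax is shift-invariant, so this matches)
    mul!(out, α, v)
    return out
end

xi = rand(Float32, 1, n_in)
xj = rand(Float32, n_nb, n_in)
forward!(out, α, q, k, v, Wq, Wk, Wv, dk, xi, xj)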

You should compile the PyTorch code (torch.compile); otherwise I think you’re leaving a lot of performance on the table. See the PyTorch docs for how to do that.

@gdalle @datnamer @bertschi tl;dr: I used your advice but haven’t found a speed-up yet. Best Julia: 131.717 μs, best Python: 151.624 μs (my bad, I reported the wrong numbers in the original post).

@gdalle

  • below is the Julia implementation without the random number generation
  • I actually am interested in the backward pass, but I wanted to know what the forward passes clocked in at first

@datnamer

  • I have used torch.compile() on my model, but this weirdly didn’t improve performance
  • 151.6243975 μs (without torch.compile)
  • 609.6357 μs (with torch.compile)

@bertschi Thanks for the info. I’ve changed the struct to take in type parameters.

Here’s a test for a single Flux Dense layer vs just multiplying the arrays. Dense layer wins.

n_in, n_out, n_neighbours = 1000, 10, 20
Wq = Flux.glorot_uniform(n_in, n_neighbours)
Wk = Flux.glorot_uniform(n_in, n_neighbours)
Wv = Flux.glorot_uniform(n_in, n_out)
Wq_dense = Flux.Dense(n_in => n_neighbours, identity; bias=false, init=Flux.glorot_uniform)
Wk_dense = Flux.Dense(n_in => n_neighbours, identity; bias=false, init=Flux.glorot_uniform)
Wv_dense = Flux.Dense(n_in => n_out, identity; bias=false, init=Flux.glorot_uniform)
xi = rand(Float32, n_in, 1)
xii = rand(Float32, 1, n_in)

@btime Wq_dense(xi) : 5.329 μs (2 allocations: 288 bytes)
@btime xii * Wq : 6.703 μs (1 allocation: 144 bytes)

For the whole layer, the difference is still small.
xj = rand(Float32, n_in, n_neighbours)
xjj = rand(Float32, n_neighbours, n_in)
@btime (Wv_dense(xj)) * (Wk_dense(xj) * Wq_dense(xi)): 128.563 μs (8 allocations: 5.80 KiB)
@btime (xii * Wq) * (xjj * Wk) * (xjj * Wv): 136.162 μs (5 allocations: 3.02 KiB)

For the whole implementation, my current version without the Dense layers is quicker…
@btime test_function() : 131.717 μs (14 allocations: 5.42 KiB)
@btime test_function_dense() : 133.867 μs (18 allocations: 10.58 KiB)

using Flux
using Statistics
using LinearAlgebra
using GraphNeuralNetworks
using BenchmarkTools

struct CustomLayer{L} <: GNNLayer
    Wq::L
    Wk::L
    Wv::L
    dk::Float64
    σ::typeof(softmax)

end

Flux.@functor CustomLayer

function CustomLayer(n_in::Int, n_out::Int, n_neighbors::Int)
    Wq = Flux.glorot_uniform(n_in, n_neighbors)
    Wk = Flux.glorot_uniform(n_in, n_neighbors)
    Wv = Flux.glorot_uniform(n_in, n_out)
    dk = √n_in
    σ = softmax
    CustomLayer(Wq, Wk, Wv, dk, σ)
end

function (m::CustomLayer)(xi, xj)
    return transpose(m.σ(((xi * m.Wq) * transpose(xj * m.Wk)) ./ m.dk, dims=2) * (xj * m.Wv))
    
end


struct CustomDense{Q} <: GNNLayer
    Wq::Q
    Wk::Q
    Wv::Q
    dk::Float64
    σ::typeof(softmax)

end

Flux.@functor CustomDense

function CustomDense(n_in::Int, n_out::Int, n_neighbours::Int)
    Wq = Flux.Dense(n_in => n_neighbours, identity; bias=false, init=Flux.glorot_uniform)
    Wk = Flux.Dense(n_in => n_neighbours, identity; bias=false, init=Flux.glorot_uniform)
    Wv = Flux.Dense(n_in => n_out, identity; bias=false, init=Flux.glorot_uniform)
    dk = √n_in
    σ = softmax
    CustomDense(Wq, Wk, Wv, dk, σ)
end

function (m::CustomDense)(xi, xj)
    return m.Wv(xj) * m.σ(m.Wk(xj)' * m.Wq(xi) ./ m.dk, dims=1)  # column-major equivalent of the CustomLayer forward pass
end


n_in = 1000
n_out = 10
n_neighbours = 20

myLayer = CustomLayer(n_in, n_out, n_neighbours)
xi = rand(Float32, 1, n_in)
xj = rand(Float32, n_neighbours, n_in)
myDense = CustomDense(n_in, n_out, n_neighbours)

function test_function()
    out = myLayer(xi, xj)
    return
end

function test_function_dense()
    out = myDense(xi', xj')
    return
end
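For completeness, a sketch of how the reported timings could be reproduced (these @btime calls are not part of the listing above; interpolating the globals with $ avoids also timing non-const global lookups):

using BenchmarkTools

@btime test_function()
@btime test_function_dense()

# or benchmark the layers directly with interpolated globals:
@btime $myLayer($xi, $xj)
@btime $myDense($(xi'), $(xj'))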

Would not expect Dense to be faster than doing the matrix multiplication yourself – in the end, it’s just doing the same thing under the hood anyway, as you can check with @less Wq_dense(xi) from the REPL. I just thought that it’s closer to using a linear layer in Python, since nn.Linear includes a bias as well.

Also, be careful when using a single type parameter in your struct CustomDense{Q}, as this requires the fields Wq, Wk and Wv to have the same type. Dense also stores the activation function in its type, so you cannot use different activations across your fields – which is of course fine if you won’t ever need/want that.
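A small illustration of the constraint (my own example, referring to the CustomDense definition from the listing above):

using Flux

d1 = Dense(4 => 4, relu)
d2 = Dense(4 => 4, identity)
typeof(d1) == typeof(d2)   # false: the activation function is part of Dense's type

# With a single parameter Q, all three weight fields must share that one type,
# so mixing activations like this has no matching constructor:
# CustomDense(d1, d2, d2, 1.0, softmax)   # MethodError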

Oh for the speed comparison? The main thing I’m interested in is just how fast I can make this layer… (while not messing up the backward pass)

Without a bias in nn.Linear(), Python is faster (without torch.compile).
Python: 87.9 μs
Julia: 128.3 μs

That’s to be expected since Flux is still doing the work of applying a null bias and no-op activation function.

To your original question, it may be worth also comparing backward pass timings unless you’re planning on loading pre-trained weights from somewhere else. It would help if you could share a little more about your real use case (e.g. the full networks you’re trying to run, whether you need GPU support, how important training speed is if at all).
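For example, one way to compare the two (a sketch, assuming the layer and inputs from the listings above; the sum-of-squares loss is just a placeholder):

using Flux, BenchmarkTools

loss(m, xi, xj) = sum(abs2, m(xi, xj))

@btime loss($myLayer, $xi, $xj)                  # forward only
@btime Flux.gradient(loss, $myLayer, $xi, $xj)   # forward + backward (Zygote)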

Isn’t this a sort of failed promise of Zygote? I would expect that a null operation would be compiled away.

That’s to be expected since Flux is still doing the work of applying a null bias and no-op activation function.

The faster Julia implementation is the one without the Dense layers; it uses plain matrices and only uses Flux to initialise the parameters. This means it’s not doing a null-bias operation, and the only activation function applied, for both Python and Julia, is the softmax.

To your original question, it may be worth also comparing backward pass timings unless you’re planning on loading pre-trained weights from somewhere else

I can compare the backward pass as well, because I am interested in both forward-only and forward + backward. Any optimisation suggestions would have to account for the backward pass, but I am also interested in forward execution time on its own.

the full networks you’re trying to run

I’m not going to share the full networks I want to run. I want to know how one would speed up this particular layer. I understand there are different ways to speed up the overall network by taking advantage of parallelism based on the overall structure, but in this question I am only interested in how one would speed up this layer (if you can think of a very similar layer that naturally has an easy way to speed it up, I would be interested in that as well).

whether you need GPU support

I want to know CPU speeds. I’m not familiar with the differences between parallelism on GPU vs CPU, but my interest is in what benchmark performance I will get on a CPU.

how important training speed is if at all

Training speed is variably important; it depends on the relative inference speed-up. I could say inference speed is 10x more important. Having said that, I would prefer to see all solutions for speeding up inference, as long as the backward pass still works.


Zygote does the automatic differentiation? In this example I’m not using any backward-pass functionality, so I wouldn’t have thought any gradient computation is taking place and affecting the speed. Are you saying Zygote should compile away bits of the forward pass if they’re redundant?

Sorry, my bad, no. It should do it for the reverse pass. But I am anyway surprised that the compiler does not compile away redundant operations, because this is something it should do. Maybe defining a special forward pass for Dense without a bias or activation would do the job.

# specialised method for the bias-free case (Flux stores a missing bias as `false::Bool`)
function (a::Dense{<:Any, <:Any, Bool})(x::AbstractVecOrMat)
  Flux._size_check(a, x, 1 => size(a.weight, 2))
  σ = NNlib.fast_act(a.σ, x)      # replaces tanh => tanh_fast, etc.
  xT = Flux._match_eltype(a, x)   # fixes Float64 input, etc.
  return σ.(a.weight * xT)        # skips the `.+ a.bias` of the generic method
end

If the difference is so substantial, it might be interesting to add it.


If you don’t need to train at all, there are a lot of optimizations which can be done, for example using in-place multiplication (mul!) and broadcasting. Likewise, if you don’t care about GPU support, libraries like SimpleChains.jl (https://github.com/PumasAI/SimpleChains.jl) are available and quite performant.


Broadcasting identity is not considered a no-op, nor is broadcasting adding false.

Why? Is it for safety reasons, or a limitation of the compiler?

Not sure. I know it can change the output array type, but I don’t recall where such semantics have been documented/discussed before.

If the difference is so substantial, it might be interesting to add it.

I will add this into the mix for testing speed.

SimpleChains.jl

This seems like a good option. If I can design layers in a similar way, with the only limitation being GPU support, and still optimise them, then this would be ideal. I will test this; are the docs the best place to pick this up?

Broadcasting identity is not considered a no-op, nor is broadcasting adding false.

Personally I’m not sure what this means. How should one use these bits of information?


@Tomas_Pevny I tried your function, but the type assertions following Dense were giving me an error. I then realised there isn’t a fast_softmax in NNlib. Also, I don’t really get how this would drop into my code.

@ToucheSir I’ve read this blog on SimpleChains: Doing small network scientific machine learning in Julia 5x faster than PyTorch, but realised that it processes each layer one after the other. To implement my layer with SimpleChains, either I need to have multiple different chains, which seems like a complicated way of doing this, or I can implement my own layer in SimpleChains, but doing that would basically just be answering my own question. The optimisations discussed in the blog are in the direction that I want, but I don’t know where to start.

Despite what some libraries say, softmax is not an activation function in the same way that e.g. relu is. It operates on an entire array at once. Hence it’s a separate operation/function which is not incorporated into the previous layer’s forward pass. NNlib’s softmax function is already “fast” in that sense because it already does some degree of numerical approximation.
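A tiny illustration of the distinction (my own example):

using NNlib: softmax, relu

x = randn(Float32, 3, 4)
relu.(x)             # elementwise; can be fused into the preceding layer's broadcast
softmax(x; dims=1)   # normalises whole columns; needs the entire array at once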

All ML libraries do this, because each layer depends on the output of the previous one. I can’t think of a model where that isn’t the case (note: even skip connections and other branching all run sequentially), which is another reason for you to share more about the one you’re trying to write :wink:
