# How to make my Julia code fast like MATLAB?

Hi All, I am new to Julia and really struggling to understand what is making my code which convolutes a signal with a Gaussian or Lorentzian so slow. I have the equivilent MATLAB code that runs in 4 seconds when vectorised properly. In Julia, it takes:

``````368.047741 seconds (19.37 G allocations: 288.664 GiB, 14.30% gc time, 0.03% compilation time)
``````

This is regardless of me using a triple nested for loop to calculate the sum, a double nested loop and the inbuilt `sum()` function, or replacing the nested loops with a multiplication of the row of every signal with the circulant kernel matrix - either using `*` or `mul!()` are both slow. It is probably worth mentioning that the signal array is size `(3801, 670)` after padding the orginal signal and the kernel is size `(3801, 3801)`.

The code is as follows, please feel free to tear into it…

``````using MAT
using SpecialMatrices

wd = "/path/to/files"

inputfile = matopen(wd*"/AllTraj_Signal.mat")
close(inputfile)
replace!(signal, NaN=>0)
signal = signal[2:671, 1:2001]
signal[:, 1] = signal[:, 2]

inputfile = matopen(wd*"/Vectors.mat")

struct Gaussian
Fx:: Array{Float64}
sigma:: Float64
height:: Float64
fwhm:: Float64
tconv:: StepRangeLen{Float64}
ntc:: Int64
duration:: Float64
dt:: Float64
tmin :: Float64
tmax:: Float64
nt:: Int64
function Gaussian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64, tmin:: Float64, tmax::Float64, nt:: Int64)
duration = fwhm*3
sigma = fwhm/2sqrt(2log(2))
height = 1/sigma*sqrt(2pi)
Fx = height .* exp.( (-1 .* (x .- x0).^2)./ (2*sigma^2))
Fx = Circulant(Fx)
tconv = x; ntc = length(x)
dt = tconv-tconv
nt = nt
new(Fx, sigma, height, fwhm, tconv, ntc, duration, dt, tmin, tmax, nt)
end
end

struct Lorentzian
Fx:: Array{Float64}
gamma:: Float64
fwhm:: Float64
tconv:: StepRangeLen{Float64}
ntc:: Int64
duration:: Float64
dt:: Float64
tmin :: Float64
tmax:: Float64
nt:: Int64
function Lorentzian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64, tmin::Float64, tmax:: Float64, nt:: Int64)
gamma = fwhm/2
duration = fwhm*3
Fx = 1 ./ ((pi * gamma) .* (1 .+ ((x .- x0)./ gamma).^2 ))
Fx = Circulant(Fx)
tconv = x; ntc = length(x)
dt = tconv-tconv
nt = nt
new(Fx, gamma, fwhm, tconv, ntc, duration, dt, tmin, tmax, nt)
end
end

function generate_kernel(fwhm:: Float64, time:: StepRangeLen{Float64}, Flag:: Int)

dt:: Float64 = sum(diff(time, dims=1))/(length(time)-1)
nt:: Int64 = length(time)
duration:: Float64 = fwhm*3 # must pad three time FHWM to deal with edges
if dt != diff(time, dims=1) ; error("Non-linear time range."); end

tmin:: Float64 = minimum(time); tmax:: Float64 = maximum(time)
tconv_min:: Float64 = tmin - duration ; tconv_max:: Float64 = tmax + duration
tconv = tconv_min:dt:tconv_max # extended convoluted time vector

if Flag == 0
K = Gaussian(fwhm, tconv, tconv, tmin, tmax, nt)
elseif Flag == 1
K = Lorentzian(fwhm, tconv, tconv, tmin, tmax, nt)
else
error("Only Gaussian and Lorentzian functions currently available.")
end
return K
end

function convolve(S:: Array{Float64}, K:: T) where {T<:Union{Gaussian, Lorentzian}}

nr:: Int64, nc:: Int64 = size(S)
if nr != K.nt && nc != K.nt; error("Inconsistent dimensions."); end
if nc != K.nt; S = copy(transpose(S)); nr, nc = nc, nr; end # column always time axis

padend:: Int64 = (K.duration/K.dt) # index at which 0 padding ends and signal starts

println("Convolving signal with a \$(typeof(K)) function with FWHM: \$(K.fwhm) fs.")

sC = zeros(Float64, nr, K.ntc)
conv = zeros(Float64, nr, K.ntc)

for t in 1:K.ntc  # K.ntc = 3801
for q in 1:nr  # nr = 670
#conv[q, t] = sum(collect(sC[q, 1:K.ntc]) .* collect(K.Fx[t, :]))*K.dt  # usiing sum function method
for j in 1:K.ntc
conv[q, t] += sC[q, j] * K.Fx[t, j] * K.dt  # convolute with kernel
end
end
end

return conv
end

fwhm = 100.0
tvec = 0:0.5:1000
kernel = generate_kernel(fwhm, tvec, 0)
conv = convolve(signal, kernel)
``````

I have tried to define all my types and split up the generation of the kernel into two types and keep the actual convolution seperate from this, but it has not helped speed wise. Any tips or insights would be greatly appreciated! A couple notes. For one, you are creating a nice circulant matrix, but `Fx::Array{Float64}` is not only throwing away that nice structure I think by converting to an `Array`, but it is actually an abstract type, so that’s not ideal.

Also, while I could be missing something, I think if you actually kept the structure of the circulant matrix, just doing `struct.Fx*...` would be very fast and effective because then you’d actually be using the structure of the circulant matrix.

EDIT: Oh, and one more thing—maybe there’s some reason you’re doing it this way, but `DSP.jl` I think provides some nicely tuned and generic convolution functions and stuff, so there are plenty of very convenient ways to do this without ever building a matrix or anything.

5 Likes

Since I don’t have your input files, I can obviously not run the code (it would be better to submit a MWE that does not need those files), but I will try to give some suggestions anyway.

Most of the type annotations here do not affect performance at all. The only place where you need the types is in the declaration of the struct fields, and in that case it is crucial for performance that they are concrete types. Some of the fields in your structs are abstractly typed (`Array{Float64}` and `StepRangeLen{Float64}` are not concrete). If you cannot specify them completely, you can use a parametric struct, e.g.:

``````struct Gaussian{T<:Circulant, S<:StepRangeLen{Float64}}
Fx::T
sigma::Float64
height::Float64
fwhm::Float64
tconv::S
ntc::Int64
duration::Float64
dt::Float64
tmin::Float64
tmax::Float64
nt::Int64
end
``````

I didn’t look at the rest of your code yet, this is just the first thing I noticed.

Edit: I didn’t notice at first that you are actually trying to use a special matrix type. In this case, you should of course use `Circulant` instead of `Array` in your struct.

6 Likes

From a cursory look your objects are not type-stable. `Array{Float634}` is not a concrete type.

Either use `Array{Float64,2}` or similar, or just use a type parameter everywhere like A and let them be anything, and you wont have this problem again. Probably the same advice applies to other fields of your objects too, and maybe some code. If the types in hot code are not stable, the performance will not be as good as Matlab.

5 Likes

Thanks! So replacing `Fx:: Array{Float64}` with `Fx:: Circulant{Float64}`? When I am defining structs it is a good idea to be as specific as possible regarding the type then, instead of using Array I should use Matrix, or Vector depending on the situation? I have wrapped the nested for loop in its own function too, this speeds the calculation up to about 60 seconds.

I wanted to do it this way more for my own satisfaction, I like to try understanding a concept by coding it myself first. I shall look into DSP, I imagine it is a lot fast than my code…

Ok great, thanks for the tips! Why would defining the struct as,

``````struct Gaussian{T<:Circulant, S<:StepRangeLen{Float64}}
Fx::T
sigma::Float64
height::Float64
fwhm::Float64
tconv::S
ntc::Int64
duration::Float64
dt::Float64
tmin::Float64
tmax::Float64
nt::Int64
end
``````

be better than,

``````struct Gaussian
Fx::Circulant
sigma::Float64
height::Float64
fwhm::Float64
tconv::StepRangeLen{Float64}
ntc::Int64
duration::Float64
dt::Float64
tmin::Float64
tmax::Float64
nt::Int64
end
``````

The former gave me the following error,
`ERROR: LoadError: syntax: too few type parameters specified in "new{...}"`

I think the most “julian” advice would be to try and write your structs as specifically as possible while still leaving it as generic/flexible as possible. Which sounds sort of contradictory, but as @sostock suggested, what’s nice about something like

``````struct Gaussian{F}
Fx :: F
....
``````

which is a very common pattern in the language, is that if you pass in a special struct that has neat methods for accelerated `Base.:*`, then the `Gaussian` struct will preserve that special object and all of your code that uses `Fx` will benefit. But if for some reason I need to pass in an `Fx` that isn’t a circulant matrix or is somehow incompatible with some part, for example being a type that doesn’t work with the required `fft` under the hood of circulant matrix implementations, then the resulting `Gaussian{Matrix{WeirdType}}` will still be nice and type-stable, so you won’t actually have to change any code.

3 Likes

You could try to switch the order of your `q` and `j` loops. Julia is column-major, so the innermost loop should be the first index.
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-column-major

4 Likes

Ok, I only tried to do this in a nested loop as the matrix multiplication I was using was very slow. Instead of the nested loop I wanted to multiply a signal vector by the circulant matrix. This is what takes 4 seconds in MATLAB.

``````for q=1:nr
conv[q, :] = (K.Fx * sC[q, 1:end]) .* K.dt
end
``````

Or with no loop,

``````conv = mapslices(x -> K*x, sC; dims=2)
``````

`mul` is also very slow at this. Perhaps I would need to directly use BLAS routines to get comparable speed of matrix multiplication to MATLAB.

I am not following the thread in detail, but this in Julia creates a new array at every iteration of the loop. Use `@view(sC[q,1:end])` or more simply `@view(sC[q,:])`.

You can also put `@views` in front of the function definition (`@views function...`) and then all slices of arrays in the function will be treated as views (not copies).

edit: your Julia code does not have that. sorry

The thing here is that perhaps `Circulant` has parameters. And if it has parameters, the struct is accepting all types of `Circulant`, which makes of it type-unstable, and the compiled functions cannot be specialized to a single type. (I do not know what `Circulant` is, but this is a general concept).

If you use

``````struct Gaussian{T}
Fx::T
``````

When you instantiate your variable of `Gaussiant` type it will specialize to a specific type of `Circulant` used in the specialization.

It is analogous to this, with vectors:

``````julia> struct A
x::Vector
end

julia> struct B{T}
x::T
end

julia> x = A(rand(Float64,3))
A([0.22087576765549888, 0.458109021965067, 0.8571115429656655])

julia> typeof(x)
A

julia> x = B(rand(Float64,3))
B{Vector{Float64}}([0.7203620220131972, 0.6465793604365255, 0.6881096049371642])

julia> typeof(x)
B{Vector{Float64}}
``````

Note that type `B` carries the signature of the type of vector it contains, but `A` does not. That makes all the difference to the compiler, to optimize your code.

For example:

``````julia> x = A(rand(Float64,3))
A([0.7582620326245211, 0.3436082850830309, 0.10922331285448794])

julia> @btime sum(\$x.x)
13.956 ns (1 allocation: 16 bytes)
1.21109363056204

julia> x = B(rand(Float64,3))
B{Vector{Float64}}([0.14043993284712464, 0.6316088798377522, 0.3744204932654418])

julia> @btime sum(\$x.x)
1.918 ns (0 allocations: 0 bytes)
1.1464693059503186

``````

The core of this looks to be the triple loop at the end. You have many implementations, and these have very different performance, perhaps it’s worth gathering them. But what it’s working out is matrix multiplication, and calling the library function is much faster than all these other ways. Making the problem slightly smaller so as not to wait too long:

``````julia> sC = rand(670, 1801);  # I've shrunk 3801 => 1801 to test these in reasonable time

julia> K_Fx = rand(1801, 1801);

julia> using BenchmarkTools

julia> c1 = @btime core1(\$sC, \$K_Fx);  # lots of slices
16.307 s (6033352 allocations: 82.00 GiB)

julia> c2 = @btime core2(\$sC, \$K_Fx);  # simple triple loop
7.585 s (2 allocations: 9.21 MiB)

julia> c3 = @btime core3(\$sC, \$K_Fx);  # matrix-vector loop
854.113 ms (2012 allocations: 37.18 MiB)

julia> c4 = @btime core4(\$sC, \$K_Fx);  # mapslices
785.864 ms (5202 allocations: 27.97 MiB)

julia> c6 = @btime core6(\$sC, \$K_Fx);  # matrix-matrix, one BLAS call
27.140 ms (2 allocations: 9.21 MiB)

# ===== with these definitions: =====

function core1(sC, K_Fx, K_dt = 0.1)
nr, K_ntc = size(sC)
conv = similar(sC)
for t in 1:K_ntc  # K.ntc = 3801
for q in 1:nr  # nr = 670
conv[q, t] = sum(collect(sC[q, 1:K_ntc]) .* collect(K_Fx[t, :]))*K_dt  # usiing sum function method
end
end
conv
end

function core2(sC, K_Fx, K_dt = 0.1)  # version in first message
nr, K_ntc = size(sC)
conv = similar(sC)
for t in 1:K_ntc  # K.ntc = 3801
for q in 1:nr  # nr = 670
for j in 1:K_ntc
conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt  # convolute with kernel
end
end
end
conv
end

function core3(sC, K_Fx, K_dt = 0.1)
nr, K_ntc = size(sC)
conv = similar(sC)
for q=1:nr
conv[q, :] = (K_Fx * sC[q, 1:end]) .* K_dt
end
conv
end

function core4(sC, K_Fx, K_dt = 0.1)  # adapted
conv = mapslices(x -> K_Fx * x * K_dt, sC; dims=2)
end

function core5(sC, K_Fx, K_dt = 0.1)  # same as 2, with threading + @simd @inbounds
nr, K_ntc = size(sC)
for q in 1:nr  # nr = 670
@simd for j in 1:K_ntc
@inbounds conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt
end
end
end
conv
end

function core6(sC, K_Fx, K_dt = 0.1)
conv = sC * K_Fx'
conv .*= K_dt
end
``````

Since 7.585 / 0.027 == 281, you might expect the full-size problem to be reduced from 368 seconds to 1.3. And if Matlab is in this range, then probably it’s also writing one matrix multiplication. (Possibly even calling the exact same BLAS library!) Maybe the other parts of the program will now also take an appreciable part of the time.

However, there might still be faster ways to do this. Convolutions can be written using fourier transforms, and I think FFT algorithms scale better with size. Maybe these are packaged up nicely somewhere, maybe DSP.jl is where, not sure.

Lots of good code tips above. Mine would be to delete all the `struct` definitions and just make named tuples to pass groups of parameters etc. around, at least at this stage. You could delete all types except perhaps `generate_kernel(fwhm::Real, time::AbstractVector, flag::Bool)` as reminders to your future self (and to produce less confusing errors).

7 Likes

Yeah, when I time other parts of the code using the time macro it becomes clear the issue is the final loop at the end. I realise I can do this without the structs, this was the first proper programme I have written in Julia and have probably used them where they are not needed. I tried to implement `core2()` and `core3()` previously in my code, but it was really slow. As many have suggested, I guess this is to do with my unstable types. Wrapping the ending loop in its own function also makes a massive difference. One thing I do not really understand is why the `sum()` function is so slow.

I am unsure what you are running on but these functions perform significantly slower for me (running 1.6GHz dual-core i5, 8GB RAM, using Julia 1.6.1). I also fail to see where BLAS is called, I see no `LinearAlgebra.BLAS.`, am I missing something obvious?

``````c2 = @btime core2(\$sC, \$K_Fx);
32.369 s (2 allocations: 9.21 MiB)

c3 = @btime core3(\$sC, \$K_Fx);
1.249 s (2012 allocations: 37.18 MiB)

c4 = @btime core4(\$sC, \$K_Fx);
1.259 s (5202 allocations: 27.97 MiB)

c6 = @btime core6(\$sC, \$K_Fx);
101.725 ms (2 allocations: 9.21 MiB)
``````

I am familiar with the FFT algorithm, although have had some trouble with complex results so far. I will check out DSP for future use, I would just like to code these convolutions myself first.

Thanks for taking the time to benchmark and breakdown these functions, it has been really helpful!

but it was really slow. As many have suggested, I guess this is to do with my unstable types. Wrapping the ending loop in its own function also makes a massive difference.

Yes. It’s totally fine to write type-unstable code which will run once, or when you don’t care. So long as (as you say) the bit which needs to be fast is walled off in its own function.

is why the `sum()` function is so slow.

Each thing like `K_Fx[t, :]` makes a new small array, which the garbage collector will have to clean up. And `.*` makes another. This is what “6033352 allocations: 82.00 GiB” is warning about. You could avoid both by writing `@views dot(sC[q, 1:K_ntc], K_Fx[t, :])`, but I think it still won’t be very quick.

I am unsure what you are running on but these functions perform significantly slower for me

This is on an M1 mac. (I also skipped timing the threaded `core5` because threading seems broken, natively…) But anyway the main point is the relative speeds, which are similar.

fail to see where BLAS is called

Usually `*` ends up calling this, when it can (i.e. when it is given simple arrays of floats). In `core4` the `K_Fx * x` is also I think calling a matrix * vector routine, but fusing all of these into one matrix * matrix allows them to be optimised further.

2 Likes

Yeah @views made some difference, but not a great amount. If `.*` makes a new copy of the array each time then I guess it is better to define a function and apply that to whatever array using dot notation?

So BLAS is called automatically for some specific types of function call.

One thing I have just realised, is that having a Matrix of type Circulant is slowing down my calculations…

``````julia> using SpecialMatrices

julia> using BenchmarkTools

julia> sC = rand(670, 3801);

julia> K_Fx = rand(3801, 3801);

julia> function core6(sC, K_Fx, K_dt = 0.5)
conv = sC * K_Fx'
conv .*= K_dt
end
core6 (generic function with 2 methods)

julia> c6 = @btime core6(\$sC, \$K_Fx);
268.070 ms (2 allocations: 19.43 MiB)

julia> K = rand(1, 3801);

julia> K = K[:];

julia> K_Fx = Circulant(K);

julia> size(K_Fx)
(3801, 3801)

julia> c6 = @btime core6(\$sC, \$K_Fx);
13.670 s (8 allocations: 19.43 MiB)

``````

Perhaps it is better to write my own function to calculate a Circulant Matrix and return it as a type Matrix{Float64}. Or just write a function to restrict the Circulant Matrix type returned by the SpecialMatrices module so that it is always type Matrix{Float64}.

It is not the broadcast that copies, is the slice:

``````julia> x = [1,2,3]; y = [1,2,3,4,5,6]; z = zeros(3);

julia> @btime \$z .= \$x .* \$y[1:3]
31.560 ns (1 allocation: 112 bytes)
3-element Vector{Float64}:
1.0
4.0
9.0

julia> @btime \$z .= \$x .* @view(\$y[1:3])
17.578 ns (0 allocations: 0 bytes)
3-element Vector{Float64}:
1.0
4.0
9.0

julia> @btime @views(\$z .= \$x .* \$y[1:3])
17.610 ns (0 allocations: 0 bytes)
3-element Vector{Float64}:
1.0
4.0
9.0

``````
1 Like

Oh I missed this, reading from the bottom. Any weird wrapper will mean `*` can’t use BLAS.

But it’s also structure you can perhaps exploit. Here SpecialMatrices.jl/circulant.jl at master · JuliaMatrices/SpecialMatrices.jl · GitHub it seems to have special methods for `*` with a vector. By writing `fft(x,1)` etc. you could probably extend that to `*` with a matrix. I see Remove Toeplitz, Hankel, and circulant matrices by devmotion · Pull Request #40 · JuliaMatrices/SpecialMatrices.jl · GitHub points towards ToeplitzMatrices.jl which may have a more complete version?

I played around with this a bit but have just fixed it to type Matrix. I have rewritten my code taking on board the suggestions here, it is now type stable, does not create numerous copies of matrices and there are no unneeded structs. The whole code now executes in 172 ms - a big improvement, and even faster than MATLAB!

3 Likes

So, now the question is how to make your MATLAB code fast like Julia? Good luck 9 Likes
2 Likes