How to make my Julia code fast like MATLAB?

Kyle7 · May 23, 2021, 12:17pm

Hi All, I am new to Julia and really struggling to understand what is making my code which convolutes a signal with a Gaussian or Lorentzian so slow. I have the equivilent MATLAB code that runs in 4 seconds when vectorised properly. In Julia, it takes:

368.047741 seconds (19.37 G allocations: 288.664 GiB, 14.30% gc time, 0.03% compilation time)

This is regardless of me using a triple nested for loop to calculate the sum, a double nested loop and the inbuilt sum() function, or replacing the nested loops with a multiplication of the row of every signal with the circulant kernel matrix - either using * or mul!() are both slow. It is probably worth mentioning that the signal array is size (3801, 670) after padding the orginal signal and the kernel is size (3801, 3801).

The code is as follows, please feel free to tear into it…

using MAT
using SpecialMatrices

wd = "/path/to/files"

inputfile = matopen(wd*"/AllTraj_Signal.mat")
signal = read(inputfile, "pdW")
close(inputfile)
replace!(signal, NaN=>0)
signal = signal[2:671, 1:2001]
signal[:, 1] = signal[:, 2]

inputfile = matopen(wd*"/Vectors.mat")
qAng = read(inputfile, "qAng")
tvec = read(inputfile, "tt")


struct Gaussian
    Fx:: Array{Float64}
    sigma:: Float64
    height:: Float64
    fwhm:: Float64
    tconv:: StepRangeLen{Float64}
    ntc:: Int64
    duration:: Float64
    dt:: Float64
    tmin :: Float64
    tmax:: Float64
    nt:: Int64
    function Gaussian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64, tmin:: Float64, tmax::Float64, nt:: Int64)
        duration = fwhm*3
        sigma = fwhm/2sqrt(2log(2))
        height = 1/sigma*sqrt(2pi)
        Fx = height .* exp.( (-1 .* (x .- x0).^2)./ (2*sigma^2))
        Fx = Circulant(Fx)
        tconv = x; ntc = length(x)
        dt = tconv[2]-tconv[1]
        nt = nt
        new(Fx, sigma, height, fwhm, tconv, ntc, duration, dt, tmin, tmax, nt)
    end
end

struct Lorentzian
    Fx:: Array{Float64}
    gamma:: Float64
    fwhm:: Float64
    tconv:: StepRangeLen{Float64}
    ntc:: Int64
    duration:: Float64
    dt:: Float64
    tmin :: Float64
    tmax:: Float64
    nt:: Int64
    function Lorentzian(fwhm:: Float64, x:: StepRangeLen{Float64}, x0:: Float64, tmin::Float64, tmax:: Float64, nt:: Int64)
        gamma = fwhm/2
        duration = fwhm*3
        Fx = 1 ./ ((pi * gamma) .* (1 .+ ((x .- x0)./ gamma).^2 ))
        Fx = Circulant(Fx)
        tconv = x; ntc = length(x)
        dt = tconv[2]-tconv[1]
        nt = nt
        new(Fx, gamma, fwhm, tconv, ntc, duration, dt, tmin, tmax, nt)
    end
end


function generate_kernel(fwhm:: Float64, time:: StepRangeLen{Float64}, Flag:: Int)

    dt:: Float64 = sum(diff(time, dims=1))/(length(time)-1)
    nt:: Int64 = length(time)
    duration:: Float64 = fwhm*3 # must pad three time FHWM to deal with edges
    if dt != diff(time, dims=1)[1] ; error("Non-linear time range."); end

    tmin:: Float64 = minimum(time); tmax:: Float64 = maximum(time)
    tconv_min:: Float64 = tmin - duration ; tconv_max:: Float64 = tmax + duration
    tconv = tconv_min:dt:tconv_max # extended convoluted time vector

    if Flag == 0
        K = Gaussian(fwhm, tconv, tconv[1], tmin, tmax, nt)
    elseif Flag == 1
        K = Lorentzian(fwhm, tconv, tconv[1], tmin, tmax, nt)
    else
        error("Only Gaussian and Lorentzian functions currently available.")
    end
    return K
    end

function convolve(S:: Array{Float64}, K:: T) where {T<:Union{Gaussian, Lorentzian}}

    nr:: Int64, nc:: Int64 = size(S)
    if nr != K.nt && nc != K.nt; error("Inconsistent dimensions."); end
    if nc != K.nt; S = copy(transpose(S)); nr, nc = nc, nr; end # column always time axis

    padend:: Int64 = (K.duration/K.dt) # index at which 0 padding ends and signal starts
    padstart:: Int64 = (padend + K.tmax / K.dt) # index where signal ends and padding starts

    println("Convolving signal with a $(typeof(K)) function with FWHM: $(K.fwhm) fs.")

    sC = zeros(Float64, nr, K.ntc)
    sC[1:nr, 1:padend] .= S[1:nr, 1] # extended and pad signal
    sC[1:nr, padend:padstart] .= S
    sC[1:nr, padstart:end] .= S[1:nr, end]
    conv = zeros(Float64, nr, K.ntc)

    for t in 1:K.ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            #conv[q, t] = sum(collect(sC[q, 1:K.ntc]) .* collect(K.Fx[t, :]))*K.dt  # usiing sum function method
            for j in 1:K.ntc
                conv[q, t] += sC[q, j] * K.Fx[t, j] * K.dt  # convolute with kernel
            end
        end
    end

    return conv
end

fwhm = 100.0
tvec = 0:0.5:1000
kernel = generate_kernel(fwhm, tvec, 0)
conv = convolve(signal, kernel)

I have tried to define all my types and split up the generation of the kernel into two types and keep the actual convolution seperate from this, but it has not helped speed wise. Any tips or insights would be greatly appreciated!

cgeoga · May 23, 2021, 1:21pm

A couple notes. For one, you are creating a nice circulant matrix, but Fx::Array{Float64} is not only throwing away that nice structure I think by converting to an Array, but it is actually an abstract type, so that’s not ideal.

Also, while I could be missing something, I think if you actually kept the structure of the circulant matrix, just doing struct.Fx*... would be very fast and effective because then you’d actually be using the structure of the circulant matrix.

EDIT: Oh, and one more thing—maybe there’s some reason you’re doing it this way, but DSP.jl I think provides some nicely tuned and generic convolution functions and stuff, so there are plenty of very convenient ways to do this without ever building a matrix or anything.

sostock · May 23, 2021, 1:21pm

Since I don’t have your input files, I can obviously not run the code (it would be better to submit a MWE that does not need those files), but I will try to give some suggestions anyway.

Most of the type annotations here do not affect performance at all. The only place where you need the types is in the declaration of the struct fields, and in that case it is crucial for performance that they are concrete types. Some of the fields in your structs are abstractly typed (Array{Float64} and StepRangeLen{Float64} are not concrete). If you cannot specify them completely, you can use a parametric struct, e.g.:

struct Gaussian{T<:Circulant, S<:StepRangeLen{Float64}}
    Fx::T
    sigma::Float64
    height::Float64
    fwhm::Float64
    tconv::S
    ntc::Int64
    duration::Float64
    dt::Float64
    tmin::Float64
    tmax::Float64
    nt::Int64
end

I didn’t look at the rest of your code yet, this is just the first thing I noticed.

Edit: I didn’t notice at first that you are actually trying to use a special matrix type. In this case, you should of course use Circulant instead of Array in your struct.

Raf · May 23, 2021, 1:23pm

From a cursory look your objects are not type-stable. Array{Float634} is not a concrete type.

Either use Array{Float64,2} or similar, or just use a type parameter everywhere like A and let them be anything, and you wont have this problem again. Probably the same advice applies to other fields of your objects too, and maybe some code. If the types in hot code are not stable, the performance will not be as good as Matlab.

Kyle7 · May 23, 2021, 2:13pm

Thanks! So replacing Fx:: Array{Float64} with Fx:: Circulant{Float64}? When I am defining structs it is a good idea to be as specific as possible regarding the type then, instead of using Array I should use Matrix, or Vector depending on the situation? I have wrapped the nested for loop in its own function too, this speeds the calculation up to about 60 seconds.

I wanted to do it this way more for my own satisfaction, I like to try understanding a concept by coding it myself first. I shall look into DSP, I imagine it is a lot fast than my code…

Kyle7 · May 23, 2021, 2:17pm

Ok great, thanks for the tips! Why would defining the struct as,

struct Gaussian{T<:Circulant, S<:StepRangeLen{Float64}}
    Fx::T
    sigma::Float64
    height::Float64
    fwhm::Float64
    tconv::S
    ntc::Int64
    duration::Float64
    dt::Float64
    tmin::Float64
    tmax::Float64
    nt::Int64
end

be better than,

struct Gaussian
    Fx::Circulant
    sigma::Float64
    height::Float64
    fwhm::Float64
    tconv::StepRangeLen{Float64}
    ntc::Int64
    duration::Float64
    dt::Float64
    tmin::Float64
    tmax::Float64
    nt::Int64
end

The former gave me the following error,
ERROR: LoadError: syntax: too few type parameters specified in "new{...}"

cgeoga · May 23, 2021, 2:28pm

I think the most “julian” advice would be to try and write your structs as specifically as possible while still leaving it as generic/flexible as possible. Which sounds sort of contradictory, but as @sostock suggested, what’s nice about something like

struct Gaussian{F}
  Fx :: F
  ....

which is a very common pattern in the language, is that if you pass in a special struct that has neat methods for accelerated Base.:*, then the Gaussian struct will preserve that special object and all of your code that uses Fx will benefit. But if for some reason I need to pass in an Fx that isn’t a circulant matrix or is somehow incompatible with some part, for example being a type that doesn’t work with the required fft under the hood of circulant matrix implementations, then the resulting Gaussian{Matrix{WeirdType}} will still be nice and type-stable, so you won’t actually have to change any code.

AndiMD · May 23, 2021, 3:00pm

You could try to switch the order of your q and j loops. Julia is column-major, so the innermost loop should be the first index.
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-column-major

Kyle7 · May 23, 2021, 3:46pm

Ok, I only tried to do this in a nested loop as the matrix multiplication I was using was very slow. Instead of the nested loop I wanted to multiply a signal vector by the circulant matrix. This is what takes 4 seconds in MATLAB.

for q=1:nr
    conv[q, :] = (K.Fx * sC[q, 1:end]) .* K.dt
end

Or with no loop,

conv = mapslices(x -> K*x, sC; dims=2)

mul is also very slow at this. Perhaps I would need to directly use BLAS routines to get comparable speed of matrix multiplication to MATLAB.

lmiq · May 23, 2021, 4:48pm

I am not following the thread in detail, but this in Julia creates a new array at every iteration of the loop. Use @view(sC[q,1:end]) or more simply @view(sC[q,:]).

You can also put @views in front of the function definition (@views function...) and then all slices of arrays in the function will be treated as views (not copies).

edit: your Julia code does not have that. sorry

lmiq · May 23, 2021, 4:56pm

The thing here is that perhaps Circulant has parameters. And if it has parameters, the struct is accepting all types of Circulant, which makes of it type-unstable, and the compiled functions cannot be specialized to a single type. (I do not know what Circulant is, but this is a general concept).

If you use

struct Gaussian{T}
   Fx::T

When you instantiate your variable of Gaussiant type it will specialize to a specific type of Circulant used in the specialization.

It is analogous to this, with vectors:

julia> struct A
         x::Vector
       end

julia> struct B{T}
         x::T
       end

julia> x = A(rand(Float64,3))
A([0.22087576765549888, 0.458109021965067, 0.8571115429656655])

julia> typeof(x)
A

julia> x = B(rand(Float64,3))
B{Vector{Float64}}([0.7203620220131972, 0.6465793604365255, 0.6881096049371642])

julia> typeof(x)
B{Vector{Float64}}

Note that type B carries the signature of the type of vector it contains, but A does not. That makes all the difference to the compiler, to optimize your code.

For example:

julia> x = A(rand(Float64,3))
A([0.7582620326245211, 0.3436082850830309, 0.10922331285448794])

julia> @btime sum($x.x)
  13.956 ns (1 allocation: 16 bytes)
1.21109363056204

julia> x = B(rand(Float64,3))
B{Vector{Float64}}([0.14043993284712464, 0.6316088798377522, 0.3744204932654418])

julia> @btime sum($x.x)
  1.918 ns (0 allocations: 0 bytes)
1.1464693059503186

mcabbott · May 23, 2021, 6:18pm

The core of this looks to be the triple loop at the end. You have many implementations, and these have very different performance, perhaps it’s worth gathering them. But what it’s working out is matrix multiplication, and calling the library function is much faster than all these other ways. Making the problem slightly smaller so as not to wait too long:

julia> sC = rand(670, 1801);  # I've shrunk 3801 => 1801 to test these in reasonable time

julia> K_Fx = rand(1801, 1801);

julia> using BenchmarkTools

julia> c1 = @btime core1($sC, $K_Fx);  # lots of slices
  16.307 s (6033352 allocations: 82.00 GiB)

julia> c2 = @btime core2($sC, $K_Fx);  # simple triple loop
  7.585 s (2 allocations: 9.21 MiB)

julia> c3 = @btime core3($sC, $K_Fx);  # matrix-vector loop
  854.113 ms (2012 allocations: 37.18 MiB)

julia> c4 = @btime core4($sC, $K_Fx);  # mapslices
  785.864 ms (5202 allocations: 27.97 MiB)

julia> c6 = @btime core6($sC, $K_Fx);  # matrix-matrix, one BLAS call
  27.140 ms (2 allocations: 9.21 MiB)

# ===== with these definitions: =====

function core1(sC, K_Fx, K_dt = 0.1)
    nr, K_ntc = size(sC)
    conv = similar(sC)
    for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            conv[q, t] = sum(collect(sC[q, 1:K_ntc]) .* collect(K_Fx[t, :]))*K_dt  # usiing sum function method
        end
    end
    conv
end

function core2(sC, K_Fx, K_dt = 0.1)  # version in first message
    nr, K_ntc = size(sC)
    conv = similar(sC)
    for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            for j in 1:K_ntc
                conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt  # convolute with kernel
            end
        end
    end
    conv
end

function core3(sC, K_Fx, K_dt = 0.1)
    nr, K_ntc = size(sC)
    conv = similar(sC)
    for q=1:nr
        conv[q, :] = (K_Fx * sC[q, 1:end]) .* K_dt
    end
    conv
end

function core4(sC, K_Fx, K_dt = 0.1)  # adapted
    conv = mapslices(x -> K_Fx * x * K_dt, sC; dims=2)
end

function core5(sC, K_Fx, K_dt = 0.1)  # same as 2, with threading + @simd @inbounds
    nr, K_ntc = size(sC)
    Threads.@threads for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            @simd for j in 1:K_ntc
                @inbounds conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt 
            end
        end
    end
    conv
end

function core6(sC, K_Fx, K_dt = 0.1)
    conv = sC * K_Fx'
    conv .*= K_dt
end

Since 7.585 / 0.027 == 281, you might expect the full-size problem to be reduced from 368 seconds to 1.3. And if Matlab is in this range, then probably it’s also writing one matrix multiplication. (Possibly even calling the exact same BLAS library!) Maybe the other parts of the program will now also take an appreciable part of the time.

However, there might still be faster ways to do this. Convolutions can be written using fourier transforms, and I think FFT algorithms scale better with size. Maybe these are packaged up nicely somewhere, maybe DSP.jl is where, not sure.

Lots of good code tips above. Mine would be to delete all the struct definitions and just make named tuples to pass groups of parameters etc. around, at least at this stage. You could delete all types except perhaps generate_kernel(fwhm::Real, time::AbstractVector, flag::Bool) as reminders to your future self (and to produce less confusing errors).

Kyle7 · May 23, 2021, 7:24pm

Yeah, when I time other parts of the code using the time macro it becomes clear the issue is the final loop at the end. I realise I can do this without the structs, this was the first proper programme I have written in Julia and have probably used them where they are not needed. I tried to implement core2() and core3() previously in my code, but it was really slow. As many have suggested, I guess this is to do with my unstable types. Wrapping the ending loop in its own function also makes a massive difference. One thing I do not really understand is why the sum() function is so slow.

I am unsure what you are running on but these functions perform significantly slower for me (running 1.6GHz dual-core i5, 8GB RAM, using Julia 1.6.1). I also fail to see where BLAS is called, I see no LinearAlgebra.BLAS., am I missing something obvious?

c2 = @btime core2($sC, $K_Fx);
32.369 s (2 allocations: 9.21 MiB)

c3 = @btime core3($sC, $K_Fx);
1.249 s (2012 allocations: 37.18 MiB)

c4 = @btime core4($sC, $K_Fx);
1.259 s (5202 allocations: 27.97 MiB)

c6 = @btime core6($sC, $K_Fx);
101.725 ms (2 allocations: 9.21 MiB)

I am familiar with the FFT algorithm, although have had some trouble with complex results so far. I will check out DSP for future use, I would just like to code these convolutions myself first.

Thanks for taking the time to benchmark and breakdown these functions, it has been really helpful!

mcabbott · May 23, 2021, 7:40pm

but it was really slow. As many have suggested, I guess this is to do with my unstable types. Wrapping the ending loop in its own function also makes a massive difference.

Yes. It’s totally fine to write type-unstable code which will run once, or when you don’t care. So long as (as you say) the bit which needs to be fast is walled off in its own function.

is why the sum() function is so slow.

Each thing like K_Fx[t, :] makes a new small array, which the garbage collector will have to clean up. And .* makes another. This is what “6033352 allocations: 82.00 GiB” is warning about. You could avoid both by writing @views dot(sC[q, 1:K_ntc], K_Fx[t, :]), but I think it still won’t be very quick.

I am unsure what you are running on but these functions perform significantly slower for me

This is on an M1 mac. (I also skipped timing the threaded core5 because threading seems broken, natively…) But anyway the main point is the relative speeds, which are similar.

fail to see where BLAS is called

Usually * ends up calling this, when it can (i.e. when it is given simple arrays of floats). In core4 the K_Fx * x is also I think calling a matrix * vector routine, but fusing all of these into one matrix * matrix allows them to be optimised further.

Kyle7 · May 23, 2021, 9:00pm

Yeah @views made some difference, but not a great amount. If .* makes a new copy of the array each time then I guess it is better to define a function and apply that to whatever array using dot notation?

So BLAS is called automatically for some specific types of function call.

One thing I have just realised, is that having a Matrix of type Circulant is slowing down my calculations…

julia> using SpecialMatrices

julia> using BenchmarkTools

julia> sC = rand(670, 3801);

julia> K_Fx = rand(3801, 3801);

julia> function core6(sC, K_Fx, K_dt = 0.5)
           conv = sC * K_Fx'
           conv .*= K_dt
       end
core6 (generic function with 2 methods)

julia> c6 = @btime core6($sC, $K_Fx);
  268.070 ms (2 allocations: 19.43 MiB)

julia> K = rand(1, 3801);

julia> K = K[:];

julia> K_Fx = Circulant(K);

julia> size(K_Fx)
(3801, 3801)

julia> c6 = @btime core6($sC, $K_Fx);
  13.670 s (8 allocations: 19.43 MiB)

Perhaps it is better to write my own function to calculate a Circulant Matrix and return it as a type Matrix{Float64}. Or just write a function to restrict the Circulant Matrix type returned by the SpecialMatrices module so that it is always type Matrix{Float64}.

lmiq · May 23, 2021, 9:08pm

It is not the broadcast that copies, is the slice:

julia> x = [1,2,3]; y = [1,2,3,4,5,6]; z = zeros(3);

julia> @btime $z .= $x .* $y[1:3]
  31.560 ns (1 allocation: 112 bytes)
3-element Vector{Float64}:
 1.0
 4.0
 9.0

julia> @btime $z .= $x .* @view($y[1:3])
  17.578 ns (0 allocations: 0 bytes)
3-element Vector{Float64}:
 1.0
 4.0
 9.0

julia> @btime @views($z .= $x .* $y[1:3])
  17.610 ns (0 allocations: 0 bytes)
3-element Vector{Float64}:
 1.0
 4.0
 9.0

mcabbott · May 23, 2021, 9:20pm

Oh I missed this, reading from the bottom. Any weird wrapper will mean * can’t use BLAS.

But it’s also structure you can perhaps exploit. Here https://github.com/JuliaMatrices/SpecialMatrices.jl/blob/master/src/circulant.jl#L26 it seems to have special methods for * with a vector. By writing fft(x,1) etc. you could probably extend that to * with a matrix. I see Remove Toeplitz, Hankel, and circulant matrices by devmotion · Pull Request #40 · JuliaLinearAlgebra/SpecialMatrices.jl · GitHub points towards ToeplitzMatrices.jl which may have a more complete version?

Kyle7 · May 24, 2021, 5:41pm

I played around with this a bit but have just fixed it to type Matrix. I have rewritten my code taking on board the suggestions here, it is now type stable, does not create numerous copies of matrices and there are no unneeded structs. The whole code now executes in 172 ms - a big improvement, and even faster than MATLAB!

tbeason · May 24, 2021, 5:43pm

So, now the question is how to make your MATLAB code fast like Julia? Good luck

carstenbauer · May 25, 2021, 7:03am

xref: for loop - Julia outperformed by MATLAB? - Stack Overflow

Topic		Replies	Views
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8335	March 5, 2017
My julia code is somehow much slower than the matlab code New to Julia question , performance , matlab	55	4207	December 30, 2022
Sparse matrix-vector product: much more slow than Matlab Performance matlab , optimization	24	4517	December 20, 2017
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1709	June 25, 2023
Matlab versus Julia General Usage	33	4786	July 15, 2021

How to make my Julia code fast like MATLAB?

Related topics