How to make my Julia code as fast as MATLAB?

The core of this looks to be the triple loop at the end. You have many implementations with very different performance, so it's worth gathering them here. But what they all compute is a matrix multiplication, and calling the library function for that is much faster than any of the other ways. Making the problem slightly smaller so as not to wait too long:

julia> sC = rand(670, 1801);  # I've shrunk 3801 => 1801 to test these in reasonable time

julia> K_Fx = rand(1801, 1801);

julia> using BenchmarkTools

julia> c1 = @btime core1($sC, $K_Fx);  # lots of slices
  16.307 s (6033352 allocations: 82.00 GiB)

julia> c2 = @btime core2($sC, $K_Fx);  # simple triple loop
  7.585 s (2 allocations: 9.21 MiB)

julia> c3 = @btime core3($sC, $K_Fx);  # matrix-vector loop
  854.113 ms (2012 allocations: 37.18 MiB)

julia> c4 = @btime core4($sC, $K_Fx);  # mapslices
  785.864 ms (5202 allocations: 27.97 MiB)

julia> c6 = @btime core6($sC, $K_Fx);  # matrix-matrix, one BLAS call
  27.140 ms (2 allocations: 9.21 MiB)
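julia> c1 ≈ c2 ≈ c3 ≈ c4 ≈ c6  # sanity check (my addition): expect true, each computes (sC * K_Fx') .* K_dt, up to floating-point roundoff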

# ===== with these definitions: =====

function core1(sC, K_Fx, K_dt = 0.1)
    nr, K_ntc = size(sC)
    conv = similar(sC)
    for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            conv[q, t] = sum(collect(sC[q, 1:K_ntc]) .* collect(K_Fx[t, :]))*K_dt  # using the sum function; each slice and collect allocates a copy
        end
    end
    conv
end

function core2(sC, K_Fx, K_dt = 0.1)  # version in first message
    nr, K_ntc = size(sC)
    conv = zeros(eltype(sC), nr, K_ntc)  # must start at zero: the loop below accumulates with +=
    for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            for j in 1:K_ntc
                conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt  # convolute with kernel
            end
        end
    end
    conv
end

function core3(sC, K_Fx, K_dt = 0.1)
    nr, K_ntc = size(sC)
    conv = similar(sC)
    for q in 1:nr
        conv[q, :] = (K_Fx * sC[q, :]) .* K_dt  # one matrix-vector multiply per row
    end
    conv
end

function core4(sC, K_Fx, K_dt = 0.1)  # adapted
    mapslices(x -> K_Fx * x * K_dt, sC; dims=2)  # applies the matrix-vector multiply to each row
end

function core5(sC, K_Fx, K_dt = 0.1)  # same as 2, with threading + @simd @inbounds
    nr, K_ntc = size(sC)
    conv = zeros(eltype(sC), nr, K_ntc)  # this line was missing: += needs zero-initialised storage
    Threads.@threads for t in 1:K_ntc  # K.ntc = 3801
        for q in 1:nr  # nr = 670
            @simd for j in 1:K_ntc
                @inbounds conv[q, t] += sC[q, j] * K_Fx[t, j] * K_dt 
            end
        end
    end
    conv
end

function core6(sC, K_Fx, K_dt = 0.1)
    conv = sC * K_Fx'  # one BLAS matrix-matrix call: conv[q, t] = Σⱼ sC[q, j] * K_Fx[t, j]
    conv .*= K_dt
end
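If the 9 MiB allocated by core6 ever matters (say, when it's called inside a larger loop), here's a sketch of an in-place variant; core6! is my name for it, using LinearAlgebra.mul! to write into pre-allocated storage:

using LinearAlgebra

function core6!(conv, sC, K_Fx, K_dt = 0.1)  # hypothetical in-place variant, not benchmarked above
    mul!(conv, sC, transpose(K_Fx))  # the same single BLAS call, but into the existing array
    conv .*= K_dt
end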

Since 7.585 / 0.027 ≈ 281, you might expect the full-size problem to drop from 368 seconds to about 1.3. And if MATLAB is in this range, then it's probably also doing this as one matrix multiplication (possibly even calling the exact same BLAS library!). The other parts of the program may now take an appreciable share of the time.

However, there might still be faster ways to do this. Convolutions can be written using Fourier transforms, and FFT-based convolution scales as O(n log n) rather than O(n²). Maybe these are packaged up nicely somewhere; DSP.jl may be the place, not sure.
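For instance, if K_Fx is really a convolution matrix, i.e. K_Fx[t, j] depends only on t - j and is built from some kernel vector k, then each row of the result is a 1-D convolution, and DSP.conv (which switches to FFT-based methods for long inputs) could replace the matrix multiply. A rough sketch under that assumption; core7, the kernel k, and the index alignment are mine, untested:

using DSP  # provides conv, the full discrete convolution

function core7(sC, k, K_dt = 0.1)  # assumes K_Fx[t, j] == k[t - j + 1] (zero outside range)
    nr, K_ntc = size(sC)
    out = similar(sC)
    for q in 1:nr
        full = DSP.conv(sC[q, :], k)              # length K_ntc + length(k) - 1
        out[q, :] .= view(full, 1:K_ntc) .* K_dt  # keep the first K_ntc samples
    end
    out
end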

Lots of good code tips above. Mine would be to delete all the struct definitions and just make named tuples to pass groups of parameters around, at least at this stage. You could delete all type annotations except perhaps generate_kernel(fwhm::Real, time::AbstractVector, flag::Bool), as reminders to your future self (and to produce less confusing errors).
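For example (the field names here are invented, chosen to match the K.ntc and K_dt usage above), a parameter struct can simply become

K = (dt = 0.1, ntc = 3801, Fx = rand(3801, 3801))  # access as K.dt, K.ntc, K.Fx, just like a struct

and any function that only reads fields works unchanged.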
