Porting code from MATLAB - performance tips

Hello! I am an undergrad porting code from MATLAB into Julia with the objective of improving performance, but so far performance has been disappointing: Julia takes 58 seconds/instance versus MATLAB's 18 seconds/instance (thousands of simulations are being run, so the time adds up to hours).

I’ve read the performance tips and implemented everything obviously applicable: avoiding global variables, annotating all types, pre-allocating memory, and moving code into small functions. All of this did speed the program up, but only to the current 58 seconds/instance.

For reference, the best time I can achieve for the simulation is ~8 seconds/instance with a (very optimized) Rust program, which to me implies that Julia is going to struggle to improve on the MATLAB code, given its already strong performance.

The code consists of multiplying lots of slices of arrays. Below is an example of one of the 'hottest' functions. Is there a better, more efficient way? (Julia also seems to consume similar amounts of memory to MATLAB, but way more - orders of magnitude more - than Rust.)

function A(b::Matrix{Float64}, x::Float64, v::Float64, c::Matrix{Float64}, d::Matrix{Float64})::Terms
    @fastmath @inbounds var_1 = v*x*(c[1,3]*b[1,:] - d[1,3]*b[3,:])
    @fastmath @inbounds var_2 = v*x*(c[3,7]*b[3,:] - d[3,7]*b[7,:])

    ...

    Terms(var_1, var_2 ...
end

(For those of you wondering why I deleted the earlier question of the same type: I accidentally posted it before I was done writing it up - I probably should have just edited it.)

Three remarks about your code snippet:

  • The type annotations on the function arguments have no impact on performance (but can be useful for dispatch or documentation).
  • Slices like b[1,:] return (allocated) copies; you should use views instead (@views) - see the sketch after this list.
  • It is not a complete MWE.
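
To make the copy-versus-view distinction concrete, a minimal sketch (b here is just a stand-in array):

b = rand(8, 8)
r = b[1, :]           # allocates: copies the row into a new Vector
rv = @view b[1, :]    # no copy: a lightweight view into b's memory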
5 Likes

Something else to consider is that such a view isn’t contiguous, since Arrays are strictly column-major. (Just in case you were wondering: the undocumented Base.iscontiguous doesn’t seem intended to check AbstractArray iteration contiguity - it doesn’t line up with adjoint arrays, for example.) Depending on the element type, platform, and length, copying into a contiguous vector could end up saving time if it is reused enough (less garbage). I wonder how MATLAB handles this tradeoff.

It might be possible to make row-views contiguous via a Transpose matrix (which wraps an Array)? I'm not experienced enough to know how to set one up smoothly, especially since the inputs are unknown, or whether the performance is the same as working with Arrays directly.
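
A minimal sketch of that idea, assuming the data can be reordered once at construction time (permutedims physically rearranges the data, unlike the lazy transpose):

A = rand(3, 1000)      # rows of A are strided in memory (Arrays are column-major)
At = permutedims(A)    # 1000×3 Array with the data physically reordered
r = @view At[:, 1]     # contiguous view holding the same values as A[1, :]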

Very likely a lot more going on in the code, a couple expressions is not much to go on.

2 Likes

Very likely a lot more going on in the code, a couple expressions is not much to go on.

Undoubtedly true, and a very reasonable point - but I can’t copy and paste a thousand or so lines of code into a forum!

I’ll look into making row-views contiguous from a Transpose matrix - thanks for the idea. I’m willing to experiment, so I’ll see if that helps.

The sizes of all arrays are known at compile time - and I’m hoping I can better leverage this knowledge.

Just as this topic is admittedly too broad, I wouldn’t narrow down strategies this quickly. “Performant contiguous view of a row of a Matrix” sounds like a good, focused topic for a thread here; I’d expect that some people have dealt with this problem before and settled on a practice. More detail on how those matrices are instantiated and accessed (are you entirely or mostly taking row slices versus column slices, any linear algebra?) would be recommended there - it could even affect whether you use a wrapper to access columns as rows versus rewriting rows into columns in the first place.

There is StaticArrays.jl, but performance does get worse than Arrays above some size. The rough rule of thumb is <100 elements, but in practice the sweet spot is usually much smaller, like a position in a few dimensions. Regardless of size, methods based on StaticArrays have to be compiled for each distinct size, so you’d want few unique sizes and a lot of reuse of those methods. A small example follows.
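
For illustration, a minimal StaticArrays sketch - the sizes are encoded in the types, so operations on small arrays can avoid heap allocation entirely:

using StaticArrays

v = SVector(1.0, 2.0, 3.0)   # stack-allocated vector; its length is part of the type
M = @SMatrix rand(3, 3)      # 3×3 static matrix with compile-time-known size
w = M * v                    # small matrix-vector product, no heap allocation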

The type annotations for the function arguments have no impact on performances (but could be useful for dispatch or documentation)

Oh, that’s useful and time-saving knowledge, thanks!

I’ll look into views; they seem like the way forward. In Rust I can pass references to functions to avoid re-allocating memory, but I wasn’t sure how to achieve this in Julia.

As for it not being an MWE, I apologise - I’m not exactly sure how to condense the program sufficiently.

1 Like

Your lines of code should probably look a bit more like this:

@views var_1 = (v * x) .* (c[1, 3] .* b[:, 1] .- d[1, 3] .* b[:, 3])

Take column slices, not row slices, and take care with broadcasting: the line above will fuse the operations instead of creating several intermediate arrays.

Other remarks:

  • A full MWE would help.
  • MATLAB is generally good at straight array code like this, so it’s not easy to beat.
  • MATLAB implicitly multi-threads code, so you may need to parallelize your code to compare fairly.
  • @inbounds is not necessary for slices like the above, and @fastmath should be among the very last things you try. I don’t think I ever use it, and it may change the results numerically.
6 Likes

I had a look at StaticArrays.jl, but most of the arrays I’m working with are 100-1000+ elements in size, so standard Arrays are probably a better bet.

More detail on how those matrices are instantiated and accessed (are you entirely or mostly taking row slices versus column slices) would be recommended there - it could even affect whether you use a wrapper versus transposing data before writing into a Matrix.

I am almost exclusively taking row slices (at least in the ‘hot’ parts of the code), so maybe I should just transpose them at initialisation.

IIRC, Rust multidimensional arrays default to row-major, so that could explain some of the difference from the MATLAB performance, which is already optimized for handling matrices.

1 Like

I fully recognise the point about a full MWE - the consensus here seems to be that any MWE beats none - so I’ll try to come back with one. I was aware that MATLAB implicitly multithreads, but it uses exactly the same resources as the Julia program - 100% of a single core and nothing more - so I kinda doubt (but may well be wrong) that it’s gaining any significant advantage.

@fastmath should be among the very last things you try. I don’t think I ever use it, and it may change the results numerically.

Very reasonable point - I was initially just shocked at the performance discrepancy and trying everything and anything in the performance tips reference (it helped that I didn’t have to rewrite masses of code to implement it either).

For smallish arrays like you are using, multithreading each line (which is probably what MATLAB would do) may not pay off. But since Julia gives you a lot more explicit control over threads, you could get a speedup by parallelizing over the different lines, like this:

using Base.Threads: @threads

@threads for i in eachindex(...)
    var[i] = @views ...
end

Naturally, you must figure out the pattern for the index access, and so on.

1 Like

Owing to the embarrassingly parallel nature of each instance/individual simulation, I plan to use threading to run instances simultaneously, rather than directly involving threads in the nitty-gritty parts of the code.

And thanks - simply using @views, as suggested by many, as well as rewriting my lines as you suggested (I have yet to employ column slices; that will require slightly greater modification), has reduced the running time to ~35 seconds/instance, which is a significant improvement!

2 Likes

Sounds like a good idea :+1:

That’s good, but to be clear, it seems very unlikely that Rust should be 4.5x faster than Julia here, or that MATLAB should be twice as fast, so there must be some significant inefficiencies in some parts of your code. If you make an MWE, it should be roughly 2x and 4x slower than similar examples in MATLAB and Rust respectively, in order to actually demonstrate the problem.

(Anything more than 20-30% slower than Rust probably warrants investigation.)

3 Likes

If the whole code is too large, perhaps it would be worth it to figure out which part is the bottleneck. You can then post only that part and ask people to help you make it more efficient.

You can read about profiling Julia code here: Optimizing your code (also, if you’re new to Julia, you might find other useful tips in there).
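
As a minimal sketch of the built-in sampling profiler (run_simulation is a placeholder for your own entry point):

using Profile

@profile run_simulation()   # collect samples while the code runs
Profile.print()             # print a tree showing where the time was spent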

1 Like

I think that’s the general consensus here when it comes to issues where users run into errors (you really want to be able to reproduce the error), but for performance-related questions it’s important that an MWE doesn’t send people down optimization rabbit holes. So when producing an MWE, I would encourage the following:

  • As @barucden said above, profile your larger code base to see where time is actually spent to ensure you’re not optimizing something that won’t matter in the grand scheme of things;
  • Make sure the MWE roughly matches your actual problem in terms of data types and size and produce relevant dummy data (of course if it takes >>minutes to run it’s helpful to reduce the problem size, but try to do so while retaining the characteristics of the problem - e.g. if your actual code is working on 100x100 matrices and does something 1,000 times, make the MWE 100x100 but 10 times only rather than 10x10 1,000 times)
  • As @DNF says, once you’ve found a potential MWE, check that that MWE still exhibits the relative slowdown vs Matlab/Rust

And to give a semi-constructive suggestion as well: people sometimes find that MATLAB/Julia differences on matrix-algebra-heavy code come down to the fact that MATLAB ships with MKL by default, while Julia uses OpenBLAS. The impact depends on your CPU and the code you’re running - sometimes it’s material, sometimes there’s no difference at all - but using MKL is a fairly simple thing to try (see below).
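
For reference, a minimal sketch of the swap, assuming MKL.jl is installed:

using MKL            # replaces the default OpenBLAS backend with MKL
using LinearAlgebra
BLAS.get_config()    # confirm which BLAS backend is now active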

6 Likes

I decided to do a benchmark:

using BenchmarkTools

function foo(v, x, b, c, d)
    v*x*(c[1,3]*b[1,:]-d[1,3]*b[3,:])
end

function bar(v, x, B, C, D)
    @views (v * x) .* (C[1, 3] .* B[1, :] .- D[1, 3] .* B[3, :])
end

function baz(v, x, B, C, D)
    @views (v * x) .* (C[1, 3] .* B[:, 1] .- D[1, 3] .* B[:, 3])
end

v, x = randn(), randn();
N = 1000;
B_, C_, D_ = randn(5, N), randn(5, N), randn(5, N);
B, C, D = randn(N, 5), randn(N, 5), randn(N, 5);

julia> @btime foo($v, $x, $B_, $C_, $D_);
  3.900 μs (6 allocations: 47.62 KiB)

julia> @btime bar($v, $x, $B_, $C_, $D_);
  970.000 ns (1 allocation: 7.94 KiB)

julia> @btime baz($v, $x, $B, $C, $D);
  549.215 ns (1 allocation: 7.94 KiB)

As you can see, adding @views and broadcasting dots reduced the allocations from five redundant intermediate arrays plus one output array to a single output array, and the runtime by a factor of ~4. Changing the orientation to column slices had a further, smaller impact (and even less so for smaller arrays).

Based on this, I am pretty sure the remainder of the performance discrepancy with Matlab and Rust is not related to the above array calculations, but resides in some part of the code that we haven’t seen yet.

2 Likes

I’ll add an extra remark - look at this:

function foo_(v, x, b, c, d)
    (c[1,3]*b[1,:]-d[1,3]*b[3,:])*v*x
end

I just moved the scalar multiplications behind the array expression.

julia> @btime foo_($v, $x, $B_, $C_, $D_);
  4.367 μs (7 allocations: 55.56 KiB)

Now there are 7 allocations, and a further slowdown. What happens in the above code is:

b[1,:]  # one allocation
c[1,3]*b[1,:] # another allocation when multiplying

d[1,3]*b[3,:] # this is also two allocations, one for the slice, one for the multiplication

c[1,3]*b[1,:]-d[1,3]*b[3,:]  # subtraction will allocate a fifth array

# next, you multiply with v, for sixth allocation:
(c[1,3]*b[1,:]-d[1,3]*b[3,:])*v

# finally, you multiply with x for a seventh allocation
(c[1,3]*b[1,:]-d[1,3]*b[3,:])*v*x

The reason this doesn’t happen in foo is that x * y * z is really (x * y) * z, so the scalar multiplication happens first, followed by a single multiplication with the array. It’s a good thing to be aware of: try to group scalar operations together, since it’s wasteful to multiply two scalars into an array in separate steps instead of doing the scalar multiplication first and then a single array multiplication.

This is also the reason why I didn’t write

(v .* x) .* (C[1, 3] .* B[:, 1] .- D[1, 3] .* B[:, 3])

but dropped the dot in (v * x), since that would cause one extra multiplication for each element. Not a big deal, but good to keep in mind. If I had used the dot macro, @., to dot everything, like this:

@views @. (v * x) * (C[1, 3] * B[:, 1] - D[1, 3] * B[:, 3])

it would also have dotted the v * x operation, giving a small performance cost.

3 Likes

As others have mentioned, without an MWE it’s hard to say anything. In any case, here are some broad comments to add to the discussion.

  1. For multithreading to be possible, you need to enable it before starting Julia; otherwise, even if you indicate multithreading at the code level, it won’t be multithreaded. If you’re using VS Code, you can directly set “number of threads” in Settings and then restart the session. (See the check below this list.)

  2. To quickly see whether copies/views play any major role, you could directly add @views before the function:

function foo(x)
	<code>
end

has to become

@views function foo(x)
	<code>
end

This will say “treat every slice as a view”, saving a lot of annotations if your function has multiple lines.

  3. Notice that adding . to some operations and not others will break so-called “loop fusion”. This prevents some optimizations and creates unnecessary allocations. There’s a macro to indicate that every operation should be broadcast: prepend the expression with @. (as shown earlier in the thread).

  4. In my experience, nothing beats @turbo (or its threaded variant @tturbo) from the package LoopVectorization. However, it has been deprecated as of Julia 1.11 (until its substitute is launched). You could also consider the package Tullio, although I don’t know whether its performance gains are retained now that LoopVectorization has been deprecated.

  5. Unlike in MATLAB, for-loops in Julia are fast. In fact, in my work, prepending a for-loop with @turbo tends to get the best results. If speed were my top priority, I’d think in terms of for-loops.

  6. If you have an Intel processor, you should consider the package MKL, which provides linear algebra routines specialized for Intel processors and can make a huge difference. For specific vectorized functions specialized for Intel processors, also consider IntelVectorMath.
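
As a quick check on the first point: start Julia with e.g. julia --threads=4 (or set the JULIA_NUM_THREADS environment variable), then confirm inside the session:

using Base.Threads

nthreads()   # should report the number of threads you requested, not 1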

3 Likes

Oh, and note, by the way, that the row/column orientation issue also applies to MATLAB (which is column-major too), so you may see a speedup there as well by fixing this.

1 Like