Julia vs Matlab - building a Jacobian matrix

Hello, I have decided to take the leap and use Julia instead of Matlab, so I have converted a previous Matlab script to Julia.
The only thing the script does is build the Jacobian matrix of a map. Everything is vectorized (no loops, only matrix operations: repeat / .(operator) / sparse), and yet it takes at least twice as long in Julia as in Matlab (timed with @elapsed function(…)).

This is the code:

module jac

using SparseArrays
using LinearAlgebra

	export DF

	function DF(x,m0,m,lambda,n,l)

		#############################
		# masses on the i-th circle #
		#############################

		theta = repeat(360/l*collect(0:1:l-1),1,n)
		xi_ext = repeat(x,1,l)'
		m_ext = repeat(m,1,l)'

		A = - m_ext
		B = similar(Array{Float64},axes(xi_ext))
		B .= sqrt(2)*(xi_ext.^3).*(ones(l,n) - cosd.(theta)).^(1/2)
		@views A[1,:] = zeros(n)
		@views B[1,:] = ones(n)

		# combine
		DF = sparse(collect(1:1:n),collect(1:1:n),vec(lambda*ones(n,1)-2*m0./x.^3 + sum(A./B,dims=1)'))

		#############################
		# masses on the j-th circle #
		#############################

		# angles associated with each mass
		theta = 360/l*collect(0:1:l-1)
		theta = repeat(theta,1,(n-1)*n) # extend theta

		# extend xi
		# l=0   -> [ x_1 ... x_1 | ... | x_n ... x_n ]
		# l=1   -> [ x_1 ... x_1 | ... | x_n ... x_n ]
		#   .      [      .      | ... |      .      ]
		# l=l-1 -> [ x_1 ... x_1 | ... | x_n ... x_n ]
		xi_ext = repeat(reshape(repeat(x,1,n-1)',n*(n-1),1),1,l)'

		# extend xj
		# l=0   -> [ x_2 x_3 ... x_n | ... | x_1 x_2 x_3 ... x_{n-1} ]
		# l=1   -> [ x_2 x_3 ... x_n | ... | x_1 x_2 x_3 ... x_{n-1} ]
		#   .      [        .        | ... |             .           ]
		# l=l-1 -> [ x_2 x_3 ... x_n | ... | x_1 x_2 x_3 ... x_{n-1} ]
		xj_ext = repeat(x,n,l)'
		@views xj_ext = xj_ext[:,setdiff(1:end, (0:1:n-1)+(1:n:n*n))]

		# extend mj the same way as xj
		mj_ext = repeat(m,n,l)'
		@views mj_ext = mj_ext[:,setdiff(1:end, (0:1:n-1)+(1:n:n*n))]

		##### Compute the derivatives
		xi_carre = xi_ext.^2
		xj_carre = xj_ext.^2
		prod_xixjcos = xi_ext.*xj_ext.*cosd.(theta)

		# diagonal elements of DF
		A, B = similar(Array{Float64},axes(xi_ext)), similar(Array{Float64},axes(xi_ext))
		A .= -mj_ext.*(4*xi_carre + xj_carre - 8*prod_xixjcos + 3*xj_carre.*cosd.(2*theta))
		B .= 2*(xi_carre + xj_carre - 2*prod_xixjcos).^(5/2)
		sum_j=sum(reshape(sum(A./B,dims=1),n-1,n),dims=1)

		# non-diagonal elements of DF
		A .= -mj_ext.*(-4*(xi_carre + xj_carre).*cosd.(theta) + xi_ext.*xj_ext.*(7*ones(l,(n-1)*n) + cosd.(2*theta)))
		sum_theta = sum(A./B,dims=1)

		# combine
		return DF + sparse(collect(1:1:n),collect(1:1:n),vec(sum_j)) + [tril(reshape(sum_theta,n-1,n)',-1) zeros(n,1)]+[zeros(n,1) triu(reshape(sum_theta,n-1,n)')]
	end
end

Three possible issues:

  1. the way I run the code is not efficient. What I did: I created a module, say X, in a .jl file. This module contains the function creating the Jacobian matrix. Then, I created a main.jl file which executes the function using include("/path/to/MODULE/X.jl") and X.FUNCTION_NAME(…). I launch the main.jl script from the Julia REPL with include("/path/to/SCRIPT/main.jl").

  2. the functions repeat / sparse are not as efficient in Julia as in Matlab.

  3. it has to do with multi-threading / parallel computing, but I still don't understand how to use or implement it in my code.

Could anyone help me understand what I have done wrong in the code?

Thank you for your help!

1 Like

It would be easier to analyze it if you show the exact call you are using for your benchmark. What are the values of all the arguments to jac.DF?

1 Like

Indeed, sorry about that. This is the main.jl file I wrote:

using LinearAlgebra
using SparseArrays

include("/path/to/module/jac.jl")


const lambda=-1
const n=100
const l=100
const m0=0
const m = ones(n,1)
x = reshape(collect(1:n),n,1)



@elapsed jac.DF(x,m0,m,lambda,n,l)

A few suggestions:

  1. Check out the performance tips for some general advice

  2. By only running the function once, you’re mostly just timing the compilation process. Using BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl) shows that, while the first @elapsed takes 1.02 seconds, your function actually only needs 0.28 seconds once compiled.

  3. Your code is allocating a huge amount of memory because of all of the vectorization you’re doing. Is your algorithm actually easiest to express in this way, or are you just writing vectorized code because it was necessary in MATLAB? Loops in Julia are fast, and it’s very often the case that the best thing to do is to write down whatever your “scalar” computation is as a function, then call that function on each data point. In particular, if all you’re doing is constructing the Jacobian of some vector function, you may not need to write any code at all. Have you tried https://github.com/JuliaDiff/ForwardDiff.jl ? (A small sketch follows this list.)

  4. Are the sparse operations actually necessary? Your final result is dense, so it’s not clear if all the conversions to and from sparse are helping at all.
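To illustrate point 3, here is a minimal, hypothetical sketch of what using ForwardDiff.jl could look like; F below is a stand-in map for this example, not the map from the original post:

using ForwardDiff

# F is a placeholder vector -> vector map; replace it with the actual map
F(x) = [x[1]^2 + x[2], sin(x[1]) * x[2]]

J = ForwardDiff.jacobian(F, [1.0, 2.0])   # 2×2 Jacobian of F evaluated at [1.0, 2.0]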

5 Likes

A few quick thoughts:

(1) It looks like you should be using . a lot more often. For example, when you add two matrices, you should use .+ rather than +.

(2) Given that you reuse things like collect(1:n) many times, you should just create the object once and reuse it.

(3) Matlab will automatically multithread a lot of vectorized operations. You would need to do this manually in Julia. You can accomplish this with a Threads.@threads for loop over the indices of your vectorized operations, doing the operation element by element. See this part of the docs; a small sketch follows this list.

(4) You’re using repeat and reshape a lot. I only have a minute right now, so I can’t look closely, but in my experience, one rarely actually needs these in Julia and should be using the same piece of memory repeatedly rather than allocating the same object many times to different pieces of memory.
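As a rough illustration of point (3), here is a minimal sketch of threading an element-wise operation over a pre-allocated output. The function name and the operation are made up for this example, and Julia must be started with more than one thread (e.g. by setting JULIA_NUM_THREADS before launching):

# threaded_square! is an illustrative name, not from the thread
function threaded_square!(out, x)
    Threads.@threads for i in eachindex(out, x)
        out[i] = x[i]^2      # each thread handles a chunk of the indices
    end
    return out
end

x = rand(1000)
out = similar(x)
threaded_square!(out, x)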

Thank you for your suggestions.

Suggestion 1.: I have already looked at it and tried to apply as many of the tips as I could.

Suggestion 2.: but if I launch main.jl again I obtain the same result. Is that because everything gets reloaded?

Suggestion 3.: well, I vectorized because vectorized code is always faster than loops, no (even if loops in Julia are not as slow as in Matlab)? Even though it takes a lot more memory.

Suggestion 4.: absolutely, the matrix at the end is dense. I thought it was faster to build a sparse matrix; I gather from this comment that using sparse can actually slow things down.

Thanks for your help! I will look into Threads.@threads then. Yes, I use repeat and reshape to remove all possible loops; that way all the computations are simply vectorized operations, so the cost is only memory, not speed, isn't it?

  1. Great!

  2. Correct. Native code is not cached between runs of Julia. This is a known issue, but it’s really only significant if you’re only calling your function once. Presumably in your actual code the jacobian is computed many times, right? In that case, the extra 0.7 seconds of compilation shouldn’t matter much, and it’s more important to look at the execution time once the code is compiled. BenchmarkTools.jl tells you exactly that.

  3. No, that’s not true. “Vectorized” code is only faster than loops when the language itself is incapable of executing loops quickly. That’s why it’s encouraged in “slow” languages like MATLAB and Python. Julia is completely different in that respect: code using loops in Julia can be faster and allocate less (or no) memory compared to “vectorized” code. In that respect, Julia is like C, C++, and Fortran.

Or, to put it another way: how do you think your “vectorized” code in MATLAB is being executed? At some point MATLAB calls a C/C++/Fortran library which just executes a loop over every element of that vector. So you’re telling MATLAB to construct a vector, and then MATLAB is telling C or C++ or Fortran to loop over that vector. In Julia you can just skip all that complexity. Just write a loop. For more detail, check out More Dots: Syntactic Loop Fusion in Julia; a tiny fusion sketch follows this list.

  4. It’s hard to say what will be faster a priori. Just try both ways with BenchmarkTools.jl and see which one works better.
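As a tiny illustration of the fusion point in item 3 above (the variable names are made up, not from the thread): every dotted operation in an expression fuses into a single loop, and writing into a pre-allocated array with .= (or @.) avoids allocating a new result entirely.

x = rand(1000); y = rand(1000)
z = 2 .* x .+ sin.(y)       # all the dots fuse into one loop; one allocation, for z
out = similar(x)
@. out = 2x + sin(y)        # @. dots everything and writes in place; no new allocation
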
4 Likes

In my experience, the vast majority of the time in most (slow) programs is spent allocating and garbage-collecting memory.

Thank you very much, I am going to look into all of this and get back to you as soon as possible.

I believe this is the same point that @rdeits explained to me: vectorizing is not necessarily the best option. From what you are telling me, when I build all my matrices I generate garbage that has to be collected, and this slows down the computation.

Yes. More precisely, your program has to find a contiguous block of memory to store the whole replicated matrix/vector and then take the time to fill it up (allocation). Then, at the end, it has to figure out whether any of it is actually still being used and, if not, make the space available for reuse (garbage collection). If you do as little of this as possible, your code will be much faster. I’m also a Matlab user, and this is the main way in which I have found performance improvements over Matlab. Being able to write de-vectorized code means that you can avoid all of these unnecessary allocations and garbage collections.

Here’s an example. Let’s say we want to compute the sum of the differences of some vector x. We can write that in a “vectorized” style:

julia> function sum_of_diff_1(x)
         sum(diff(x))
       end
sum_of_diff_1 (generic function with 1 method)

Or we could just write out a loop:

julia> function sum_of_diff_2(x)
         r = zero(eltype(x))  # get whatever the "zero" element of x is
         for i in 2:length(x)
           r = r + (x[i] - x[i - 1])
         end
       end
sum_of_diff_2 (generic function with 1 method)

Let’s see how they perform:

julia> using BenchmarkTools

julia> @btime sum_of_diff_1($(rand(10_000)))
  9.022 μs (4 allocations: 78.25 KiB)

julia> @btime sum_of_diff_2($(rand(10_000)))
  5.153 μs (0 allocations: 0 bytes)

The loop version is ~40% faster and allocates no memory at all, so its performance is strictly better than the vectorized code’s. That’s because the loop code and the vectorized code are essentially doing the same computations, but the vectorized one also has to allocate a whole new array just to hold the result of diff. Allocating that array is unnecessary: we don’t need the array of differences, we just need each element one at a time. So we save time and memory by just writing a loop.

Of course, you could reasonably argue that the vectorized code is shorter and perhaps even clearer in this case. That’s a trade-off that happens sometimes, but I find that it’s usually possible to write non-vectorized code that is pretty clean and also very efficient.

4 Likes

Edit: my benchmark was slightly unfair, as my “looped” code wasn’t actually returning anything. A fairer comparison would be:

julia> function sum_of_diff_2(x)
         r = zero(eltype(x))  # get whatever the "zero" element of x is
         
         @inbounds @simd for i in (firstindex(x) + 1):lastindex(x)
           r = r + (x[i] - x[i - 1])
         end
         r
       end
sum_of_diff_2 (generic function with 1 method)

julia> @btime sum_of_diff_2($(rand(10_000)))
  1.394 μs (0 allocations: 0 bytes)

which is way faster due to the @inbounds and @simd annotations, which tell Julia that it is allowed to use SIMD instructions to compute multiple parts of that loop in parallel.

2 Likes

No, reshape is lazy and doesn’t allocate.
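A quick demonstration of that (illustrative values only): reshape returns an array that shares the original data, so the underlying data is never copied.

julia> a = collect(1:6);

julia> b = reshape(a, 2, 3);   # shares a's memory, no copy of the data

julia> b[1, 1] = 99;

julia> a[1]
99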

1 Like

Fair point!

Alright, I just finished rewriting everything with loops. I simply define DF = zeros(n,n) and build it with DF[i,j] = .... for all i and j.
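For reference, a purely hypothetical skeleton of that kind of loop-based construction; the actual diagonal and off-diagonal formulas come from the map's derivatives and are not reproduced here:

# Hypothetical skeleton only: the entries below are placeholders, not the real derivatives
function DF_loops(x, m, lambda, n)
    DF = zeros(n, n)
    for j in 1:n, i in 1:n
        if i == j
            DF[i, j] = lambda - 2 * m[i] / x[i]^3    # placeholder diagonal term
        else
            DF[i, j] = m[j] / abs(x[i] - x[j])^3     # placeholder off-diagonal term
        end
    end
    return DF
end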

Indeed, the code now takes 388.985 ms (vectorized, no loops) versus 55.308 ms (loops only). Thank you for clarifying a LOT of things!

I have a question regarding @inbounds @simd. I am not sure I understand when I should specifically use them. Should I always add them and let Julia figure it out for me?

1 Like

That’s great! This is exactly the magic of Julia.

No, I’d suggest things like @simd and @inbounds should be left until you’re sure they’re necessary and you’re sure the code is already correct. They do different things, but they are often used together:

@inbounds tells Julia it is allowed to eliminate array bounds checking. That means that if you have an array x = [1,2,3] and you try to do x[4], Julia will normally throw a BoundsError(). But if you do @inbounds y = x[4], Julia is free to just…well…basically do anything. Usually you’ll either get garbage data or a crash. Sometimes both! This is one of those “great power and great responsibility” things. So only add @inbounds when (a) you are totally sure you’re not accessing any arrays out of bounds and (b) whatever speedup you get is actually worth the risk that you might introduce subtle bugs.

Another way to put it is that @inbounds indexing makes Julia behave like C and C++. That’s useful in the innermost loop of an algorithm, but it’s something you need to be careful with.
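As a minimal illustration (the function and its name are made up for this example), the responsibility for staying in bounds moves from Julia to the caller:

# sum_first_n is an illustrative name, not from the thread
function sum_first_n(x, n)
    s = zero(eltype(x))
    @inbounds for i in 1:n    # safe only if the caller guarantees n <= length(x)
        s += x[i]
    end
    return s
end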

@simd enables SIMD instructions which can process multiple values in a single CPU instruction. It has its own caveats (it can give slightly different results for floating-point operations because floating-point math is complicated and non-associative). I think @simd usually also requires @inbounds because in order to process multiple data points simultaneously, Julia needs to be able to guarantee that it won’t need to throw a BoundsError part-way through executing your loop.
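For example, here is a minimal dot-product sketch of the usual pattern (names are illustrative, not from the thread):

function dot_simd(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a, b)   # SIMD-friendly innermost loop
        s += a[i] * b[i]
    end
    return s
end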

4 Likes

Fantastic, thank you. I am going to bother you one last time about the syntax: after looking at the docs, I still don't understand whether I should put @inbounds @simd at the beginning of every loop, or whether I can call it once in front of my function.

Probably best to put them both directly on the loops that require them. They’re both important flags to readers that something potentially fishy might be going on, and having those flags close to the code they affect is a useful indicator. @inbounds in particular can be dangerous, so you want it to affect as little code as possible. So I’d recommend

function foo()
  @inbounds for i in 1:10 
     ...
  end
  
  ... # this line still gets normal bounds checks
   
  @inbounds for j in 1:20
    ...
  end
end

rather than:

function foo()
  @inbounds begin
    for i in 1:10 
       ...
    end
  
    ... # Oops, this line also has no bounds checks. I hope it doesn't have any bugs!
   
    for j in 1:20
      ...
    end
  end
end
1 Like