 # Comparing Numba and Julia for a complex matrix computation

I am thinking about slowly switching from Python to Julia. My main problem is experience: several years of writing Python code versus having only read some books about Julia (such as 'Julia High Performance' by Avik Sengupta).
What I hope to get is a better ratio of speed gained per time spent optimizing.

To understand the possible gains, and what I have to do to make the code as fast as possible, I wanted to start with some code that is already highly optimized but still at least a factor of 10 too slow for production use. From the books I read I got some ambitious ideas I wanted to test: multithreading, GPU computing, and distributed computing (across different local computers).

The original Numba/NumPy Python code (of the minimal code example) can be found here: python - Faster iteration over 2D Numpy/CuPy arrays based on unique values - Stack Overflow

I wanted to translate the code to Julia, but got stuck in the very first lines, where I create matrices that are constant and used in the loop. I want to create a 2D matrix of the atan of its shifted indices.

What I want to get is the matrix of angles (in integer degrees) measured from the starting point (where both the x- and y-offset are 0).
My idea was to create a 2D matrix of the x indices and a 2D matrix of the shifted indices (by half its width), and then calculate the atan from them with the following code:

``````x_offsets = [Array(0:width-1)  .- (width-1)/2 for x in 0 : height-1];
y_offsets = [fill(x, width-1) for x in 0: height-1];

angle_mat = round.(2 .*(90 .- rad2deg.(atan.(y_offsets, x_offsets))) .+ 180)
``````

This code does not work. I guess it is because I use a vector of vectors instead of a matrix, so the dot operators broadcast over the outer array instead of over all the values.

I hope you can give me some advice/tips, both on how to achieve the overall aim and on how to solve the current problem.

Are you looking for something like the following?

``````height, width = 187, 746
round.(Int, 2 .* (90 .- atand.(0:height-1, (0:width-1)')) .+ 180)
``````

I should add that this is almost certainly the wrong way to go about getting performance improvements.

“Highly optimized” code in Python usually means code that is “vectorized” as much as possible — broken into a series of relatively simple array operations that can individually call fast library operations. (For example, calling `atan` on a bunch of numbers in your code above.) You can write the same sort of code in Julia, of course, but it won’t be magically faster than Numpy — there’s no secret ingredient in Julia that allows us to (for example) subtract two arrays faster.

Instead, the key to unlocking performance benefits is to realize that in Julia you don’t have to limit yourself to “vectorized library” functions for performance. For example, you perhaps shouldn’t create your array of `atan` values at all, but instead try to combine the `atan` computation with subsequent or previous computations. (Loops are fast in Julia, and there are packages like LoopVectorization to make them even faster.)

For the same reason, micro-benchmarks of a single elementary operation (like computing a bunch of `atan` values) are unlikely to reveal much. Numpy is probably already going as fast as it is possible to go for this operation by itself. You really need to look at a non-trivial sequence of calculations.
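As a toy illustration (not taken from the original problem): fusing the `atan` into the consuming loop avoids the intermediate arrays that the vectorized style has to materialize.

```julia
# Vectorized style: each step materializes a full intermediate array.
function vectorized(y, x)
    angles = atan.(y, x')      # allocates a matrix of angles
    return sum(cos.(angles))   # allocates another matrix
end

# Fused style: one loop, no intermediates; atan feeds the next step directly.
function fused(y, x)
    s = 0.0
    for xv in x, yv in y
        s += cos(atan(yv, xv))
    end
    return s
end
```

In Python only the first style is fast; in Julia both are, and the fused loop often wins because it never touches memory for the intermediate arrays.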

(If you have trouble with this, feel free to post optimized Python code that performs a dozen lines of “vectorized” calculations, along with sample inputs, and ask for help in producing an optimized Julia equivalent.)


I wish I could make a living just doing this sometimes (I mean the "producing an optimized Julia equivalent" part).


There are many things to address here, but because I have little time, I’ll just say: never do the above. You’re turning a lightweight range into a big old array. Just leave it as a range.
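To see the difference (a small sketch): a range behaves just like the materialized vector under broadcasting, but costs almost no memory.

```julia
width = 746
r = 0:width-1          # lazy range: stores only its endpoints
a = Array(0:width-1)   # materialized vector: stores all 746 values

# Broadcasting treats both the same way:
centered_r = r .- (width - 1) / 2
centered_a = a .- (width - 1) / 2
```

`sizeof(r)` is a couple of machine words regardless of `width`, while `sizeof(a)` grows linearly.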


Another thing, which @sudete has already highlighted, is that there is no need to create all those repeated arrays, where you simply have multiple copies of the same column or row many times. It is not necessary in Julia, and it is not necessary in Python: both Julia and NumPy support broadcasting, NumPy for some operations and Julia universally through the dot operator.

So in Julia you can just write `atand.(x, y')` (broadcasting a column against a row), and in NumPy you can write

``````np.arctan2(x, y)
``````

if `x` is M×1 and `y` is 1×N (apparently it does not work for 1-D vectors, only matrices).

The creation of repeated, meshgrid-like arrays is an anti-pattern that should be avoided in Julia, Python, and Matlab alike (yes, Matlab also supports broadcasting to some extent).
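A minimal sketch of the difference (dimensions chosen arbitrarily): the meshgrid style builds two full matrices that broadcasting makes unnecessary.

```julia
height, width = 4, 6
y = 0:height-1
x = 0:width-1

# Meshgrid anti-pattern: two height×width matrices of repeated rows/columns.
Y = repeat(y, 1, width)
X = repeat(x', height, 1)
meshgrid_result = atand.(Y, X)

# Broadcasting: a column against a row expands implicitly, no copies made.
broadcast_result = atand.(y, x')
```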


Exactly. That's what I intended to do, but was not able to. I didn't know that I have to transpose the second vector; that's why I tried the crude version.

I finally was able to get the example running. Primarily I wanted to ask whether the code and its style are okay before I optimize it. I thought it would be somewhat faster even on a single core. It also has a lot of allocations!

``````height, width = 187, 746;
org_sized = rand(Float32, (2001, 2001)) * 60;
height_mat = rand(Float32, (height, width)) * 100;
angle_mat = round.(Int32, 2 .* atand.(0:height-1, (0:width-1)' .- (width/2-1))) .+ 1;

enlarged_size = zeros(eltype(org_sized), size(org_sized, 1) + height, size(org_sized, 2) + width);
enlarged_size[1:size(org_sized, 1), range(Int(width/2), length=size(org_sized, 2))] = org_sized;

shadow_time_hrs = zeros(Float32, size(org_sized) .- 1);  # output array

function computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = enlarged_size[y:y+height, x:x+width]
    shadowed_segments = zeros(eltype(weights), 361)

    for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                angle = angle_mat[y2, x2]
                shadowed_segments[angle] += weights[angle]  # record the weight of this cell's angle
            end
        end
    end
    return sum(shadowed_segments)
end

function computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
``````

The timing:

1827.429890 seconds (12.02 M allocations: 2.050 TiB, 2.64% gc time, 0.00% compilation time)

What would be the next steps (after fixing the code) you would take to optimize it?

Your code is not working: the `weights` array is not initialized. Please update the post to fix it; if you want help with the code, it is a lot better if the code can run.

Use a `@view` here:

``````short_elevations = @view enlarged_size[y:y+height, x:x+width]
``````

Slices create new arrays by default in Julia. You can also use `@views function ...` so that all slices are treated as views inside the function.

You should also consider preallocating `shadowed_segments`, which is being created on every call. These two things should solve most of the allocations.
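To see what the `@view` buys (a toy measurement, not the thread's code): slicing copies by default, and `@allocated` makes the difference visible.

```julia
A = rand(Float32, 1000, 1000)

slice_copy(A) = sum(A[1:100, 1:100])           # A[...] allocates a 100×100 copy
slice_view(A) = sum(@view(A[1:100, 1:100]))    # @view reads A in place

# warm up so @allocated doesn't count compilation
slice_copy(A); slice_view(A)
copy_bytes = @allocated slice_copy(A)
view_bytes = @allocated slice_view(A)
```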

Ah sorry, it seems I missed a line when I copied everything. The new code with the `@view` is below. It reduces the execution time by 50 % and the allocations by two orders of magnitude.
I also corrected a logical error due to which a matrix of 0's was returned.

``````height, width = 187, 746;
org_sized = rand(Float32, (2001, 2001)) * 60;
height_mat = rand(Float32, (height, width)) * 100; # originally values getting larger from (0, width//2) to the outside with the distance squared

angle_mat = round.(Int32, 2 .* atand.(0:height-1, (0:width-1)' .- (width/2-1))) .+ 1;

enlarged_size = zeros(eltype(org_sized), size(org_sized, 1) + height, size(org_sized, 2) + width);
enlarged_size[1:size(org_sized, 1), range(Int(width/2), length=size(org_sized, 2))] = org_sized;

weights = rand(Float32, 361)/ 10 .+ 0.01;  # weights originally larger in the middle
shadow_time_hrs = zeros(Float32, size(org_sized) .- 1);  # output array

function computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    shadowed_segments = zeros(eltype(weights), 361)

    for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                angle = angle_mat[y2, x2]
                shadowed_segments[angle] += weights[angle]
            end
        end
    end
    return sum(shadowed_segments)
end

function computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
unique(result)
``````

Timing:

937.750795 seconds (4.06 M allocations: 11.448 GiB, 0.07% gc time, 0.00% compilation time)

What's next? Since I don't use side effects, I don't gain anything from initializing the globals as constants, right?

Is it possible to combine multithreading with using the VPU (SIMD vector units)?


A question about the preallocation of `shadowed_segments`!

I was not sure whether I should do that, because I change its values inside the function. Will the array still be zero outside the function, and therefore at the start of the next call?

I have not looked at the code in all detail. But if you are assigning new values to the array, there is no need for it to be filled with zeros; it can be anything. If you need it to actually be zeros at the beginning of the function, you can reset it without allocating it again.

Also, if the size changes from iteration to iteration, you can allocate the maximum possible size and pass the inner function a view of the larger array with the proper size.

You can preallocate the `shadowed_segments` array and pass it in as a function argument, but you should reinitialize it to zero on each call (with `shadowed_segments .= 0`).
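A sketch of that pattern with a hypothetical helper (`sum_weights!` is not from the thread): the buffer is allocated once by the caller and reset in place on every call.

```julia
function sum_weights!(buf, angles, weights)
    buf .= 0                    # reset in place (fill!(buf, 0) works too); no allocation
    for a in angles
        buf[a] += weights[a]    # accumulate into the per-angle bins
    end
    return sum(buf)
end

weights = ones(Float32, 361)
buf = zeros(Float32, 361)       # allocated once, reused across calls
total = sum_weights!(buf, [1, 1, 2], weights)
```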

Ref.: ipynb file

1. Original version with smaller `org_sized = rand(Float32, (401, 401)) * 60`

``````height, width = 187, 746;
#org_sized = rand(Float32, (2001, 2001)) * 60;
org_sized = rand(Float32, (401, 401)) * 60;
height_mat = rand(Float32, (height, width)) * 100; # originally values getting larger from (0, width//2) to the outside with the distance squared

angle_mat = round.(Int32, 2 .* atand.(0:height-1, (0:width-1)' .- (width/2-1))) .+ 1;

enlarged_size = zeros(eltype(org_sized), size(org_sized, 1) + height, size(org_sized, 2) + width);
enlarged_size[1:size(org_sized, 1), range(Int(width/2), length=size(org_sized, 2))] = org_sized;

weights = rand(Float32, 361)/ 10 .+ 0.01;  # weights originally larger in the middle
shadow_time_hrs = zeros(Float32, size(org_sized) .- 1);  # output array

function computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    shadowed_segments = zeros(eltype(weights), 361)

    for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                angle = angle_mat[y2, x2]
                shadowed_segments[angle] += weights[angle]
            end
        end
    end
    return sum(shadowed_segments)
end

function computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
r0 = copy(result);
``````

Timings:

`````` 28.222513 seconds (358.24 k allocations: 480.214 MiB, 0.23% gc time, 0.22% compilation time)
28.034453 seconds (160.00 k allocations: 468.750 MiB, 0.18% gc time)
27.827226 seconds (160.00 k allocations: 468.750 MiB, 0.17% gc time)
``````

2. Pre-allocated version with `weights = Float32.(weights)` (The eltype of the original `weights` is `Float64`.)

``````weights = Float32.(weights)

function computeSumHours!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, shadowed_segments)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    shadowed_segments .= 0  # reset the preallocated buffer instead of reallocating

    @inbounds for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                angle = angle_mat[y2, x2]
                shadowed_segments[angle] += weights[angle]
            end
        end
    end
    return sum(shadowed_segments)
end

function computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    shadowed_segments = zeros(eltype(weights), 361)  # allocated once, reused by every call
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, shadowed_segments)
        end
    end
    return shadow_time_hrs
end

@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
r1 = copy(result);
``````

Timings (~30% faster):

`````` 20.481910 seconds (207.30 k allocations: 12.225 MiB, 0.29% compilation time)
20.319880 seconds (1 allocation: 1.594 KiB)
20.339460 seconds (1 allocation: 1.594 KiB)
``````

3. Multi-threaded version

``````weights = Float32.(weights)

function computeSumHours!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, shadowed_segments)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    shadowed_segments .= 0

    for x2 in 1:width
        for y2 in 1:height
            @inbounds overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                angle = angle_mat[y2, x2]
                shadowed_segments[angle] += weights[angle]
            end
        end
    end
    return sum(shadowed_segments)
end

function computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    buffers = [zeros(eltype(weights), 361) for _ in 1:Threads.nthreads()]  # one scratch buffer per thread
    Threads.@threads for x in 1:size(org_sized, 2) - 1
        buffer = buffers[Threads.threadid()]
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, buffer)
        end
    end
    return shadow_time_hrs
end

@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@time result = computeAllLines!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
r2 = copy(result);
``````

Timings (~ 5 times faster):

``````Threads.nthreads() = 12
5.817959 seconds (10.12 M allocations: 997.270 MiB, 3.60% gc time, 1.31% compilation time)
5.754861 seconds (9.95 M allocations: 987.269 MiB, 3.29% gc time)
5.781787 seconds (9.95 M allocations: 987.292 MiB, 3.26% gc time)
``````

4. Consistency

``````r0 ≈ r1 ≈ r2
``````
``````true
``````

If you thread the outer loop, you probably get better scaling. Adding a `@simd` flag to the inner loop also helps (be careful with concurrency with all these, in your real case). Here, with 4 threads:

`````` nthreads = 4
43.783321 seconds (85.06 k allocations: 4.931 MiB, 0.16% compilation time)
9.358107 seconds (258.46 k allocations: 15.135 MiB, 0.07% gc time, 1.21% compilation time)
Test Passed
``````

(The allocations are from compilation; you can run the code twice in the same Julia session and they will mostly disappear.)

``````using Base.Threads
using Test

# original non-allocating
function computeSumHours_orig(
    org_sized, enlarged_size, angle_mat, height_mat, weights, y, x,
)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    result = zero(eltype(weights))

    for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                result += weights[angle_mat[y2, x2]]
            end
        end
    end
    return result
end

function computeAllLines_orig(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours_orig(
                org_sized, enlarged_size, angle_mat, height_mat, weights, y, x,
            )
        end
    end
    return shadow_time_hrs
end

# @simd version
function computeSumHours(
    org_sized, enlarged_size, angle_mat, height_mat, weights, y, x,
)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    result = zero(eltype(weights))

    for x2 in 1:width
        @inbounds @simd for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                result += weights[angle_mat[y2, x2]]
            end
        end
    end
    return result
end

# threaded outer loop
function computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    @threads for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours(
                org_sized, enlarged_size, angle_mat, height_mat, weights, y, x,
            )
        end
    end
    return shadow_time_hrs
end

import Random
Random.seed!(123)

T = Float32
height, width = 187, 746;
#org_sized = rand(T, (2001, 2001)) * 60;
org_sized = rand(T, (401, 401)) * 60
height_mat = rand(T, (height, width)) * 100; # originally values getting larger from (0, width//2) to the outside with the distance squared

angle_mat = round.(Int64, 2 .* atand.(0:height-1, (0:width-1)' .- (width/2-1))) .+ 1;
enlarged_size = zeros(eltype(org_sized), size(org_sized, 1) + height, size(org_sized, 2) + width);
enlarged_size[1:size(org_sized, 1), range(Int(width/2), length=size(org_sized, 2))] = org_sized;
weights = rand(T, 361)/ 10 .+ 0.01;  # weights originally larger in the middle
shadow_time_hrs = zeros(T, size(org_sized) .- 1);

@time result0 = copy(computeAllLines_orig(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights));  # copy so the two runs don't share the output array
@time result = computeAllLines(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights);
@test result ≈ result0
``````

@leandromartinez98 Thanks! I wasn't able to get the multithreading running myself.

I also tried `@turbo` and `@fastmath`. The latter only worked when I put it in front of the comparison, not in front of the for loop, and it looks like it's even slower. At first I wondered why `@simd` didn't make the code faster, but looking at `@code_native` it appears vectorization was turned on even without the macro.

Error for `@turbo` was:

``````if overshadowed
angle = angle_mat[y2, x2]
end
end

 push!(ls::LoopVectorization.LoopSet, ex::Expr, elementbytes::Int64, position::Int64)
@ LoopVectorization C:\Users\sebas\.julia\packages\LoopVectorization\PDSoT\src\modeling\graphs.jl:1180
``````

Does that mean it's unable to set the value of the vector?
What can I do to solve it?

Additionally, I tried the code on different processors, for example a Core i7-6700K and a Ryzen 7 4800H. The latter is a laptop CPU but has 8 physical cores, while the Intel CPU only has 4. Yet the code runs about 25 % faster on the Intel machine (8 threads vs. 8 threads), although in benchmarks the Ryzen is said to be 42 % faster at the same thread count. So I wanted to ask whether you know a reason. I also compared the speed of a Ryzen 2400G vs. the 4800H, which shows almost the same speed difference as in the benchmarks. So is it an Intel vs. Ryzen thing? In Python I would have guessed that AVX2 isn't being used, but I checked that with `@code_native` and it's fine. The measurement was done with `@btime`.

By the way, the Julia code is almost 80 % faster than the Python code on the Core i7.

Is it possible to run one or both of the functions on a GPU? Is it easy?
If not, what would I do for distributed computing? Would it look like the multithreading code, in the sense that the inner loop of the outer function is also shared? Or should I do something different? The communication with another PC is much slower than within a single CPU.

Sometimes just adding an `@inbounds` flag allows the compiler to turn on some vectorization. `@turbo` does not handle all possible expressions, and that is why I didn't use it there. I cannot help much with whether, or how, you can change the code to allow it to run, but you would need to remove the branch somehow.
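One common way to remove the branch (a toy example, not the thread's exact loop) is to multiply by the `Bool` instead of testing it, since `false` and `true` act as 0 and 1 in arithmetic:

```julia
weights = rand(Float32, 361)
angles  = rand(1:361, 1_000)
mask    = rand(Bool, 1_000)

# Branchy: the `if` blocks straightforward SIMD vectorization.
function branchy(mask, angles, weights)
    s = 0.0f0
    for i in eachindex(mask)
        if mask[i]
            s += weights[angles[i]]
        end
    end
    return s
end

# Branchless: multiplying by the Bool gives the same sum without a branch.
function branchless(mask, angles, weights)
    s = 0.0f0
    @inbounds @simd for i in eachindex(mask)
        s += mask[i] * weights[angles[i]]
    end
    return s
end
```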

I think it is possible, yes, but I don't have much experience with that. I would guess it is quite easy in this case (meaning you probably only have to change a few things in the code, like how the arrays are initialized, and perhaps use FLoops with `CUDAEx()`).

You should always try to parallelize the code so as to reduce communication as much as possible, thus at the highest level possible, particularly if you are using distributed computing.
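For example, with the stdlib Distributed (a hedged sketch; `process_columns` and the chunk sizes are made up here): ship whole blocks of work to each worker, so that communication happens once per block rather than once per element.

```julia
using Distributed
addprocs(2)  # local workers; remote machines can be added the same way

@everywhere function process_columns(xs)
    # stand-in for an expensive per-column computation
    return [sum(sin, 1:x) for x in xs]
end

# one message per chunk, not per element
chunks = [1:100, 101:200]
results = pmap(process_columns, chunks)
```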


Hi. It took me a little while to get my head around the code, and there are still some things I'm unclear on. But it struck me that some intermediate steps could be skipped, so I took that approach as follows:

``````function computeSumHours5(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height-1, x:x+width-1]
    result = 0.0

    @views @inbounds for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            if overshadowed
                result += weights[ angle_mat[y2, x2] ]
            end
        end
    end

    return result
end

function computeAllLines5(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours5(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end
``````

As you can see, I skipped the intermediate `shadowed_segments` calculation, since it just gets summed in the end anyhow, and instead add to a float on each iteration. As to performance compared to the base version:

``````julia> @btime res0 = computeAllLines($org_sized, $enlarged_size, $angle_mat, $height_mat, $shadow_time_hrs, $weights);
35.746 s (160000 allocations: 468.75 MiB)

julia> @btime res5 = computeAllLines5($org_sized, $enlarged_size, $angle_mat, $height_mat, $shadow_time_hrs, $weights);
19.833 s (0 allocations: 0 bytes)

julia> res0 == res5
true
``````

So that has almost halved the time, and I think you could improve upon it with some of the other suggestions above (I'm new to Julia, so I'm not good at threading and such yet).

However, one thing I don't understand in your problem is `angle_mat`: why is this a matrix? There are a lot of repeated values in it due to the rounding in its generation, so I wonder if there could be efficiencies gained by avoiding repeated calculations later on. But maybe I missed something about why this is a matrix with a lot of repeated values?


It looks like @jroon’s simplified and allocation-free approach also makes it easy to add both SIMD vectorization and multithreading with the `@tturbo` macro from @elrod’s always awesome LoopVectorization.jl:

``````using LoopVectorization

function computeSumHours5_lv(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height-1, x:x+width-1]
    result = 0.0

    @tturbo for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            # Note that we have to eliminate the branch to SIMD-vectorize:
            result += overshadowed * weights[ angle_mat[y2, x2] ]
        end
    end

    return result
end

function computeAllLines5_lv(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours5_lv(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end
``````

Which gives us more than a 16x speedup on an 8-core CPU!

``````julia> @btime res5 = computeAllLines5($org_sized, $enlarged_size, $angle_mat, $height_mat, $shadow_time_hrs, $weights);
20.298 s (0 allocations: 0 bytes)

julia> @btime res5_lv = computeAllLines5_lv($org_sized, $enlarged_size, $angle_mat, $height_mat, $shadow_time_hrs, $weights);
1.280 s (0 allocations: 0 bytes)

julia> res5 == res5_lv
true
``````