# Improving function performance - Broadcasting vs Loops

I’m slowly learning some of the tricks that can make `julia` code quicker. Coming from `python` I still find it strange to write a loop instead of something vectorised, but it seems that sometimes the loop is the quicker option.

I’ve found this example where I’m just calculating the norm of each vector when passing in an array of many 3D vectors, i.e. `cvec_1 = rand(50000,3)`.

I started with a broadcasted function and this was relatively quick:

``````
function normalize_by_row(arr)
    arr ./ sqrt.(sum(arr.^2, dims=2))
end

@btime nrm = normalize_by_row($cvec_1);
  898.900 μs (10 allocations: 2.94 MiB)
``````

But I also wanted to go check the looped possibilities so I have tried a few different versions using what I’ve picked up about `@views` and `@inbounds`:

``````
function normalize_by_row_v2(arr)
    norms = similar(arr)
    for i in axes(arr,1)
        @views norms[i,:] = arr[i,:] / sqrt(sum(arr[i,:].^2))
    end
    return norms
end

function normalize_by_row_v3(arr)
    norms = Vector{Float64}(undef, size(arr,1))
    for i in axes(arr,1)
        @views norms[i] = sqrt(sum(arr[i,:].^2))
    end
    return arr ./ norms
end

function normalize_by_row_v4(arr)
    norms = Vector{Float64}(undef, size(arr,1))
    temp = arr.^2
    for i in axes(arr,1)
        @views norms[i] = sqrt(sum(temp[i,:]))
    end
    return arr ./ norms
end

function normalize_by_row_v5(arr)
    norms = Vector{Float64}(undef, size(arr,1))
    temp = arr.^2
    for i in axes(arr,1)
        @inbounds @views norms[i] = sqrt(sum(temp[i,:]))
    end
    return arr ./ norms
end
``````

I guess I’ve had varying degrees of success, with the first one being both slow and memory hungry and the final looped version being nearly 20% faster than the broadcasted one:

``````
julia> @btime nrm2 = normalize_by_row_v2($cvec_1);
  4.016 ms (110166 allocations: 13.03 MiB)

julia> @btime nrm3 = normalize_by_row_v3($cvec_1);
  1.999 ms (55086 allocations: 7.56 MiB)

julia> @btime nrm4 = normalize_by_row_v4($cvec_1);
  782.800 μs (6 allocations: 2.94 MiB)

julia> @btime nrm5 = normalize_by_row_v5($cvec_1);
  748.200 μs (6 allocations: 2.94 MiB)
``````

The final one seems to be pretty good, but I’m wondering are there any further improvements that could be made to such a function?

I also thought I could use `LoopVectorization` and `@turbo` in v5 of the function, but it doesn’t really work and throws an error:

``````
ERROR: ArgumentError: invalid index: VectorizationBase.MM{4, 1, Int64}<1, 2, 3, 4> of type VectorizationBase.MM{4, 1, Int64}
``````

So can I get any faster? Is there some other important `julia` concept that I’ve missed?

Yes. Julia arrays are column major, so you should work along columns, not rows.
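
To make the layout concrete, here is a minimal sketch (the function name is just for illustration) of a column-oriented loop, where the inner index walks contiguous memory:

```julia
# Julia stores matrices column-major: arr[1,j], arr[2,j], ... are adjacent.
# Iterating the first index innermost therefore reads contiguous memory.
function colwise_norms!(out, arr)       # arr is (3, N): one vector per column
    for j in axes(arr, 2)               # outer loop over columns
        s = zero(eltype(arr))
        for i in axes(arr, 1)           # inner loop over the contiguous dimension
            s += arr[i, j]^2
        end
        out[j] = sqrt(s)
    end
    return out
end
```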


In the first one, the allocations are probably because of an intermediate vector created inside `sum`. Try this:

(I’m not sure if that works with the dims keyword)
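
The code block appears to have been lost from this post; presumably the suggestion was the three-argument form `sum(abs2, …)`, which avoids allocating the intermediate `arr.^2`. A sketch (the function name is mine; `sum(f, A; dims)` does accept the `dims` keyword):

```julia
# Three-argument sum applies abs2 elementwise during the reduction,
# so no temporary arr.^2 array is materialised.
function normalize_by_row_abs2(arr)
    arr ./ sqrt.(sum(abs2, arr, dims=2))
end
```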

In all cases, you can probably avoid creating intermediate arrays entirely by computing the norms one by one in a loop over the rows.

(If you can store the data so that you normalize along columns, as pointed out above, that will make a huge difference.)

Don’t be afraid to use `for` loops. For example:

``````
function normalize_by_row_v7(arr)
    u = similar(arr)
    @inbounds for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end
        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end

@btime normalize_by_row($cvec_1);
  1.033 ms (10 allocations: 2.67 MiB)

# Function defined above
@btime normalize_by_row_v7($cvec_1);
  472.210 μs (2 allocations: 1.14 MiB)
``````

You can probably do even better by using `@turbo`. And as @DNF said, this would probably be much faster if the vectors were aligned along the first dimension, as in `cvec_1 = rand(3, 50000)`.


Work along columns, and try to avoid allocations:

``````
function normalize_loop!(arr)
    @views for i in axes(arr, 2)
        v = sqrt(sum(abs2, arr[:, i]))
        arr[:, i] ./= v
    end
    return arr
end
``````

This has zero allocations.


This is quick, but it’s not returning the correct answer unless I transpose the array before calling the function, which I suspect would make it slower than @jipolanco’s solution. I need to have a better look after lunch to see the difference between these.

It seems that I should possibly just think more about rewriting everything to be column-major in future though.
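
If it helps, the one-time conversion is cheap compared with repeated row-wise passes; a sketch (the variable names are illustrative, and note that `permutedims` makes a copy, unlike the lazy `transpose`):

```julia
# Convert once so that each 3-vector occupies a contiguous column.
cvec_rows = rand(50_000, 3)           # original row-oriented layout
cvec_cols = permutedims(cvec_rows)    # 3×50_000: column-major friendly
```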

Also, should there be an `@inbounds` for the for loop here?

Yes, my point is that you should make sure that your data is column major in the first place.

I tried `@inbounds` but it made no difference.

This is much quicker than mine, but unfortunately `@turbo` doesn’t work. Have I done something wrong?

``````
function normalize_by_row_v7(arr)
    u = similar(arr)
    @turbo @inbounds for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end

        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end

Stacktrace:
 getproperty(x::Core.NewvarNode, f::Symbol)
@ Base .\Base.jl:33
@ LoopVectorization C:\Users\jpmor\.julia\packages\LoopVectorization\ZPTim\src\constructors.jl:47
``````

Seems like it possibly can’t be made any quicker. Not bad though: it’s already about 4x quicker than the existing `python` implementation. I’m slowly beginning to think it would be worth all the effort to convert everything to `julia`!!

Here’s a pretty quick version, if you are willing to use StaticArrays. This is useful if you know that one of the dimensions is a fixed number, for example if you are working in 3D space:

``````
using StaticArrays, LinearAlgebra

mynormalize!(arr::AbstractArray{<:AbstractVector}) = (arr ./= norm.(arr))

X = rand(SVector{3, Float64}, 50_000);

jl> @btime mynormalize!($X);
  116.500 μs (0 allocations: 0 bytes)
``````

Yes, I can nearly always fix one dimension, so it seems that `StaticArrays` is something I need to learn more about too.

This is nearly 10x quicker than my well optimised `python` implementation so this is more than I hoped for.

You can also take a look at GitHub - mateuszbaran/HybridArrays.jl: Arrays with both statically and dynamically sized axes in Julia, which allows you to keep the data as a matrix. But frankly, I think that for storing a set of coordinates it makes more sense to keep them as a vector of vectors, in which case StaticArrays are brilliant.
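
If the data already lives in an N×3 matrix, one way to get a vector of `SVector`s without an element-by-element loop is `reinterpret` (a sketch; the `reshape` form requires Julia ≥ 1.6):

```julia
using StaticArrays

X = rand(50_000, 3)
# permutedims makes each point contiguous (3×N); reinterpret then views that
# memory as one SVector{3,Float64} per point, and copy detaches the result.
pts = copy(reinterpret(reshape, SVector{3,Float64}, permutedims(X)))
```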

Another possibility: I cannot get any speedup from `@tturbo`, and `Threads.@threads` needs longer vectors to become efficient on my computer. But if I use Polyester.jl directly:

``````
using Polyester, LinearAlgebra

function mynormalize_p!(arr::AbstractArray{<:AbstractVector})
    @batch for i in eachindex(arr)
        arr[i] = arr[i] ./ norm(arr[i])
    end
    return arr
end

jl> @btime mynormalize_p!($X);
  26.200 μs (1 allocation: 16 bytes)
``````

I’ll fix this in a future release, but for now `@turbo` requires “perfectly nested” loops, meaning it requires

``````
for i in I
    ...
    for j in J
        ...
    end
    ...
end
``````

or

``````
for i in I
    ...
    for j in J
        ...
    end
    ...
    for j in J
        ...
    end
    ...
end
``````

So you can try `@fastmath`, or add either `@turbo` or `@simd` to the inner loops.
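
For instance, a sketch of v7 with `@simd` on each inner loop (timings not measured here):

```julia
function normalize_by_row_v7_simd(arr)
    u = similar(arr)
    @inbounds for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        # @simd is applied to each innermost loop; the outer loop stays plain.
        @simd for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end
        anorm = sqrt(anorm2)
        @simd for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end
```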

Or, this

``````
using LoopVectorization, StrideArraysCore

normalize_by_row_v8(arr) = normalize_by_row_v8!(similar(arr), arr)

function normalize_by_row_v8!(
    u::AbstractMatrix{T}, arr::AbstractMatrix{T}
) where {T}
    W = LoopVectorization.pick_vector_width(eltype(arr))
    W4 = W * LoopVectorization.StaticInt(4)
    buffer = StrideArray{T}(undef, (W4,))
    N = size(arr,1)
    for ii in 0:W4:N-W4
        @turbo for i in 1:W4
            anorm2 = zero(eltype(arr))
            for j in axes(arr, 2)
                anorm2 += arr[i+ii, j]^2
            end
            buffer[i] = sqrt(anorm2)
        end
        @turbo for i in 1:W4
            for j in axes(arr, 2)
                u[i+ii, j] = arr[i+ii, j] / buffer[i]
            end
        end
    end
    @inbounds @fastmath for i in (N & -W4)+1:N
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end
        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end

using StaticArrays, LinearAlgebra
normalize_sarray(x) = x ./ norm.(x)

x = rand(50_000,3);
x2 = StrideArray(x, (50_000, StaticInt(3)));
y = copy(reinterpret(reshape, SVector{3,Float64}, x'));

normalize_by_row_v8(x) ≈ normalize_by_row_v8(x2) ≈ reinterpret(reshape, Float64, normalize_sarray(y))'

@benchmark normalize_by_row_v8($x)
@benchmark normalize_sarray($y)
@benchmark normalize_by_row_v8($x2)
``````

I get

``````julia> normalize_by_row_v8(x) ≈ normalize_by_row_v8(x2) ≈ reinterpret(reshape, Float64, normalize_sarray(y))'
true

julia> @benchmark normalize_by_row_v8($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   78.538 μs …   1.461 ms  ┊ GC (min … max):  0.00% … 79.56%
Time  (median):      93.183 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   114.820 μs ± 118.901 μs  ┊ GC (mean ± σ):  16.23% ± 13.86%

▆█▅▃▁                                                         ▁
███████▅▄▁▃▁▁▁▁▁▃▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▆██▇▆▆▆▆ █
78.500 μs     Histogram: log(frequency) by time    789.000 μs <

Memory estimate: 1.14 MiB, allocs estimate: 2.

julia> @benchmark normalize_sarray($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  152.999 μs …   1.299 ms  ┊ GC (min … max):  0.00% … 67.91%
Time  (median):     155.702 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   180.734 μs ± 118.867 μs  ┊ GC (mean ± σ):  10.32% ± 12.76%

█▅▃▂▁                                                         ▁
██████▇▆▇▅▄▅▁▁▁▁▁▃▄▇▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▆▇▇▆▇▇▇ █
153.000 μs    Histogram: log(frequency) by time    903.000 μs <

Memory estimate: 1.14 MiB, allocs estimate: 2.

julia> @benchmark normalize_by_row_v8($x2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   66.263 μs …   1.288 ms  ┊ GC (min … max):  0.00% … 84.27%
Time  (median):      93.333 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   113.447 μs ± 113.662 μs  ┊ GC (mean ± σ):  15.93% ± 13.89%

▁▅█▃▂▁                                                        ▁
███████▆▆▅▁▁▁▄▃▁▁▁▄▇▅▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇██▇▇▆ █
66.300 μs     Histogram: log(frequency) by time    795.000 μs <

Memory estimate: 1.14 MiB, allocs estimate: 2.
``````

Of course, adding parallelism (e.g. with `@batch`) to the outer `for ii in 0:W4:N-W4` loop should speed it up even more, but in that case we’d need to create the buffer inside the loop, to be thread safe.
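
As a simpler illustration of the threading idea (not the buffered `@batch` version described above), a plain `Threads.@threads` over rows is thread safe as written, because each iteration touches only its own row:

```julia
function normalize_by_row_threaded!(u, arr)
    # Each thread handles a disjoint set of rows, so no shared state is needed.
    Threads.@threads for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end
        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end
```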

If you are working with particles or points (the dimension 3 suggests so), StaticArrays makes this very natural:

``````
using BenchmarkTools
using StaticArrays
using LinearAlgebra

struct Point <: FieldVector{3,Float64}
    x::Float64
    y::Float64
    z::Float64
end

points = rand(Point,50_000)

function normalize_points!(points)
    @inbounds @simd for i in eachindex(points)
        points[i] = points[i]/norm(points[i])
    end
end
``````

which gives here:

``````
julia> @btime normalize_points!($points)
  139.368 μs (0 allocations: 0 bytes)
``````

That is roughly equivalent to the single-liner above (`mynormalize!(arr) = (arr ./= norm.(arr))`), so use whichever you feel more comfortable with (keeping in mind that the loop often lets you do things that are hard to write as one-liners).