Improving function performance - Broadcasting vs Loops

jpmorr · August 17, 2021, 9:12am

I’m slowly learning some of the tricks that can make julia code quicker. Coming from python I still find it strange to write a loop instead of some vectorised but it seems that sometimes that the loop is the quicker option.

I’ve found this example where I’m just calculating the norm of vector when passing in a 3D array of many vectors, i.e. cvec_1 = rand(50000,3).

I started with a broadcasted function and this was realtively quick:

function normalize_by_row(arr)
   arr ./ sqrt.(sum(arr.^2, dims=2))
end

@btime nrm = normalize_by_row($cvec_1);
  898.900 μs (10 allocations: 2.94 MiB)

But I also wanted to go check the looped possibilities so I have tried a few different versions using what I’ve picked up about @views and @inbounds:

function normalize_by_row_v2(arr)
    norms = similar(arr)
    for i in axes(arr,1)
       @views norms[i,:] = arr[i,:] / sqrt(sum(arr[i,:].^2))
    end
    return norms
 end

 function normalize_by_row_v3(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    for i in axes(arr,1)
       @views norms[i] = sqrt(sum(arr[i,:].^2))
    end
    
    return arr ./ norms
 end

 function normalize_by_row_v4(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    temp = arr.^2
    for i in axes(arr,1)
       @views norms[i] = sqrt(sum(temp[i,:]))
    end
    
    return arr ./ norms
 end

 function normalize_by_row_v5(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    temp = arr.^2
    for i in axes(arr,1)
       @inbounds @views norms[i] = sqrt(sum(temp[i,:]))
    end
    
    return arr ./ norms
 end

I guess I’ve had varying degrees of success, with the first one being both slow and memory hungry and the final looped version being nearly 20% faster than the broadcasted one:

julia> @btime nrm2 = normalize_by_row_v2($cvec_1);
  4.016 ms (110166 allocations: 13.03 MiB)

julia> @btime nrm3 = normalize_by_row_v3($cvec_1);
  1.999 ms (55086 allocations: 7.56 MiB)

julia> @btime nrm4 = normalize_by_row_v4($cvec_1);
  782.800 μs (6 allocations: 2.94 MiB)

julia> @btime nrm5 = normalize_by_row_v5($cvec_1);
  748.200 μs (6 allocations: 2.94 MiB)

The final one seems to be pretty good, but I’m wondering are there any further improvements that could be made to such a function?

I aslo thought I could use LoopVectorization and @turbo in v5 of the function, but it doesn’t really work and throws an error:

ERROR: ArgumentError: invalid index: VectorizationBase.MM{4, 1, Int64}<1, 2, 3, 4> of type VectorizationBase.MM{4, 1, Int64}

So can I get any faster? Is there some other important julia concept that I’ve missed?

Thanks in advance!

DNF · August 17, 2021, 9:46am

Yes. Julia arrays are column major, so you should work along columns, not rows.

lmiq · August 17, 2021, 9:47am

In the first one, the allocations are probably because of an intermediate vector created inside sum. Try this:

(I’m not sure if that works with the dims keyword)

In all cases probably you can avoid the creation of intermediary arrays completely by computing the norms one by one in a loop over rows.

(If you can use a row major structure and normalize by column, as pointed above, that will make a huge difference)

jipolanco · August 17, 2021, 9:56am

Don’t be afraid to use for loops. For example:

function normalize_by_row_v7(arr)
    u = similar(arr)
    @inbounds for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end
        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
end

# Original function using broadcasting
@btime normalize_by_row($cvec_1);
  1.033 ms (10 allocations: 2.67 MiB)

# Function defined above
@btime normalize_by_row_v7($cvec_1);
  472.210 μs (2 allocations: 1.14 MiB)

You can probably do even better by using @turbo. And as @DNF said, this would probably be much faster if the vectors were aligned along the first dimension, as in cvec_1 = rand(3, 50000).

DNF · August 17, 2021, 9:57am

Work along columns, and try to avoid allocations:

function normalize_loop!(arr)
    @views for i in axes(arr, 2)
        v = sqrt(sum(abs2, arr[:, i]))
        arr[:, i] ./= v
    end
    return arr
end

This has zero allocations.

jpmorr · August 17, 2021, 10:55am

This is quick, but it’s not returning the correct answer unless I transpose the array before calling the function, which is suspect would make it slower than @jipolanco’s solution. I need to have a better look after lunch to see the difference in these.

It seems that I should possibly just think more about rewriting everything to be column-major in future though.

Also, should there be an @inbounds for the for loop here?

DNF · August 17, 2021, 10:56am

Yes, my point is that you should make sure that your data is column major in the first place.

DNF · August 17, 2021, 11:00am

I tried @inbounds but it made no difference.

jpmorr · August 17, 2021, 11:05am

This is much quicker than mine, but unfortunately @turbo doesn’t work. Have I done something wrong?


 function normalize_by_row_v7(arr)
    u = similar(arr)
    @turbo @inbounds for i in axes(arr, 1)
        anorm2 = zero(eltype(arr))
        for j in axes(arr, 2)
            anorm2 += arr[i, j]^2
        end

        anorm = sqrt(anorm2)
        for j in axes(arr, 2)
            u[i, j] = arr[i, j] / anorm
        end
    end
    return u
end



julia> ERROR: LoadError: LoadError: LoadError: LoadError: type NewvarNode has no field args
Stacktrace:
 [1] getproperty(x::Core.NewvarNode, f::Symbol)
   @ Base .\Base.jl:33
 [2] substitute_broadcast(q::Expr, mod::Symbol, inline::Bool, u₁::Int8, u₂::Int8, v::Int8, threads::Int64, warncheckarg::Int64)
   @ LoopVectorization C:\Users\jpmor\.julia\packages\LoopVectorization\ZPTim\src\constructors.jl:47

Seem like it possibly can’t be made any quicker. Not bad though, already about 4x quicker than the existing python implementation. I’m slowly beginning to think it would be worth all the effort to convert everything to julia!!

DNF · August 17, 2021, 11:13am

Here’s a pretty quick version, if you are willing to use StaticArrays. This is useful if you know that one of the dimensions is a fixed number, for example if you are working in 3D space:

using StaticArrays, LinearAlgebra

mynormalize!(arr::AbstractArray{<:AbstractVector}) = (arr ./= norm.(arr))

X = rand(SVector{3, Float64}, 50_000);

jl> @btime mynormalize!($X);
  116.500 μs (0 allocations: 0 bytes)

jpmorr · August 17, 2021, 11:20am

Yes, I can nearly always one dimension so it seems that StaticArrays is something I need to learn more about too.

This is nearly 10x quicker than my well optimised python implementation so this is more than I hoped for.

DNF · August 17, 2021, 11:22am

You can also take a look at https://github.com/mateuszbaran/HybridArrays.jl which allows you to keep the data as a matrix. But frankly, I think that for storing a set of coordinates, it makes more sense to keep them as a vector of vectors, in which case StaticArrays are brilliant.

DNF · August 17, 2021, 11:33am

Another possibility: I cannot get any speedup from @tturbo, and Threads.@threads need longer vectors to become efficient on my computer. But if I directly use Polyester.jl:

using Polyester
function mynormalize_p!(arr::AbstractArray{<:AbstractVector})
    @batch for i in eachindex(arr)
        arr[i] = arr[i] ./ norm(arr[i])
    end
    return arr
end

jl> @btime mynormalize_p!($X);
  26.200 μs (1 allocation: 16 bytes)

with 8 threads.

Elrod · August 17, 2021, 11:45am

I’ll fix this in a future release, but for now @turbo requires “perfectly nested” loops. Meaning that for now it requires

for i in I
    ...
    for j in J
        ...
    end
    ...
end

instead of

for i in I
    ...
    for j in J
        ...
    end
    ...
    for j in J
        ...
    end
    ...
end

So you can try @fastmath, or add either @turbo or @simd to the inner loops.

Or, this

using LoopVectorization, StrideArraysCore

normalize_by_row_v8(arr) = normalize_by_row_v8!(similar(arr), arr)
function normalize_by_row_v8!(
  u::AbstractMatrix{T}, arr::AbstractMatrix{T}
  ) where {T}
  W = LoopVectorization.pick_vector_width(eltype(arr))
  W4 = W * LoopVectorization.StaticInt(4)
  buffer = StrideArray{T}(undef, (W4,))
  N = size(arr,1)
  for ii in 0:W4:N-W4
    @turbo for i in 1:W4
      anorm2 = zero(eltype(arr))
      for j in axes(arr, 2)
        anorm2 += arr[i+ii, j]^2
      end
      buffer[i] = sqrt(anorm2)
    end
    @turbo for i in 1:W4
      for j in axes(arr, 2)
        u[i+ii, j] = arr[i+ii, j] / buffer[i]
      end
    end
  end
  @inbounds @fastmath for i in (N & -W4)+1:N
    anorm2 = zero(eltype(arr))
    for j in axes(arr, 2)
      anorm2 += arr[i, j]^2
    end
    anorm = sqrt(anorm2)
    for j in axes(arr, 2)
      u[i, j] = arr[i, j] / anorm
    end
  end
  return u
end
using StaticArrays, LinearAlgebra
normalize_sarray(x) = x ./ norm.(x)

x = rand(50_000,3);
x2 = StrideArray(x, (50_000, StaticInt(3)));
y = copy(reinterpret(reshape, SVector{3,Float64}, x'));

normalize_by_row_v8(x) ≈ normalize_by_row_v8(x2) ≈ reinterpret(reshape, Float64, normalize_sarray(y))'

@benchmark normalize_by_row_v8($x)
@benchmark normalize_sarray($y)
@benchmark normalize_by_row_v8($x2)

I get

julia> normalize_by_row_v8(x) ≈ normalize_by_row_v8(x2) ≈ reinterpret(reshape, Float64, normalize_sarray(y))'
true

julia> @benchmark normalize_by_row_v8($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   78.538 μs …   1.461 ms  ┊ GC (min … max):  0.00% … 79.56%
 Time  (median):      93.183 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   114.820 μs ± 118.901 μs  ┊ GC (mean ± σ):  16.23% ± 13.86%

  ▆█▅▃▁                                                         ▁
  ███████▅▄▁▃▁▁▁▁▁▃▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▆██▇▆▆▆▆ █
  78.500 μs     Histogram: log(frequency) by time    789.000 μs <

 Memory estimate: 1.14 MiB, allocs estimate: 2.

julia> @benchmark normalize_sarray($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  152.999 μs …   1.299 ms  ┊ GC (min … max):  0.00% … 67.91%
 Time  (median):     155.702 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   180.734 μs ± 118.867 μs  ┊ GC (mean ± σ):  10.32% ± 12.76%

  █▅▃▂▁                                                         ▁
  ██████▇▆▇▅▄▅▁▁▁▁▁▃▄▇▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▅▆▆▇▇▆▇▇▇ █
  153.000 μs    Histogram: log(frequency) by time    903.000 μs <

 Memory estimate: 1.14 MiB, allocs estimate: 2.

julia> @benchmark normalize_by_row_v8($x2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   66.263 μs …   1.288 ms  ┊ GC (min … max):  0.00% … 84.27%
 Time  (median):      93.333 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   113.447 μs ± 113.662 μs  ┊ GC (mean ± σ):  15.93% ± 13.89%

  ▁▅█▃▂▁                                                        ▁
  ███████▆▆▅▁▁▁▄▃▁▁▁▄▇▅▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇██▇▇▆ █
  66.300 μs     Histogram: log(frequency) by time    795.000 μs <

 Memory estimate: 1.14 MiB, allocs estimate: 2.

Of course, adding parallelism (e.g. with @batch) to the outer for ii in 0:W4:N-W4 loop should speed it up even more, but in that case we’d need to create the buffer inside the loop, to be thread safe.

lmiq · August 17, 2021, 1:02pm

If you are working with particles or points (the dimension 3 suggests so), StaticArrays makes this very natural:

using BenchmarkTools
using StaticArrays
using LinearAlgebra

struct Point <: FieldVector{3,Float64}
  x::Float64
  y::Float64
  z::Float64
end

points = rand(Point,50_000)

function normalize_points!(points)
    @inbounds @simd for i in eachindex(points)
        points[i] = points[i]/norm(points[i])
    end
end

which gives here:

julia> @btime normalize_points!($points)
  139.368 μs (0 allocations: 0 bytes)

That is roughly equivalent to the single-liner above (mynormalize!(arr) = (arr ./= norm.(arr)), use what you feel more comfortable with (keeping in mind that many times the loop allows you do things that are hard to write as single liners).

Topic		Replies	Views
Optimizing iteration over slices of multiple matrices Performance question , array	14	1893	December 30, 2018
Performance improvements for simple molecular surface caclulation Performance	15	633	September 14, 2020
When should I write loops or vectorised calls? General Usage	17	1789	December 1, 2020
Orders of magnitude runtime difference in row-wise norm Performance	5	362	March 21, 2023
Arithmetic broadcasting in Julia 5x slower than MATLAB Performance	17	1071	May 26, 2022

Improving function performance - Broadcasting vs Loops

Related topics