pre-allocating array is slow

I tried three ways of calculating the first difference of a series. To my disappointment, the third method is 100 times as slow as the other two. So, in such a case, is using one of the first two methods better?

```julia
using BenchmarkTools
using Random

A = randn(1000)

println("B1")
@btime B1 = A[2:end] - A[1:end-1]

println("B2")
@btime B2 = @view(A[2:end]) - @view(A[1:end-1])

println("B3")
@btime begin
    B3 = Array{Float64}(undef, length(A)-1)
    @inbounds @simd for i in 1:length(A)-1
        B3[i] = A[i+1] - A[i]
    end
end
```

Output:

```
B1
  1.330 μs (8 allocations: 23.92 KiB)
B2
  646.524 ns (8 allocations: 8.14 KiB)
B3
  121.100 μs (5467 allocations: 108.97 KiB)
```

Welcome! Two things:

  • Please mark your code as such with ``` backtick blocks so it formats nicely here
  • Take a look at the Performance Tips manual page and in particular the very first section: You want to ensure the code you benchmark is inside a function. Do that and it should be the fastest of all three options.

What Matt said:

```julia
julia> function test(A)
           B = Array{Float64}(undef, length(A) - 1)
           @inbounds @simd for i ∈ length(A) - 1
               B[i] = A[i+1] - A[i]
           end
           return B
       end
test (generic function with 1 method)

julia> @btime test($A);
  111.547 ns (1 allocation: 7.94 KiB)
```

But also note that B2 is pretty much the same speed if it’s in a function, too. Pre-allocating here would only be a big win if you can do it outside of your performance-critical (or benchmarked) section of code.
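To illustrate that point, here is a sketch of pre-allocating outside the hot section by reusing a buffer across calls. Note that `diff!` is just a name chosen here for illustration, not a function in Base:

```julia
# Sketch: allocate the output buffer once, then reuse it.
# `diff!` is a hypothetical helper name, not part of Base.
function diff!(B, A)
    @inbounds @simd for i in eachindex(B)
        B[i] = A[i+1] - A[i]
    end
    return B
end

A = randn(1000)
B = Vector{Float64}(undef, length(A) - 1)  # allocated once, outside the hot loop

diff!(B, A)  # subsequent calls allocate nothing
```

If `diff!` is called many times inside a performance-critical loop, the single up-front allocation amortizes to zero.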


All tests should be inside functions and the arguments should be interpolated (i.e., the `$` in `@btime test($A);`).

Ok. Thanks.

I tested the following four functions:

```julia
A = randn(1000)

function test1(A)
    B1 = A[2:end] - A[1:end-1]
    return B1
end

function test2(A)
    B2 = @view(A[2:end]) - @view(A[1:end-1])
    return B2
end

function test3(A)
    B = Array{Float64}(undef, length(A) - 1)
    @inbounds @simd for i ∈ length(A) - 1
        B[i] = A[i+1] - A[i]
    end
    return B
end

function test4(A)
    B = Array{Float64}(undef, length(A) - 1)
    for i ∈ length(A) - 1
        B[i] = A[i+1] - A[i]
    end
    return B
end
```

Then I tested each of them using commands such as `@btime test1($A)`. The results are:

```julia
julia> @btime test1($A)
  1.200 μs (3 allocations: 23.81 KiB)

julia> @btime test2($A)
  512.315 ns (1 allocation: 7.94 KiB)

julia> @btime test3($A)
  162.712 ns (1 allocation: 7.94 KiB)

julia> @btime test4($A)
  160.805 ns (1 allocation: 7.94 KiB)
```

So, the pre-allocating one is the fastest. It looks like `@inbounds @simd` doesn't improve performance.


By the way, there is a bug in the loop functions. The `for i in x` syntax needs something iterable. Numbers happen to be iterable in Julia (they iterate exactly once, yielding themselves), so the code actually runs, but the output is wrong, i.e. `test1(A) == test2(A) != test3(A) != test4(A)`:

```julia
julia> for i ∈ 100
           println(i)
       end
100
```

After replacing it with `for i in eachindex(B)`, I get timings for `test3` and `test4` that are a lot closer to `test2` and also to `diff(A)`, which is exactly what the function does, no?
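For reference, here is what the corrected loop version looks like (a sketch; `test3_fixed` is just a name chosen here to distinguish it from the buggy original):

```julia
# Corrected version of test3: iterate over all valid indices of B,
# not over the single number length(A) - 1.
function test3_fixed(A)
    B = Vector{Float64}(undef, length(A) - 1)
    @inbounds @simd for i in eachindex(B)
        B[i] = A[i+1] - A[i]
    end
    return B
end
```

With the loop fixed, `test3_fixed(A)` produces the same result as `diff(A)` from Base.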


You are right. After I changed

```julia
@inbounds @simd for i ∈ length(A) - 1
```

to

```julia
@inbounds @simd for i ∈ 1:length(A) - 1
```

the results are:

```julia
julia> @btime test1($A)
  1.150 μs (3 allocations: 23.81 KiB)

julia> @btime test2($A)
  479.899 ns (1 allocation: 7.94 KiB)

julia> @btime test3($A)
  623.626 ns (1 allocation: 7.94 KiB)

julia> @btime test4($A)
  602.162 ns (1 allocation: 7.94 KiB)
```

Then I tried

```
c:\> julia -t 2   # changing 2 to 8 doesn't make any difference
```

Function `test4` doesn't improve, but `test3` is as fast as `test2`.

That has nothing to do with `-t 2`; SIMD does not depend on multi-threading.
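To see the distinction: `@simd` is per-thread vectorization, while actual multi-threading uses `Threads.@threads`, which is what the `-t` flag controls. A sketch (this function and its name are illustrative, not from the thread):

```julia
# Hypothetical multi-threaded first difference.
# Threads.@threads splits the iteration range across the threads
# started via `julia -t N`; @simd never does this.
function diff_threads(A)
    B = Vector{Float64}(undef, length(A) - 1)
    Threads.@threads for i in eachindex(B)
        @inbounds B[i] = A[i+1] - A[i]
    end
    return B
end
```

For an array this small (1000 elements), the per-thread scheduling overhead of `Threads.@threads` would likely exceed any gain, which is another reason the `@simd` loop is the right tool here.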