Repeatedly summing displaced array, lots of allocations and slow

I have two one-dimensional arrays of length nT:

  • greensfunc is an array of Float64,
  • X is an array of Float64 zeros.

I want to repeatedly multiply greensfunc by a random number, and add the result to X, but for each sum only consider the last n-1-elements of X (and first n-1 of greensfunc):

using TimerOutputs

function summat(nT)
    
    @timeit to "Initlialising vector" T  = LinRange(0,1,nT);
    @timeit to "Initlialising vector" X = zeros(nT);

    @timeit to "Greens function" greensfunc = exp.(-T/0.5) .* sin.(2*pi*T/0.5)

    @timeit to "Integrating" begin
    for n in 1:nT
            X[n:end] =  (X[n:end]) + (sqrt(1/float(nT))*randn()).*(greensfunc[1:(end-n+1)])
    end
    end
    
    return X
    
end

to = TimerOutput();
@time sol = summat(10000);
show(to)

The loop however has lots of temporary allocations. I can make it a bit faster using some macros:

@inbounds X[n:end] = (@view X[n:end]) .+ (sqrt(1/float(nT))*randn()).*(@view greensfunc[1:(end-n+1)])

Then the allocations in the Integrating step decreases to 383MB. However this still seems a lot, and I think is the main slowdown for this code. I would like to make this much faster so I can increase nT by a factor 10^4, is there a better way?

.=

maybe? to allow fusion happen

2 Likes

Or use:

Which is the same suggested, but easier perhaps.

1 Like

Putting @. in front will also broadcast the rand() which is not the same.

So, @Ruvi_Lecamwasam, make sure to dot the assignment operator, otherwise there will be allocations on the right hand side, despite the views. And put @views in front of the whole expression.

Also, don’t do this

It doesn’t matter for performance, but it’s unidiomatic, just write 1/nT, this will produce a float, even though nT is an integer.

3 Likes

Here’s the effect:

function summat(nT)
    T  = LinRange(0,1,nT);
    X = zeros(nT);
    greensfunc = exp.(-T/0.5) .* sin.(2*pi*T/0.5)
    for n in 1:nT
            X[n:end] =  (X[n:end]) + (sqrt(1/float(nT))*randn()).*(greensfunc[1:(end-n+1)])
    end
    return X
end

function summat_(nT)
    T  = LinRange(0,1,nT);
    X = zeros(nT);
    greensfunc = exp.(-T ./ 0.5) .* sin.(2π .* T ./ 0.5)
    for n in 1:nT
        @views X[n:end] .= X[n:end] .+ (randn() / sqrt(nT)) .* greensfunc[1:(end-n+1)]
    end
    return X
end

Benchmarks:

jl> using BenchmarkTools

jl> @btime summat(10000);
  209.422 ms (71812 allocations: 1.49 GiB)

jl> @btime summat_(10000);
  7.363 ms (4 allocations: 156.41 KiB)
5 Likes

Although @DNF solution is perfect, I will just insist in the possibility of using @., because, personally, I find having to get all those points right very bug-prone. So, taking into consideration the fact that you do not want to broadcast the rand() call, you can use:

julia> function summat_(nT)
           T  = LinRange(0,1,nT);
           X = zeros(nT);
           greensfunc = exp.(-T ./ 0.5) .* sin.(2π .* T ./ 0.5)
           for n in 1:nT
               λ = rand()/sqrt(nT)
               @views @. X[n:end] = X[n:end] + λ * greensfunc[1:(end-n+1)] # no more dots
           end
           return X
       end

And, since we are there, why not use a loop and @tturbo (just for fun):

julia -t auto # 4 cores

julia> using LoopVectorization, BenchmarkTools

julia> @btime summat_(10000);
  9.129 ms (4 allocations: 156.41 KiB)

julia> function summat_loop(nT)
           Random.seed!(123)
           T  = LinRange(0,1,nT);
           X = zeros(nT);
           greensfunc = exp.(-T ./ 0.5) .* sin.(2π .* T ./ 0.5)
           for n in 1:nT
               λ = rand()/sqrt(nT)
               @tturbo for i in 1:length(greensfunc)-n+1
                 X[n+i-1] = X[n+i-1] + λ * greensfunc[i]
               end
           end
           return X
       end
summat_loop (generic function with 1 method)

julia> @btime summat_loop(10000);
  5.168 ms (8 allocations: 156.55 KiB)


edit: In this case you can just add @tturbo to the original broadcasted operation as well:

@views @tturbo X[n:end] .= X[n:end] .+ (randn() / sqrt(nT)) .* greensfunc[1:(end-n+1)]

and it works just as fine.

2 Likes