Question: Improve the speed of for loop

I have written a massive nested ‘for loop’ as follows:

Meta_start=390; Meta_end=400; Meta_div=trunc(Int,((Meta_end-Meta_start)*100)); #temporary
Meta_cal=LinRange(Meta_start,Meta_end,Meta_div); k=1:1:Meta_div #::Vector{Int64}
I_sharp=Array{Float64,2}(undef,150,27);
I_broad=Array{Float64,1}(undef,Meta_div);

Re=rand(5,11); FCF=rand(5,11); IB=0.1;
Meta=rand(150,27,55); Meta_shift=zeros(55);

for v_C=1:1:5
    for v_B=1:1:11
        numb_v=(v_B+11*(v_C-1))
        for bra=1:1:27
            for J1=1:1:150 # 150
                I_sharp[J1,bra]=Re[v_C,v_B]^2+*(1/(Meta[J1,bra,numb_v].-Meta_shift[numb_v])^4)*FCF[v_C,v_B]
                I_broad[k]+=I_sharp[J1,bra].*exp.(-2*(Meta_cal[k].-(Meta[J1,bra,numb_v].-Meta_shift[numb_v])).^2/IB)
            end
        end
    end
end

It took 3.448 seconds with 8691813 allocations and 10.30 GiB on my computer.
I tried to use @inbounds, @simd, and mul!(C,A,B), but they didn’t work, or I guess I used them the wrong way.
How can I improve the speed of this code?

Some parts can probably be vectorized using loop vectorization (LoopVectorization.jl). The a.+b or similar vectorizations are convenient but not efficient; do not vectorize the code this way if you care about performance. Also, see
Transducers.jl or other packages from JuliaFolds to efficiently do the reduction of I_broad.
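
For example, a minimal self-contained sketch of the difference (a, b, and out are made-up stand-ins for your arrays):

a = rand(10^6); b = rand(10^6); out = zeros(10^6)

out = a .+ b                   # convenient, but allocates a fresh result array every time
out .= a .+ b                  # fused broadcast: writes into out, no allocation

function add!(out, a, b)       # an explicit loop does the same thing, also allocation-free
    for i in eachindex(out, a, b)
        out[i] = a[i] + b[i]
    end
    return out
end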

1 Like

are you benchmarking in global scope? put this in a function.

7 Likes

Hardcoding iteration limits isn’t a great idea; you should use axes or eachindex instead. Anyway, use 1:5, not 1:1:5, otherwise the compiler cannot be certain about the step size in the type domain.

But most importantly, put your code in a function, and avoid global variables. And definitely read the performance tips.
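
A minimal sketch of why the function/global point matters (sum_global and sum_arg are just illustrative names):

x = rand(10^6)          # non-constant global: its type can change at any time

function sum_global()   # reads the global x: type-unstable, slow, allocates inside the loop
    s = 0.0
    for v in x
        s += v
    end
    return s
end

function sum_arg(x)     # x is an argument with a concrete type: fast, no allocations
    s = 0.0
    for v in x
        s += v
    end
    return s
end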

4 Likes

You have +* in one of your expressions. Not sure what you intended here, but I believe the asterisk has no effect, since it is followed by parentheses containing a single quantity, which is interpreted as the product of a single number, i.e., the number itself.
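
In other words, a tiny illustration:

*(5)      # the one-argument method of * just returns its argument: 5
2+*(5)    # so this is simply 2 + 5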

Aside: As I understand it, all of these terms are scalars, so the dots do nothing. (They mostly don’t hurt, but you might as well use *, -, ^ instead of .*, .-, .^.)

This code has several problems but one is extremely serious: I_broad is never initialized. It is incremented in the innermost loop but since it is never initialized, it contains random garbage. This should be fixed before you ask for any more help with optimizing its speed.

I think it’s okay, actually:

arr = rand(1000000)
function test_range_1(arr)
    ans = 0.0              # Float64 accumulator, so its type doesn't change mid-loop
    for i in 1:1:1000000   # StepRange
        ans += arr[i]
    end
    ans
end
function test_range_2(arr)
    ans = 0.0
    for i in 1:1000000     # UnitRange
        ans += arr[i]
    end
    ans
end

using BenchmarkTools

@btime test_range_1($arr)  # $ interpolates the global so @btime measures only the call
@btime test_range_2($arr)

Originally, this code was inside a function, and the benchmark time I posted was measured inside that function.

Yeah, I’m reading the performance tips, and unfortunately, using 1:5 did not make my code faster :(.

I minimized the vectorization following your comment, but the effect is small.
And thank you for introducing the interesting package.
It seems that Transducers.Scan and Transducers.Iterated could be applied to my code.
Could you give me an example? I have no idea how to use them on I_broad.

Thanks for pointing out the typo. The original intent was *.

It’s more about style than performance, though. 1:5 is nicer. But, really, you should use eachindex or axes. Hardcoding iteration limits is very error-prone.
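
A minimal, self-contained sketch of the idea (A is just a stand-in for an array like I_sharp):

A = rand(150, 27)
function total_intensity(A)
    total = 0.0
    for j in axes(A, 2), i in axes(A, 1)   # limits come from the array itself, not hardcoded 1:27 / 1:150
        total += A[i, j]
    end
    return total
end
total_intensity(A) ≈ sum(A)                # true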

1 Like

I_broad=zeros(Float64, Meta_div) # zeros, not random memory
...
# fixed .* typo, got rid of dotted operator for scalars (no difference)
I_sharp[J1,bra]=Re[v_C,v_B]^2*(1/(Meta[J1,bra,numb_v]-Meta_shift[numb_v])^4)*FCF[v_C,v_B]

# @view prevents .+= from allocating I_broad[k] on the right side
# dot all the functions over arrays to fuse into 1 in-place loop
@view(I_broad[k]) .+= I_sharp[J1,bra].*exp.(-2 .*( Meta_cal[k] .-(Meta[J1,bra,numb_v]-Meta_shift[numb_v])).^2 ./IB)

These changes reduce the allocations to 8 and 1.74 MiB in a function on v1.8.5. I’m not sure what 2 of them are for; only 6 things are allocated before the loop. (EDIT: it seems to be an effect of filling the arrays. If you omit the loop and change all undef to zeros, you get those allocation numbers; if you omit the loop and use the original 2 undef lines, you get 5 allocations and 1.701 MiB.) I just noticed that k was not a scalar index, so I had to pay attention to where any arrays were being allocated, e.g. the LinRange at the unfused .^. Also, I’m not sure why you’re slicing with k=1:1:Meta_div when the arrays seem to be of length Meta_div already.
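
A small self-contained illustration of the slicing point (x and y are just stand-ins):

x = rand(1000); y = rand(1000)

x[1:500] .+= y[1:500]                  # the slices on the right-hand side allocate copies
@view(x[1:500]) .+= @view(y[1:500])    # views reuse the same memory: no copies
x .+= y                                # no slicing at all: also fully in-place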

2 Likes

The code performance improved from 1.97 sec (1336598 allocations: 10.12 GiB) to 1.32 sec (8 allocations: 1.74 MiB)!
The reason for slicing with k was the vectorization of I_broad, and it seems unnecessary.

I_broad+= I_sharp[J1,bra].*exp.(-2 .*( Meta_cal .-(Meta[J1,bra,numb_v]-Meta_shift[numb_v])).^2 ./IB)

It reduced the time but still requires a lot of memory and allocations.
In this case, I think I need to use something other than @view.

Thank you for the detailed example, Benny. it is really helpful.

The reason @view is used for slices is that I_broad[k] on the right side of I_broad[k] .= I_broad[k] .+ ... allocates a copy. You don’t need @view if you don’t slice, because I_broad on the right side of I_broad .= I_broad .+ ... reuses the array, so no allocation is needed. Try this line and see if you reduce the allocations again:

I_broad .+= I_sharp[J1,bra].*exp.(-2 .*( Meta_cal .-(Meta[J1,bra,numb_v]-Meta_shift[numb_v])).^2 ./IB)

I’m getting the feeling that your code does so much number crunching that it vastly outweighs the time spent allocating and deallocating, so reducing allocations has a small improvement on runtime.
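
For reference, a tiny stand-alone comparison of the two spellings (a and b are made-up arrays):

using BenchmarkTools
a = rand(1000); b = rand(1000)

add_copy(a, b)     = a + b          # builds a brand-new array on every call
add_inplace!(a, b) = (a .+= b; a)   # fused broadcast writes into a: no allocation

@btime add_copy($a, $b);       # 1 allocation
@btime add_inplace!($a, $b);   # 0 allocations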

1 Like

I have tried the code you suggested, and it shows a 20% speed-up compared to mine.
Even though the number of allocations did not decrease, memory did decrease 12 times.
A single period ‘.’ made this difference (reducing both processing time and memory).
It seems trivial, but important… I’ll have to pay attention to it when optimizing.
Thank you. :slight_smile:

Odd, the .+= compared to += would save an allocation of an array on the right side. I quickly ran the code with the I_broad+= line versus the I_broad .+= line, and it went from 445510 allocations to the 8 allocations (1.74 MiB), as expected.

I cannot explain the memory decreasing by 12 times; that memory is all allocated before the loop, so did you edit the arrays to be smaller?

1 Like

Maybe it is because, in the real code, the data is read using readdlm() rather than generated with rand().
But I don’t know why the two cases are not the same; both of them are Float64 arrays of the same size (readdlm demands more allocations than rand()).
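
(Aside, an assumption not confirmed in this thread: readdlm returns a Matrix{Any} if anything in the file does not parse as a number, which would make every access in the hot loop type-unstable. Forcing the element type rules that out:)

using DelimitedFiles
# "Re.txt" is a hypothetical file name standing in for the real data file.
Re = readdlm("Re.txt", Float64)   # errors on bad entries instead of silently giving a Matrix{Any}
eltype(Re)                        # Float64, same as rand(5,11)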

julia> @benchmark f1!($I_sharp, $I_broad, $Re, $Meta, $Meta_shift, $FCF, $Meta_cal, $k, $IB)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 2.981 s (0.00% GC) to evaluate, with a memory estimate of 0 bytes, over 0 allocations.

julia> @benchmark f2!($I_sharp, $I_broad, $Re, $Meta, $Meta_shift, $FCF, $Meta_cal, $k, $IB)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  120.850 ms … 121.046 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.854 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   120.877 ms ±  63.380 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█                                                             
  ██▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  121 ms           Histogram: frequency by time          121 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

f2! uses @turbo.

Meta_start=390; Meta_end=400; Meta_div=trunc(Int,((Meta_end-Meta_start)*100)); #temporary
Meta_cal=LinRange(Meta_start,Meta_end,Meta_div); k=1:Meta_div #::Vector{Int64}
I_sharp=Array{Float64,2}(undef,150,27);
I_broad=Array{Float64,1}(undef,Meta_div);

Re=rand(5,11); FCF=rand(5,11); IB=0.1;
Meta=rand(150,27,55); Meta_shift=zeros(55);

function f1!(I_sharp, I_broad, Re, Meta, Meta_shift, FCF, Meta_cal, K, IB)
  @inbounds @fastmath for v_C=1:5
    for v_B=1:11
      numb_v=(v_B+11*(v_C-1))
      for bra=1:27
        for J1=1:150 # 150
I_sharp[J1,bra]=Re[v_C,v_B]^2*(1/(Meta[J1,bra,numb_v]-Meta_shift[numb_v])^4)*FCF[v_C,v_B] # the +* typo fixed to *, per the intent above
          for k = K
            I_broad[k]+=I_sharp[J1,bra]*exp(-2*(Meta_cal[k]-(Meta[J1,bra,numb_v]-Meta_shift[numb_v]))^2/IB)
          end
        end 
      end
    end
  end
end
using LoopVectorization
function f2!(I_sharp, I_broad, Re, Meta, Meta_shift, FCF, Meta_cal, K, IB)
  @turbo for v_C=1:5
    for v_B=1:11
      numb_v=(v_B+11*(v_C-1))
      for bra=1:27
        for J1=1:150 # 150
I_sharp[J1,bra]=Re[v_C,v_B]^2*(1/(Meta[J1,bra,numb_v]-Meta_shift[numb_v])^4)*FCF[v_C,v_B] # the +* typo fixed to *, per the intent above
          for k = K
            I_broad[k]+=I_sharp[J1,bra]*exp(-2*(Meta_cal[k]-(Meta[J1,bra,numb_v]-Meta_shift[numb_v]))^2/IB)
          end
        end 
      end
    end
  end
end

Note that LoopVectorization really wants UnitRange over StepRange, at least for k.
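
For reference, the type difference is easy to see:

typeof(1:1000)    # UnitRange{Int64} – the unit step is part of the type
typeof(1:1:1000)  # StepRange{Int64, Int64} – the step is only a run-time value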

1 Like