Performance of for loop vs. broadcast

While scaling a vector I tried a number of implementations to find the best performance. Initially I implemented it with global variables and a for loop; then, after reading the performance tips (and Discourse), I realized that was bad. I also wanted to measure the penalty for a try/catch block. I am finding that broadcasting is about two orders of magnitude faster than a for loop, which seems unreasonable. At first I attributed this to global variables, but a purely local implementation didn’t seem to help. I tried looking at the @code_llvm output but am not adept enough to read the results.

Why am I getting these results? The code and timings are below; the number of allocations must be a clue.

Focusing on scalefor() and scalebroadcast() would be appropriate.

using BenchmarkTools
using Revise

# Some trials to ensure that the scaling algorithm is as efficient as possible
# broadcasting is two orders of magnitude faster

n = 2_000_000
function scaletryglobalfor(scale)
    try
        # open file to obtain data
        global data=rand(n)
     
        @inbounds for (j, value) in enumerate(data)
            global data[j] = scale * value;
        end

    finally
        #close file
    end
    data
end

function scaleglobalfor(scale)
    global data=rand(n)
    @inbounds for (i,value) in enumerate(data)
        global data[i] = scale * value
    end
    data
end

function scalefor(scale)
    datalocal=rand(n)
    @inbounds for (i,value) in enumerate(data)
        datalocal[i] = scale * value
    end
    data
end

function scaledotfor(scale)
    datalocal=rand(n)
    @inbounds for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    data
end


function scaletryglobalbroadcast(scale)
    global data=rand(n)
    try
        global data .*= scale
    finally
        # close file
    end
    data
end

function scaleglobalbroadcast(scale)
    global data=rand(n)
    global data .*= scale
    data
end

function scalebroadcast(scale)
    data=rand(n)
    data .*= scale
end


julia> @btime scaletryglobalfor(2.2);
  308.560 ms (13999492 allocations: 350.94 MiB)

julia> @btime scaleglobalfor(2.2);
  307.769 ms (13999492 allocations: 350.94 MiB)

julia> @btime scalefor(2.2);
  320.688 ms (13999492 allocations: 350.94 MiB)

julia> @btime scaledotfor(2.2);
  235.144 ms (11998982 allocations: 228.87 MiB)

julia> @btime scaletryglobalbroadcast(2.2);
  4.922 ms (5 allocations: 15.26 MiB)

julia> @btime scaleglobalbroadcast(2.2);
  4.888 ms (5 allocations: 15.26 MiB)

julia> @btime scalebroadcast(2.2);
  4.863 ms (5 allocations: 15.26 MiB)

If you want to optimize loops, LoopVectorization.jl might be worth looking into, depending on what kind of vector you have, of course.

In both of these you work on datalocal, but then return data, which is presumably a global variable. You also access n, which is a global.

I suggest you restart Julia and get rid of all the globals in your code. They are a curse; don’t use them until you are more experienced, if ever.

Then you will get errors if you make mistakes like that again.
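As a sketch of what that failure mode looks like (the function name is made up; it reproduces the return-the-wrong-variable bug from above):

```julia
# In a fresh session with no global `data`, returning the wrong
# variable fails loudly instead of silently reusing a stale global.
function scalefor_buggy(scale)
    datalocal = rand(10)
    for (i, value) in enumerate(datalocal)
        datalocal[i] = scale * value
    end
    data   # bug: should be `datalocal`
end

# Calling it now raises an error instead of appearing to work:
try
    scalefor_buggy(2.2)
catch err
    err   # err is an UndefVarError for `data`
end
```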

Furthermore, you are also timing the generation of your data by calling rand, which probably takes more time than the loop. Call rand outside your function and pass the data in as an argument.
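A sketch of that setup (the helper name `scaleinplace!` is made up): generate the data once, pass it in, and interpolate it into `@btime` with `$` so BenchmarkTools doesn’t treat it as an untyped global:

```julia
using BenchmarkTools

# Hypothetical helper: scale a preallocated vector in place,
# so rand() is no longer part of what gets timed.
scaleinplace!(data, s) = (data .*= s; data)

data = rand(2_000_000)
@btime scaleinplace!($data, 2.2);   # `$` interpolates the global into the benchmark
```

With the data preallocated and passed in, both the loop and broadcast versions should report zero allocations.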


Your functions using datalocal are returning data ?

Edit: Got beat to it :wink:

You’ve got a lot of different pieces of code here, and many of the versions aren’t doing what you’re expecting. For example, scaledotfor writes to datalocal but returns data, which is a global variable created…somewhere else.

The fundamental issue with all of your loop-based code is that you’re still using the non-constant global variable n. Fixing that resolves the performance issue entirely:

julia> function scaledotfor(scale, n)
           data = rand(n)
           @inbounds for i in 1:length(data)
               data[i] *= scale
           end
           data
       end
scaledotfor (generic function with 2 methods)

julia> function scalebroadcast(scale, n)
         data = rand(n)
         data .*= scale
       end
scalebroadcast (generic function with 1 method)

julia> @btime scaledotfor(2.2, 2_000_000);
  2.487 ms (2 allocations: 15.26 MiB)

julia> @btime scalebroadcast(2.2, 2_000_000);
  2.568 ms (2 allocations: 15.26 MiB)

Also try with LoopVectorization

using LoopVectorization
function scaledotfor(scale, n)
    data = rand(n)
    @turbo for i in 1:length(data)
        data[i] *= scale
    end
    data
end

Thanks for your help. Taking suggestions from a number of the posts, I have fixed things up to good effect.

This is the way things now look.

using BenchmarkTools
using Revise
using LoopVectorization

n = 2_000_000
data=rand(n)
sc = 0.999999999999
function scaletryglobalfor(scale)
    try
        # open file to obtain data
     
        @inbounds for (i, value) in enumerate(data)
            global data[i] = scale * value;
        end

    finally
        #close file
    end
    data
end

function scaleglobalfor(scale)
    @inbounds for (i,value) in enumerate(data)
        global data[i] = scale * value
    end
    data
end

function scalefor(datalocal, scale)
    @inbounds for (i,value) in enumerate(datalocal)
        datalocal[i] = scale * value
    end
    datalocal
end

function scaledotfor(datalocal, scale)
     @inbounds for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    datalocal
end

function scaledotforloopvectorized(datalocal, scale)
    @turbo for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    datalocal
end

function scaletryglobalbroadcast(scale)
    try
        global data .*= scale
    finally
        # close file
    end
    data
end

function scaleglobalbroadcast(scale)
    global data=rand(n)
    global data .*= scale
    data
end

function scalebroadcast(datalocal, scale)
    datalocal .*= scale
end

With the following results (listed in the order the functions are defined above).

scaletryglobalfor:
  348.019 ms (13999490 allocations: 335.69 MiB)
0.6394983950511726

scaleglobalfor:
  342.780 ms (13999490 allocations: 335.69 MiB)
0.6394983950511726

scalefor:
  1.133 ms (0 allocations: 0 bytes)
0.6394983950511726

scaledotfor:
  1.176 ms (0 allocations: 0 bytes)
0.6394983950511726

scaledotforloopvectorized:
  1.067 ms (0 allocations: 0 bytes)
0.6394983950511726

scaletryglobalbroadcast:
  1.417 ms (3 allocations: 80 bytes)
0.6394983950511726

scaleglobalbroadcast:
  5.270 ms (5 allocations: 15.26 MiB)
0.6394983950511726

scalebroadcast:
  1.156 ms (0 allocations: 0 bytes)
0.6394983950511726


This is with 1 thread in VSCode. I have not yet tested LoopVectorization with multiple threads, but even with a single thread it seems a bit faster.

I doubt that multiple threads will increase performance. With this scaling operation you are probably memory bound, i.e. most of the time is spent transferring the data between main memory and the CPU, and comparatively little time is spent on the actual multiplication. This is typical for level-1 BLAS routines.
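A back-of-the-envelope check of the memory-bound claim, using the 1.133 ms timing reported above (all numbers taken from this thread):

```julia
# In-place scaling streams each Float64 through the CPU once:
# one read and one write per element.
bytes    = 2_000_000 * sizeof(Float64) * 2   # total memory traffic
time_s   = 1.133e-3                          # measured loop time from above
gb_per_s = bytes / time_s / 1e9              # ≈ 28 GB/s effective bandwidth
```

That is in the vicinity of single-core DRAM bandwidth on typical desktop hardware, which is consistent with the multiplication itself not being the bottleneck.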

Before you even start thinking about threading or SIMD via LoopVectorization.jl: you are still using globals; don’t. Note: using “globals” doesn’t necessarily mean you have the `global` keyword anywhere; it just means you are using a variable inside a function that is not passed into the function as an argument. Avoiding this (or at least declaring those globals `const`) will probably give you a two-order-of-magnitude speedup.
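A sketch of the `const` variant (the names here are illustrative): `const` fixes the global’s type, so the compiler can still specialize code that reads it, though passing the data in as an argument remains the cleaner fix.

```julia
# `const` pins the type of the global, restoring type stability.
const DATA = rand(1_000)            # illustrative placeholder data

function scaleconst(scale)
    out = similar(DATA)
    @inbounds for i in eachindex(DATA)
        out[i] = scale * DATA[i]
    end
    out
end
```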

Edit: I now see (sorry; on a phone right now) that your last implementation (and maybe some others) actually does avoid globals. At that point, yes, do experiment with threading/SIMD. For the future, the fact that those implementations no longer allocate is a good sign.
