In scaling a vector I tried a number of implementations to determine which performs best. Initially I implemented it with global variables and a for loop; then, reading the performance tips (and Discourse), I realized that was bad. I also wanted to figure out the penalty for a try/catch block. I am finding that broadcasting is about two orders of magnitude faster than a for loop, which seems unreasonable. At first I attributed this to global variables, but a purely local implementation didn't seem to help. I tried looking at the @code_llvm output but am not adept enough to read the results.
Why am I getting these results? The code and results are below; the number of allocations must be a clue.
Focusing on scalefor() and scalebroadcast() would be appropriate.
using BenchmarkTools
using Revise
# Some trials to ensure that the scaling algorithm is as efficient as possible
# broadcasting is two orders of magnitude faster
n = 2_000_000
function scaletryglobalfor(scale)
    try
        # open file to obtain data
        global data = rand(n)
        @inbounds for (j, value) in enumerate(data)
            global data[j] = scale * value
        end
    finally
        # close file
    end
    data
end

function scaleglobalfor(scale)
    global data = rand(n)
    @inbounds for (i, value) in enumerate(data)
        global data[i] = scale * value
    end
    data
end

function scalefor(scale)
    datalocal = rand(n)
    @inbounds for (i, value) in enumerate(data)
        datalocal[i] = scale * value
    end
    data
end

function scaledotfor(scale)
    datalocal = rand(n)
    @inbounds for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    data
end

function scaletryglobalbroadcast(scale)
    global data = rand(n)
    try
        global data .*= scale
    finally
        # close file
    end
    data
end

function scaleglobalbroadcast(scale)
    global data = rand(n)
    global data .*= scale
    data
end

function scalebroadcast(scale)
    data = rand(n)
    data .*= scale
end
julia> @btime scaletryglobalfor(2.2);
  308.560 ms (13999492 allocations: 350.94 MiB)

julia> @btime scaleglobalfor(2.2);
  307.769 ms (13999492 allocations: 350.94 MiB)

julia> @btime scalefor(2.2);
  320.688 ms (13999492 allocations: 350.94 MiB)

julia> @btime scaledotfor(2.2);
  235.144 ms (11998982 allocations: 228.87 MiB)

julia> @btime scaletryglobalbroadcast(2.2);
  4.922 ms (5 allocations: 15.26 MiB)

julia> @btime scaleglobalbroadcast(2.2);
  4.888 ms (5 allocations: 15.26 MiB)

julia> @btime scalebroadcast(2.2);
  4.863 ms (5 allocations: 15.26 MiB)
In both scalefor and scaledotfor you work on datalocal, but then return data, which is presumably a global variable. You also access n, which is a global.
I suggest you restart Julia and get rid of all globals in your code; they are a curse. Don't use them until you are more experienced, if ever.
Then you will get errors if you make mistakes like that again.
Furthermore, you are also timing the generation of your data by calling rand inside the function, which probably takes more time than the loop itself. Call rand outside your function and pass the data in as an argument.
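For example, a minimal sketch of that pattern (scale_vec! is just an illustrative name; note the $ interpolation in @btime so that looking up the global binding is not part of what is timed):

using BenchmarkTools

# Scale each element in place; the data is passed in as an argument,
# so the hot loop never touches a non-constant global.
function scale_vec!(v, scale)
    @inbounds for i in eachindex(v)
        v[i] *= scale
    end
    v
end

data = rand(2_000_000)
@btime scale_vec!($data, 2.2);
# (Repeated in-place scaling will eventually overflow the values,
# but that does not affect the timing of the multiplications.)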
You’ve got a lot of different pieces of code here, and many of the versions aren’t doing what you’re expecting. For example, scaledotfor writes to datalocal but returns data, which is a global variable created…somewhere else.
The fundamental issue with all of your loop-based code is that you're still using the non-constant global variable n. Since the compiler can't know n's type, it can't infer the type of rand(n) either, so the array you loop over is effectively untyped: every iteration then boxes values and dispatches dynamically, which is where the millions of allocations come from. Fixing that resolves the performance issue entirely:
julia> function scaledotfor(scale, n)
           data = rand(n)
           @inbounds for i in 1:length(data)
               data[i] *= scale
           end
           data
       end
scaledotfor (generic function with 2 methods)

julia> function scalebroadcast(scale, n)
           data = rand(n)
           data .*= scale
       end
scalebroadcast (generic function with 1 method)

julia> @btime scaledotfor(2.2, 2_000_000);
  2.487 ms (2 allocations: 15.26 MiB)

julia> @btime scalebroadcast(2.2, 2_000_000);
  2.568 ms (2 allocations: 15.26 MiB)
Thanks for your help. I have fixed things up to good effect, taking suggestions from a number of posts.
This is the way things now look.
using BenchmarkTools
using Revise
using LoopVectorization
n = 2_000_000
data=rand(n)
sc = 0.999999999999
function scaletryglobalfor(scale)
    try
        # open file to obtain data
        @inbounds for (i, value) in enumerate(data)
            global data[i] = scale * value
        end
    finally
        # close file
    end
    data
end

function scaleglobalfor(scale)
    @inbounds for (i, value) in enumerate(data)
        global data[i] = scale * value
    end
    data
end

function scalefor(datalocal, scale)
    @inbounds for (i, value) in enumerate(datalocal)
        datalocal[i] = scale * value
    end
    datalocal
end

function scaledotfor(datalocal, scale)
    @inbounds for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    datalocal
end

function scaledotforloopvectorized(datalocal, scale)
    @turbo for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    datalocal
end

function scaletryglobalbroadcast(scale)
    try
        global data .*= scale
    finally
        # close file
    end
    data
end

function scaleglobalbroadcast(scale)
    global data = rand(n)
    global data .*= scale
    data
end

function scalebroadcast(datalocal, scale)
    datalocal .*= scale
end
With the following results, in the order the functions are defined above.
348.019 ms (13999490 allocations: 335.69 MiB)
0.6394983950511726
342.780 ms (13999490 allocations: 335.69 MiB)
0.6394983950511726
1.133 ms (0 allocations: 0 bytes)
0.6394983950511726
1.176 ms (0 allocations: 0 bytes)
0.6394983950511726
1.067 ms (0 allocations: 0 bytes)
0.6394983950511726
1.417 ms (3 allocations: 80 bytes)
0.6394983950511726
5.270 ms (5 allocations: 15.26 MiB)
0.6394983950511726
1.156 ms (0 allocations: 0 bytes)
0.6394983950511726
This is with one thread in VS Code. I have not yet tested the LoopVectorization version with multiple threads, but even with a single thread it seems a bit faster.
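A sketch of what such a test might look like (scaledotforthreaded is just a placeholder name; @tturbo is LoopVectorization's threaded counterpart of @turbo and requires starting Julia with more than one thread, e.g. julia -t auto):

using LoopVectorization

# Threaded variant: @tturbo splits the loop across the available Julia threads.
function scaledotforthreaded(datalocal, scale)
    @tturbo for i in 1:length(datalocal)
        datalocal[i] *= scale
    end
    datalocal
end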
I doubt that multiple threads will increase performance. With this scaling operation you are probably memory bound, i.e., most of the time is spent transferring data between main memory and the CPU, and comparably little time is spent on the actual multiplication. This is typical for level-1 BLAS routines.
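For reference, this operation is available as an in-place level-1 BLAS call from Julia; a minimal sketch (rmul! should dispatch to BLAS.scal! for a dense Float64 vector), which runs into the same memory-bandwidth limit:

using LinearAlgebra

data = rand(2_000_000)
rmul!(data, 2.2)  # scales data in place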
Before you even start thinking about threading or SIMD via LoopVectorization.jl: you are still using globals; don't. Note that using "globals" doesn't necessarily mean the global keyword appears anywhere; it just means you are using a variable inside a function that is not passed into the function as an argument. Avoiding this (or at least declaring those globals const) will probably give you a two-orders-of-magnitude speedup.
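For example, a minimal sketch of the const variant (scaleconst! and cdata are just illustrative names):

# A const global has a fixed type, so the compiler can generate specialized,
# allocation-free code even though cdata is not passed in as an argument.
const cdata = rand(2_000_000)

function scaleconst!(scale)
    @inbounds for i in eachindex(cdata)
        cdata[i] *= scale
    end
    cdata
end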
Edit: I now see (sorry, on a phone right now) that your last implementation (and maybe some others) actually does avoid globals. At that point, yes, do experiment with threading/SIMD. For the future, the fact that those implementations no longer allocate is a good sign.