using BenchmarkTools
datasize = 10000000
data = randn(Float32, datasize)
#recip_vector = similar(data, Float32)
recip_vector = similar(data)
# 1)0.007099 seconds (1 allocation: 32 bytes)
@timev recip_vector .= 1 ./ data
# 2)2.602 ms (0 allocations: 0 bytes)
@btime @. $recip_vector .= 1.0 / $data
# 3)1.878 ms (0 allocations: 0 bytes)
@btime @. $recip_vector .= 1.0 ./ $data
# 4)3.545 ms (1 allocation: 32 bytes)
@btime recip_vector .= 1.0 ./ data
function reciprocal_threaded!(recip_vector, data)
    Threads.@threads for i in 1:length(recip_vector)
        # 5) with `for i in eachindex(recip_vector)` instead: 924.500 μs (161 allocations: 19.80 KiB)
        @inbounds recip_vector[i] = 1.0f0 / data[i]
    end
end
# 6)895.400 μs (161 allocations: 19.80 KiB)
@btime reciprocal_threaded!($recip_vector, $data)
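For completeness, this is the eachindex variant that the timing in comment 5) refers to (reconstructed from that comment; the function name is just for illustration):
# eachindex is the idiomatic, bounds-safe way to iterate over all indices;
# passing both arrays makes it check that their shapes match.
function reciprocal_threaded_eachindex!(recip_vector, data)
    Threads.@threads for i in eachindex(recip_vector, data)
        @inbounds recip_vector[i] = 1.0f0 / data[i]
    end
end
@btime reciprocal_threaded_eachindex!($recip_vector, $data)  # 924.500 μs per comment 5)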
What happened and how can I continue to improve the efficiency of my code?
Hi!
It looks like you are benchmarking in global scope. For example, running your 4th approach inside a function, I get:
using BenchmarkTools
datasize = 10000000
data = randn(Float32, datasize)
recip_vector = similar(data)
julia> function f!(recip_vector, data)
           recip_vector .= 1 ./ data
       end
f! (generic function with 1 method)
julia> @benchmark f!($recip_vector,$data)
BenchmarkTools.Trial: 540 samples with 1 evaluation.
Range (min … max): 6.936 ms … 30.488 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.857 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.244 ms ± 2.754 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂█
███▇▅▅▄▃▃▃▃▃▃▃▃▃▄▃▃▃▃▃▃▃▃▃▃▄▃▃▄▃▃▃▃▂▃▃▂▂▃▂▁▂▂▂▂▁▂▃▂▁▁▁▁▁▁▂ ▃
6.94 ms Histogram: frequency by time 17 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
This is the correct result, since one would not expect this operation to allocate at all.
So the first tip would be to put your code inside functions and repeat the benchmarking!
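As a side note, here is a minimal sketch (assuming the same recip_vector and data as above) of why interpolating globals with $ matters; it also explains the stray 32-byte allocation in your 4th approach, where the arrays were not interpolated:
using BenchmarkTools
# Without $, BenchmarkTools accesses recip_vector and data as untyped
# globals, which shows up as a spurious allocation (cf. approach 4).
@btime recip_vector .= 1 ./ data
# With $, the arrays are spliced in as concrete values, so only the
# kernel itself is measured (cf. approaches 2 and 3, which allocate 0 bytes).
@btime $recip_vector .= 1 ./ $data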
Kind regards
Thank you very much! But why is the vectorized broadcast slower than the for loop? Shouldn't broadcast vectorization be faster than a for loop?
using BenchmarkTools
datasize = 100000000
data = randn(Float32, datasize)
#recip_vector = similar(data, Float32)
recip_vector = similar(data)
# 1)0.007099 seconds (1 allocation: 32 bytes)
@timev recip_vector .= 1 ./ data
# 2)2.602 ms (0 allocations: 0 bytes)
@btime @. $recip_vector .= 1.0 / $data
# 3)1.878 ms (0 allocations: 0 bytes)
@btime @. $recip_vector .= 1.0 ./ $data
# 4)3.545 ms (1 allocation: 32 bytes)
@btime recip_vector .= 1.0 ./ data
function reciprocal_threaded!(recip_vector, data)
    Threads.@threads for i in 1:length(recip_vector)
        # 5) with `for i in eachindex(recip_vector)` instead: 924.500 μs (161 allocations: 19.80 KiB)
        @inbounds recip_vector[i] = 1.0f0 / data[i]
    end
end
# 6)895.400 μs (161 allocations: 19.80 KiB)
@btime reciprocal_threaded!($recip_vector, $data)
function f!(recip_vector, data)
    recip_vector .= 1 ./ data
end
@benchmark f!($recip_vector,$data)
@benchmark reciprocal_threaded!($recip_vector, $data)
@benchmark @. $recip_vector .= 1.0 / $data
@benchmark @. @inbounds $recip_vector .= 1.0 / $data
0.068253 seconds (1 allocation: 32 bytes)
elapsed time (ns): 68253100
gc time (ns): 0
bytes allocated: 32
pool allocs: 1
non-pool GC allocs: 0
minor collections: 0
full collections: 0
33.009 ms (0 allocations: 0 bytes)
32.921 ms (0 allocations: 0 bytes)
37.448 ms (1 allocation: 32 bytes)
17.150 ms (161 allocations: 19.80 KiB)
julia> @benchmark f!($recip_vector,$data)
BenchmarkTools.Trial: 149 samples with 1 evaluation.
Range (min … max): 33.041 ms … 35.955 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 33.562 ms ┊ GC (median): 0.00%
Time (mean ± σ): 33.620 ms ± 361.079 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂ █▃▂▂▂ ▃ ▂
▃▃▆▇▆▇▇▄▆█▇██▇█████▁███▁█▆▆▆▄▄▆▁▇▄▃▃▃▁▃▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
33 ms Histogram: frequency by time 34.9 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reciprocal_threaded!($recip_vector, $data)
BenchmarkTools.Trial: 285 samples with 1 evaluation.
Range (min … max): 17.149 ms … 17.975 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.473 ms ┊ GC (median): 0.00%
Time (mean ± σ): 17.507 ms ± 161.855 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▆▄▃▃ ▂▇ ██▁▃▃▁ ▄ ▃▄ ▁
▃▁▁▃▁▁▄▁▅▃▄▆█████████▇██████▇█▆▆▆▇▆▇██▆▅█▄▇▅▇▄▅▇▅▆▃▅▃▄▁▄▅▁▃▃ ▄
17.1 ms Histogram: frequency by time 17.9 ms <
Memory estimate: 19.80 KiB, allocs estimate: 161.
julia> @benchmark @. $recip_vector .= 1.0 / $data
BenchmarkTools.Trial: 150 samples with 1 evaluation.
Range (min … max): 32.929 ms … 43.245 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 33.263 ms ┊ GC (median): 0.00%
Time (mean ± σ): 33.411 ms ± 851.491 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▁ █▄▃ ▁
▃▁▄▆▄▃▁▄███▆███▇▇▅█▃▆█▅▃█▆▄▅▄▄▁▃▁▆▃▄▅▁▃▁▃▁▃▃▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▃ ▃
32.9 ms Histogram: frequency by time 34.3 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark @. @inbounds $recip_vector .= 1.0 / $data
BenchmarkTools.Trial: 150 samples with 1 evaluation.
Range (min … max): 32.657 ms … 42.639 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 33.010 ms ┊ GC (median): 0.00%
Time (mean ± σ): 33.321 ms ± 1.389 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▄▁
████▇▃▃▃▃▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
32.7 ms Histogram: frequency by time 42.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
So, without having run the code myself, I would like to clarify that vectorization and threading are not necessarily the same thing. A broadcast runs on a single thread (using SIMD vectorization where it can), while Threads.@threads spreads the work across cores. At some point (i.e., for a large enough data size) you will see the threaded loop become faster than the single-threaded broadcast, because however efficient vectorization is, it does not have X threads doing work at once.
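To make the distinction concrete, here is a minimal sketch (my own illustration, not taken from your code): both functions compute the same reciprocals, but the first relies on single-threaded SIMD vectorization, while the second spreads chunks of the loop across threads, and each chunk can still be SIMD-vectorized internally.
# Single-threaded: @simd invites the compiler to vectorize the loop body
# with SIMD instructions, but everything still runs on one core.
function reciprocal_simd!(out, data)
    @inbounds @simd for i in eachindex(out, data)
        out[i] = 1.0f0 / data[i]
    end
    return out
end

# Multi-threaded: each chunk handled by a thread can itself be SIMD-vectorized,
# so threading and vectorization compose rather than compete.
function reciprocal_threads!(out, data)
    Threads.@threads for i in eachindex(out, data)
        @inbounds out[i] = 1.0f0 / data[i]
    end
    return out
end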
It is a good idea to continuously benchmark, as you are doing here, to find out exactly what is needed to increase performance.
Kind regards
Thank you very much! I always thought that vectorized broadcasting would be faster than a for loop, but these tests show that the multi-threaded for loop is actually twice as fast, which is completely different from what I had in mind. I wonder whether Julia's vectorized broadcasting exists mainly for coding convenience, wrapping a for loop underneath. Part of the reason I tested this is that broadcasting keeps the code logic consistent between CPU and GPU; more importantly, during CUDA's CuArray computations, if vectorized broadcasting were twice as slow as for loops, that would have a significant impact. So why not make these two methods equally fast? It would be even better if broadcasting could be faster.
I am sure you can figure out for yourself what happens during vectorization.
In regards to CUDA, I just wanted to add that on a GPU you never really have serial code execution. At a minimum you run a block with x threads, and those threads execute at once. When you use a library such as CUDA.jl, which has implemented efficient methods for broadcasting operations, you get actual "multi-threading" in the GPU context.
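As a minimal sketch (assuming a CUDA-capable GPU and the CUDA.jl package; the variable names are mine), the very same broadcast expression becomes a GPU kernel once the arrays are CuArrays:
using CUDA

data_gpu  = CUDA.randn(Float32, 10_000_000)   # random Float32 data on the device
recip_gpu = similar(data_gpu)

# CUDA.jl compiles this broadcast into a GPU kernel, so the elements are
# processed by GPU threads in parallel; CUDA.@sync waits for the kernel
# to finish, since GPU calls are asynchronous by default.
CUDA.@sync recip_gpu .= 1f0 ./ data_gpu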
On why the compiler does not automatically turn `.` (broadcasting) into multi-threaded operations: I think it is great that it does not. In my view, multi-threading is something one opts into, because threading is not the fastest approach for every piece of code.
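To illustrate why opting in matters (a sketch; the exact crossover point is machine-dependent): for small inputs, the task-spawning overhead, visible as the ~20 KiB / 161 allocations in your threaded benchmarks, can outweigh the arithmetic itself.
using BenchmarkTools

small = randn(Float32, 1_000)
out = similar(small)

# For tiny arrays the single-threaded broadcast usually wins, because
# spawning and synchronizing tasks costs more than the work they do.
@btime $out .= 1.0f0 ./ $small
@btime reciprocal_threaded!($out, $small)  # the function defined earlier in this thread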
Kind regards