Yes, llvm is fairly smart in general. Also, my implementation can be made a bit faster by inlining hor_min and adding @inbounds to the calculation of min_index.
But does smart help here? What guarantee is there that x hasn't been modified between the loads? Doesn't the code require a new load, for correctness?
fixed
Be careful with that. LLVM will silently generate nonsense code for non-power-of-2 sizes. Afaiu there is no consensus in the llvm community whether this is a bug or whether this idiom for pmovmskb only works coincidentally. So you need pretty comprehensive unit tests for that, in order to yell if an llvm update breaks this.
The relevant point of contention appears to be that the bitcast langref refers to the “in memory representation” of the datatypes, but i1 has no in-memory representation. Yes, this is very silly and annoying. I hate language-lawyering with a passion.
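For anyone following along, this is roughly the idiom being discussed: a lane-wise comparison whose <N x i1> result is bitcast to an N-bit integer, pmovmskb-style. Below is a minimal sketch of my own using a power-of-2 width (i.e. the uncontroversial case); the name movemask_lt and the 8-lane width are assumptions for illustration, not code from any of the implementations above:

const V8F64 = NTuple{8,VecElement{Float64}}

# Hypothetical helper: compare two 8-lane Float64 vectors lane-wise and pack
# the <8 x i1> comparison result into the low bits of a UInt8 via the
# `bitcast <8 x i1> to i8` idiom discussed above.
@inline function movemask_lt(a::V8F64, b::V8F64)
    Base.llvmcall("""
        %cmp = fcmp olt <8 x double> %0, %1
        %mask = bitcast <8 x i1> %cmp to i8
        ret i8 %mask
        """, UInt8, Tuple{V8F64,V8F64}, a, b)
end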
Thanks @jling for bringing this up. I have been very interested in the discussion.
Just to give a bit of context here: in this problem the metric dij is calculated for a set of clusters (usually 300-500 at the beginning) and at each iteration we need to find its minimum value. Then two clusters are merged into one, and some of the dij values are updated. The next minimum is found, a merger happens, and so on, until the process finishes.
This is why the realistic test is to take a random array of length, say, 400: find the minimum, then reduce the search length to 399, 398, 397, …, 1. Adding some update to the array values each iteration makes it even more realistic (prevents unrealistic caching).
This is my wrapper to test any findmin implementation f on a vector v:
function run_descent(v::DenseVector{Float64}, f::T; perturb = 0) where T
    # Ensure we do something with the calculation to prevent the
    # compiler from optimizing everything away!
    sum = 0.0
    for n in length(v):-1:2
        val, _ = f(v, n)
        sum += val
        # If one wants to perturb the array do it like this, which
        # is a proxy for changing values as the algorithm progresses.
        for _ in 1:min(perturb, n)
            v[rand(1:n)] = abs(rand())
        end
    end
    sum
end
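For example, a hypothetical driver (not part of the original script) could look like this, benchmarking one of the implementations discussed below on a fresh copy of the data for each sample:

using BenchmarkTools

v = rand(400)
# evals = 1 so that every sample starts from an unperturbed copy of v
@benchmark run_descent(w, naive_findmin; perturb = 2) setup = (w = copy($v)) evals = 1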
I introduced the trick with LoopVectorisation when I realised that the out-of-the-box Julia findmin was really slow. However, now I am realising that even just a basic naive loop is a lot faster than findmin, so a lot of the speed-up may have nothing to do with vectorisation.
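For reference, here is a sketch of the kind of basic naive loop I mean (an assumption on my part; the actual basic_findmin in the script may differ in detail):

# Plain scalar loop over the first n entries, tracking the smallest value and
# its index. No LoopVectorization, no SIMD intrinsics.
function basic_findmin(dij::DenseVector{T}, n) where T
    best = dij[1]
    best_index = 1
    @inbounds for i in 2:n
        if dij[i] < best
            best = dij[i]
            best_index = i
        end
    end
    best, best_index
end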
I wrapped up all of the suggestions so far into a script and this is what I get on two machines:
Running the benchmark for an array of length 400 (Apple M2)
fast_findmin min: 25.333000000000002 μs; mean: 26.46452926525524 μs
basic_findmin min: 83.25 μs; mean: 85.91027592592592 μs
julia_findmin min: 360.791 μs; mean: 366.5392078431371 μs
naive_findmin min: 23.167 μs; mean: 36.22719819819824 μs
naive_findmin_reduce min: 24.875 μs; mean: 36.793294815987366 μs
fast_findmin_simd min: 42.291000000000004 μs; mean: 44.39575992348151 μs
Running the benchmark for an array of length 400 (x86_64)
fast_findmin min: 16.721 μs; mean: 16.934860248447162 μs
basic_findmin min: 22.782 μs; mean: 23.006807827395896 μs
julia_findmin min: 82.84500000000001 μs; mean: 88.04729608404968 μs
naive_findmin min: 17.352000000000004 μs; mean: 33.093611011154984 μs
naive_findmin_reduce min: 18.884999999999998 μs; mean: 34.14370736589271 μs
fast_findmin_simd min: 16.54 μs; mean: 18.151618091451382 μs
The platform differences are revealing, particularly that basic_findmin is quite bad on the M2, but performs well on the x86_64. Julia's findmin is particularly bad on M2.
Nothing is quite beating the LoopVectorization version fast_findmin, but naive_findmin is quite respectable on both platforms, though slower than basic_findmin on x86. fast_findmin_simd is good on x86, not great on M2 (which only has NEON, as opposed to AVX2).
fast_findmin_simd is also pretty complicated, so I would not like to have that in my package (though FindFirstFunctions.jl is terrifying!).
Microbenchmarks can be misleading, so my next test is in the real code paths of JetReconstruction.jl.
I’m quite surprised by this myself!
In my version the second argument n is ignored, which means that the whole vector is always searched. Have you changed this already? Otherwise you need something like the following:
function naive_findmin(w, n)
    u = view(w, 1:n)
    x = @fastmath foldl(min, u)
    i = findfirst(==(x), u)::Int
    x, i
end
It is currently like this:
function naive_findmin(dij::DenseVector{T}, n) where T
    x = @fastmath foldl(min, @view dij[1:n])
    i = findfirst(==(x), dij)::Int
    x, i
end
So, yes, in the index finder the value of n is ignored, but as x must occur within 1:n, findfirst will exit as soon as it finds the correct value, so it will never search beyond n.
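A quick REPL illustration with toy values (not from the benchmark) of why the search never goes past n:

julia> w = [3.0, 1.0, 2.0, 1.0, 0.5];

julia> n = 4; x = foldl(min, @view w[1:n])
1.0

julia> findfirst(==(x), w)  # first occurrence is at index 2, well within 1:n
2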
I did a quick test with your version and AFAICS the benchmark is unchanged.
And evidently the compiler does a good job with this code!
Apologies if this was already brought up, but I just saw this, which is the most perplexing to me yet:
julia> x = randn(4096);
julia> @benchmark findmin($x)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 6.067 μs … 21.583 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.083 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.226 μs ± 843.417 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁▂ ▁
██▇▆▄▄▅▄███▇▅▄▄▃▄▄▄▁▃▄▁▄▃▄▃▃▆▆▄▆▅▃▃▅▄▃▃▃▄▃▅▄▁▅▄▄▃▅▃▃▄▄▄▄▁▄▄ █
6.07 μs Histogram: log(frequency) by time 10.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark minimum($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.600 μs … 57.200 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.996 μs ± 2.498 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▁ ▁ ▁
███▅▅▄▆█▆▇▆▆▆▆▅▄▄▃▃▄▃▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▄▅▅ █
13.6 μs Histogram: log(frequency) by time 27.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
These results are consistent for me across many tests. This is on Julia v1.11.1.
To me there’s a cross-over point:
julia> x = randn(1024);
julia> @be findmin($x)
Benchmark: 3029 samples with 34 evaluations
min 861.588 ns
median 875.147 ns
mean 890.109 ns
max 4.051 μs
julia> @be minimum($x)
Benchmark: 3103 samples with 26 evaluations
min 1.024 μs
median 1.112 μs
mean 1.132 μs
max 4.352 μs
julia> x = randn(10240);
julia> @be findmin($x)
Benchmark: 3546 samples with 3 evaluations
min 8.372 μs
median 8.553 μs
mean 8.551 μs
max 12.413 μs
julia> @be minimum($x)
Benchmark: 3188 samples with 8 evaluations
min 3.334 μs
median 3.450 μs
mean 3.451 μs
max 5.810 μs
Maybe you just need a larger N, larger than 4096.
For me minimum is slower in all cases, at least up to 10^6.
julia> @btime findmin(x) setup=(x=randn(2^20))
1.567 ms (0 allocations: 0 bytes)
(-4.8003337992633694, 112476)
julia> @btime minimum(x) setup=(x=randn(2^20))
3.518 ms (0 allocations: 0 bytes)
-5.193793445487401
Not for me. Even up to 2^26 (=67M):
julia> @btime findmin(x) setup=(x=randn(2^26))
173.805 ms (0 allocations: 0 bytes)
(-5.225567009788859, 4044609)
julia> @btime minimum(x) setup=(x=randn(2^26))
234.388 ms (0 allocations: 0 bytes)
-5.577759186585374
Oh ok, sorry, I misread your last message. What CPU are you using?
12 × Intel(R) Core™ i7-9750H CPU @ 2.60GHz
Interesting, on both Zen2 and Zen4 I see:
#zen2 24 × AMD Ryzen 9 3900X 12-Core Processor
julia> @btime findmin(x) setup=(x=randn(2^26))
81.600 ms (0 allocations: 0 bytes)
(-5.711457299002498, 57053893)
julia> @btime minimum(x) setup=(x=randn(2^26))
26.146 ms (0 allocations: 0 bytes)
-5.982782024669893
#zen4 16 × AMD Ryzen 7 7840HS
julia> @btime findmin(x) setup=(x=randn(2^26))
55.342 ms (0 allocations: 0 bytes)
(-5.408897097757254, 24735986)
julia> @btime minimum(x) setup=(x=randn(2^26))
17.472 ms (0 allocations: 0 bytes)
-5.696407087349619
Can we split this thread into another: findmin() faster than minimum() on some CPU?