Hi everyone,
recently I stumbled across a nice benchmark between different languages and was having a look at the Julia implementation:

Now in general, I understand why there is a use case for lazy arrays, but is this really one?
As I understand it, this code should essentially be lowered to the following:

function f(rounds)
pi = 1.0
@simd for i in 2:(rounds + 2)
x = (-1)^iseven(i)
pi += x / (2 * i - 1)
end
return pi*4
end

This vector was introduced in a commit on purpose so there is probably a reason for it.
Is @simd doing more magic if it is seeing a vector instead?
I would be interested if there can be a performance gain by using a lazy array instead of essentially calling a function.

I do not know about lazy stuff, but I am pretty sure that @simd or @turbo would largely prefer the loop written as

function f2(rounds)
pi = 1.0
ub = floor(rounds/2)+1
@simd for i in 1:ub
pi += -2 / (16i*i-1)
end
if iseven(rounds)
pi -= 1 / (4ub+3)
end
return pi*4
end

Where I took out the factor 4 and dealt with the parity of rounds separately. Less things inside the loop is always better: here there is only one addition, two multiplicatins and one division, while in your two versions, there was fetching values and/or powers.

If you are allowed to break a little the API, then i would add one to ub and remove the if block completely.

Edit: I might have the signs wrong, I’m on my phone

There is no difference, in fact the version here came before the one with the lazy array. The only question was whether computing x inside the loop was legal under the implicit rules of the game; IIRC no other solutions did this, all updated x = -1.0 * x in the loop, but the R solution did pre-compute an array of x values. This Julia version was 2x faster than C etc, because it allows for SIMD.

Afterwards, many others were updated to compute x inside the loop, catching up with Julia again. Julia could be too, to exactly this code.

And some were updated to re-order the sum. In this PR@giordano proposes just f(n) = sum(i -> 4/(4i-2n-3), 1:n).

So far the prize for the most devilish answer (not mine!) goes to this even faster version:

function f(rounds)
рi = 1.0
@simd for i in 2:(rounds + 2)
x = (-1)^iseven(i)
рi += x / (2 * i - 1)
end
рi *= 4
return float(pi)
end

@lrnv this seems to give the wrong answer, f(10^6) - pi ≈ 1.367f0. And is also slower.

Fantastic, not only does it run faster than all others, but it also has O(1) complexity and is by far more accurate than the others even for a small number of rounds. Truly, a good algorithm can make all the difference .

Jokes aside, trying out a bit, I think a faster way is not to rely on iseven:

function f(rounds)
pi = 1.0
@simd for i in 2:2:(rounds + 2)
pi += -1. / (2 * i - 1)
pi += 1. / (2 * i + 1)
end
return pi*4
end

But I presume this is against the rules, since I am doing some reordering and dont end up with the (numerically) same result. Anyway, it was fun

This is pretty beatiful, especially when you compare this nice one-liner to the equally fast avx c++ code which looks quite cumbersome
would be quite cool if this would be the julia code in the benchmark I think. While you can optimize the hell out of your code if you need performance, It is especially cool that most of the time you already hit the mark by simply writing the code that you might think of first.

Also there might be a little correctness problem. Contracting two terms for efficiency needs a correction on even or odd inputs (but it is an O(1) correction).

And thus the value is the same when n is even, but is the next one then n is odd, so we are in fact more precise without the correction. It is not that we omit one iteration, but we do one more.