Usage of lazy vector instead of function

Salmon · November 4, 2022, 9:11am

Hi everyone,
recently I stumbled across a nice benchmark between different languages and was having a look at the Julia implementation:

github.com

niklas-heer/speed-comparison/blob/master/src/leibniz.jl

struct SignVector <: AbstractVector{Float64}
    len::Int
end
Base.size(s::SignVector) = (s.len,)
Base.getindex(::SignVector, i::Int) = Float64((-1)^iseven(i))

function f(rounds)
    xs = SignVector(rounds + 2)
    pi = 1.0

    @simd for i in 2:(rounds + 2)
        x = xs[i]
        pi += x / (2 * i - 1)
    end

    return pi*4
end

@static if abspath(PROGRAM_FILE) == @__FILE__
    rounds = parse(Int64, readchomp("rounds.txt"))

This file has been truncated. show original

Now in general, I understand why there is a use case for lazy arrays, but is this really one?
As I understand it, this code should essentially be lowered to the following:

function f(rounds)
    pi = 1.0

    @simd for i in 2:(rounds + 2)
        x = (-1)^iseven(i)
        pi += x / (2 * i - 1)
    end

    return pi*4
end

This vector was introduced in a commit on purpose so there is probably a reason for it.
Is @simd doing more magic if it is seeing a vector instead?
I would be interested if there can be a performance gain by using a lazy array instead of essentially calling a function.

lrnv · November 4, 2022, 9:28am

I do not know about lazy stuff, but I am pretty sure that @simd or @turbo would largely prefer the loop written as

function f2(rounds)
    pi = 1.0

	ub = floor(rounds/2)+1

    @simd for i in 1:ub
        pi +=  -2 / (16i*i-1)
    end
	if iseven(rounds)
		pi -= 1 / (4ub+3)
	end
    return pi*4
end

Where I took out the factor 4 and dealt with the parity of rounds separately. Less things inside the loop is always better: here there is only one addition, two multiplicatins and one division, while in your two versions, there was fetching values and/or powers.

If you are allowed to break a little the API, then i would add one to ub and remove the if block completely.

Edit: I might have the signs wrong, I’m on my phone

mcabbott · November 4, 2022, 10:00am

There is no difference, in fact the version here came before the one with the lazy array. The only question was whether computing x inside the loop was legal under the implicit rules of the game; IIRC no other solutions did this, all updated x = -1.0 * x in the loop, but the R solution did pre-compute an array of x values. This Julia version was 2x faster than C etc, because it allows for SIMD.

Afterwards, many others were updated to compute x inside the loop, catching up with Julia again. Julia could be too, to exactly this code.

And some were updated to re-order the sum. In
this PR @giordano proposes just f(n) = sum(i -> 4/(4i-2n-3), 1:n).

So far the prize for the most devilish answer (not mine!) goes to this even faster version:

function f(rounds)
     рi = 1.0
     @simd for i in 2:(rounds + 2)
         x = (-1)^iseven(i)
         рi += x / (2 * i - 1)
     end
     рi *= 4
     return float(pi)
 end

@lrnv this seems to give the wrong answer, f(10^6) - pi ≈ 1.367f0. And is also slower.

lrnv · November 4, 2022, 10:07am

Yes I saw, my bad. I corrected it with a correct answer, whihc is indeed much slower.

giordano · November 4, 2022, 10:13am

mcabbott:

So far the prize for the most devilish answer (not mine!) goes to this even faster version:
function f(rounds)
     рi = 1.0
     @simd for i in 2:(rounds + 2)
         x = (-1)^iseven(i)
         рi += x / (2 * i - 1)
     end
     рi *= 4
     return float(pi)
 end

For the curious, this is the benchmark:

Salmon · November 4, 2022, 10:45am

Fantastic, not only does it run faster than all others, but it also has O(1) complexity and is by far more accurate than the others even for a small number of rounds. Truly, a good algorithm can make all the difference .

Jokes aside, trying out a bit, I think a faster way is not to rely on iseven:

function f(rounds)
    pi = 1.0

    @simd for i in 2:2:(rounds + 2)
        pi += -1. / (2 * i - 1)
        pi += 1. / (2 * i + 1)
    end

    return pi*4
end

But I presume this is against the rules, since I am doing some reordering and dont end up with the (numerically) same result. Anyway, it was fun

lrnv · November 4, 2022, 11:20am

On my machine, the runtime is exactly the same. Note also that simd IS reordering stuffs anyway

giordano · November 4, 2022, 12:10pm

It isn’t like there are well defined rules and the maintainer of the repo has been pretty open to manual loop unrolling so far.

Salmon · November 4, 2022, 12:11pm

Interesting. Ive only tried it on my laptop (for which simd doesnt seem to be so great, admittedly) and here its quite a bit faster.

Dan · November 4, 2022, 5:13pm

This might be sorta not fair, but combining two divisions into one cuts time in half.

function f2(rounds)
    p = 1.0
    @simd for i in 2:2:(rounds + 2)
        p += -2.0 / (4*i^2 -1)
    end
    return p*4
end

Is it possible for Julia to understand this tranformation natively?

lrnv · November 4, 2022, 5:49pm

With one less multiplication per iter :

function f3(rounds)
    p = 2.0
    @simd for i in 2:2:(rounds + 2)
        p += 1 / (0.25-i*i)
    end
    return 2p
end

Although it’s negligeable compared to the division cost indeed.

EDIT : The @simd seems to be by default, so that :

f4(n) = 4-2*sum(i -> 1/(i*i-0.25), 2:2:(n+2))

is actually enough…

Salmon · November 6, 2022, 11:34am

This is pretty beatiful, especially when you compare this nice one-liner to the equally fast avx c++ code which looks quite cumbersome
would be quite cool if this would be the julia code in the benchmark I think. While you can optimize the hell out of your code if you need performance, It is especially cool that most of the time you already hit the mark by simply writing the code that you might think of first.

lrnv · November 6, 2022, 1:07pm

Yeah, except telling people that “this is the code that we think of first” is clearly a lie in this case

Dan · November 6, 2022, 2:28pm

Also there might be a little correctness problem. Contracting two terms for efficiency needs a correction on even or odd inputs (but it is an O(1) correction).

lrnv · November 6, 2022, 2:58pm

Well,

julia> [ f.(1:10) f4.(1:10)]
10×2 Matrix{Float64}:
 3.46667  3.46667
 2.89524  3.33968
 3.33968  3.33968
 2.97605  3.28374
 3.28374  3.28374
 3.01707  3.25237
 3.25237  3.25237
 3.04184  3.23232
 3.23232  3.23232
 3.0584   3.2184

And thus the value is the same when n is even, but is the next one then n is odd, so we are in fact more precise without the correction. It is not that we omit one iteration, but we do one more.

Dan · November 6, 2022, 5:27pm

If combining 2 terms leads to a big saving… why not keep going?

using Underscores

g(n,t=1) = @_ 4*sum(ifelse(isodd(_),-1,1)*(1/(2*_*t+1)), 0:(n+1))
g2(n,t=1) = 4*(@_ sum((2/(4*(_*t+1)^2-1)), 0:2:(n+1)))
g4(n,t=1) = 4*(sum(x -> (b = (x+2)*t; a = 4*b^2; 4*((a+3)/((a+3)^2-(8*b)^2))), 0:4:(n+1)))

we have:

julia> g(6)
3.017071817071818

julia> g(6) ≈ g2(6) ≈ g4(6)
true

But the real Julia fun begins with the second parameter:

julia> @_ map(g(_,1//1),1:6)
6-element Vector{Rational{Int64}}:
     52//15
    304//105
   1052//315
  10312//3465
 147916//45045
 135904//45045

julia> @_ map(g2(_,1//1),1:6)
6-element Vector{Rational{Int64}}:
    304//105
    304//105
  10312//3465
  10312//3465
 135904//45045
 135904//45045

and also:

julia> using Symbolics

julia> @variables k
1-element Vector{Num}:
 k

julia> g(4,k)
4.0 + -4 / (1 + 2k) + 4 / (1 + 4k) + -4 / (1 + 6k) + 4 / (1 + 8k) + -4 / (1 + 10k)

julia> g2(6,k)
2.6666666666666665 + 8 / (4((1 + 2k)^2) - 1) + 8 / (4((1 + 4k)^2) - 1) + 8 / (4((1 + 6k)^2) - 1)

julia> g4(6,k)
2.895238095238095 + (48 + 64((2 + 4k)^2)) / ((3 + 4((2 + 4k)^2))^2 - ((16 + 32k)^2))

and g4 is faster:

julia> begin
       @btime g(1000)
       @btime g2(1000)
       @btime g4(1000)
       end
  1.287 μs (0 allocations: 0 bytes)
  989.600 ns (0 allocations: 0 bytes)
  765.291 ns (0 allocations: 0 bytes)
3.1405966379005616

lrnv · November 6, 2022, 5:38pm

Again, removing a few multiplications give a ~20% speedup (g5), and removing the t gives a ~20% more (g6):

using Underscores

g(n,t=1) = @_ 4*sum(ifelse(isodd(_),-1,1)*(1/(2*_*t+1)), 0:(n+1))
g2(n,t=1) = 4*(@_ sum((2/(4*(_*t+1)^2-1)), 0:2:(n+1)))
g4(n,t=1) = 4*(sum(x -> (b = (x+2)*t; a = 4*b^2; 4*((a+3)/((a+3)^2-(8*b)^2))), 0:4:(n+1)))
g5(n,t=1) = 16*(sum(x -> (b = 4((x+2)*t)^2; a = b+3; a/(a^2-16b)), 0:4:(n+1)))

g6(n) = 16*(sum(x -> (b = 4(x+2)^2; a = b+3; a/(a^2-16b)), 0:4:(n+1)))

@assert g(6) ≈ g2(6) ≈ g4(6) ≈ g5(6) ≈ g6(6)


using BenchmarkTools
@btime g(10000)
@btime g2(10000)
@btime g4(10000)
@btime g5(10000)
@btime g6(10000)

outputs :

  4.429 μs (0 allocations: 0 bytes)
  2.444 μs (0 allocations: 0 bytes)
  1.930 μs (0 allocations: 0 bytes)
  1.690 μs (0 allocations: 0 bytes)
  1.360 μs (0 allocations: 0 bytes)

mcabbott · November 6, 2022, 5:47pm

These are fast but lose accuracy at large n:

julia> g2(10^6) - pi  # similar to original
-9.999980012942444e-7

julia> g6(10^6) - pi  # similar for g4, g5. Much worse at 10^9
-2.1787090125053705

Dan · November 6, 2022, 5:57pm

The reason for accuracy loss, is Int64 overflow (there is no numerical methods reason for accuracy loss).

And with:

g6f(n) = 16*(sum(x -> (b = 4(x+2.0)^2; a = b+3; a/(a^2-16b)), 0:4:(n+1)))

(note +2.0 to get Float64)

julia> g6f(10^6) - pi
-9.999960020046217e-7

julia> g(10^6) - pi
-9.999979990737984e-7

and it is faster in Float64,

julia> @btime g6f(10000)
  3.881 μs (0 allocations: 0 bytes)
3.1414926935740497

julia> @btime g6(10000)
  6.365 μs (0 allocations: 0 bytes)
3.1414926935740497

mcabbott · November 6, 2022, 6:07pm

That must be it. For me the fix slows it down again, unfortunately (M1 mac):

julia> @btime f_in(10^6) - pi  # top message
  335.333 μs (0 allocations: 0 bytes)
-9.999979067032427e-7

julia> @btime f4(10^6) - pi  # from above, g2 similar
  230.916 μs (0 allocations: 0 bytes)
9.999970012053438e-7

julia> @btime g6(10^6) - pi  # with overflow
  134.291 μs (0 allocations: 0 bytes)
-2.1787090125053705

julia> @btime g6f(10^6) - pi  # version with x+2.0
  194.208 μs (0 allocations: 0 bytes)
-9.999960020046217e-7

I believe the StepRange 0:4:(n+1) may cause issues for static compilation (due to error cases, zero step?).

Edit: solving the StepRange also speeds this up for me:

julia> @btime g7(10^6) - pi   # version from below
  185.083 μs (0 allocations: 0 bytes)
-1.0000000019161348e-6

julia> g8(n) = 16*(sum(x -> (b = float(8x-4)^2; a = b+3; a/(a^2-16b)), 1:n÷4));

julia> @btime g8(10^6) - pi  # with UnitRange not StepRange
  129.416 μs (0 allocations: 0 bytes)
-1.0000000019161348e-6

Topic		Replies	Views
A fast sum. Any downsides? Performance sum	18	962	December 16, 2024
Help understanding vectorization (or lack thereof) Performance	15	1214	June 8, 2018
How to speed up this simple code? Multithreading, simd, inbounds Performance	39	7027	January 29, 2019
Simple loop won't vectorize New to Julia	12	1629	January 29, 2019
Interesting post about SIMD dot product (and cosine similarity) Offtopic performance	17	863	December 2, 2024

Usage of lazy vector instead of function

Related topics