Julia (with compile time) about 5.6 times faster than SAS

[Original title: Julia benchmark against SAS only 50% We need a faster Julia please]

I have read the article BENCHMARKING JULIA AGAINST PYTHON: https://juliacomputing.com/assets/pdf/JuliaVsPython.pdf
So amazing!
Then I added a SAS benchmark to compare with Julia, using the sample code from the article. Here are the details:
In Julia

function pi_calcT(n::Int64)
    a0::Float64 = 2^0.5
    b0::Float64 = 0
    p0::Float64 = 2 + 2^0.5
    for i in 1:n
        a1::Float64 = (a0 + a0^-0.5)/2.0
        b1::Float64 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1::Float64 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
end

function pi_calc(n)
    a0 = 2^0.5
    b0 = 0
    p0 = 2 + 2^0.5
    for i in 1:n
        a1 = (a0 + a0^-0.5)/2.0
        b1 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
end
@timev pi_calc(10^7)
# 0.071000 seconds
# elapsed time (ns): 70999817
@timev pi_calcT(10^7)
# 0.071063 seconds
# elapsed time (ns): 71063303

In SAS

1    option fullstimer;
2    /*fcmp function*/
3    proc fcmp outlib=sasuser.funcs.julia;
4       subroutine pi_calc(n);
5         a0=2**0.5;
6         b0=0;
7         p0=2+2**0.5;
8         do i=1 to n;
9            a1=(a0+a0**-0.5)/2.0;
10           b1=((1+b0)*(a0**0.5))/(a0+b0);
11           p1=((1+a1)*p0*b1)/(1+b1);
12           a0=a1;
13           b0=b1;
14           p0=p1;
15        end;
16      endsub;
17   run;

NOTE: Function pi_calc saved to sasuser.funcs.julia.
NOTE: PROCEDURE FCMP used (Total process time):
      real time           0.08 seconds
      user cpu time       0.04 seconds
      system cpu time     0.04 seconds
      memory              6810.78k
      OS Memory           15848.00k
      Step Count                        1  Switch Count  0


18   options cmplib=sasuser.funcs;
19   /*run the pi_calc*/
20   data _null_;
21   call pi_calc(10**7);
22   run;

NOTE: DATA statement used (Total process time):
      real time           0.30 seconds
      user cpu time       0.29 seconds
      system cpu time     0.00 seconds
      memory              1746.15k
      OS Memory           17136.00k
      Step Count                        2  Switch Count  0


23   /*DATA step for pi_calc*/
24   data _null_;
25        n=10**7;
26        a0=2**0.5;
27        b0=0;
28        p0=2+2**0.5;
29        do i=1 to n;
30           a1=(a0+a0**-0.5)/2.0;
31           b1=((1+b0)*(a0**0.5))/(a0+b0);
32           p1=((1+a1)*p0*b1)/(1+b1);
33           a0=a1;
34           b0=b1;
35           p0=p1;
36        end;
37   run;

NOTE: DATA statement used (Total process time):
      real time           0.39 seconds
      user cpu time       0.37 seconds
      system cpu time     0.00 seconds
      memory              782.93k
      OS Memory           17136.00k
      Step Count                        3  Switch Count  0

Whatever I tried, Julia always came out slower than SAS.
Compare with the data disclosed in the article:

LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82 (0.71)
SAS        0.30 in function (0.39 in DATA step)

Julia is so wonderful, such an elegant language.
We need it to get faster and faster.
Thank you very much for all your efforts! Cheers!

It looks like you’re off by one order of magnitude?

If I’m reading the numbers correctly, it looks like Julia takes 71 ms, and SAS 390 ms.

10 Likes

Use sqrt instead of ^0.5.
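
For example, here is a minimal sketch of the same recurrence with the fractional powers replaced by sqrt (pi_calc_sqrt is just an illustrative name, and it returns p0 so the result is actually used):

# Same recurrence as pi_calc, with ^0.5 replaced by sqrt and ^-0.5 by 1/sqrt.
function pi_calc_sqrt(n)
    a0 = sqrt(2.0)
    b0 = 0.0
    p0 = 2 + sqrt(2.0)
    for i in 1:n
        a1 = (a0 + 1/sqrt(a0))/2.0
        b1 = ((1 + b0)*sqrt(a0))/(a0 + b0)
        p1 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
    return p0
end

@timev pi_calc_sqrt(10^7)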

In addition to @mbaz noticing that Julia is actually about 5.6 times FASTER, a smart compiler would optimize away all calls to pi_calcT entirely, since it returns nothing and has no side effects :wink:

1 Like

No, it may have side effects.

julia> sqrt(-0.1)
ERROR: DomainError with -0.1:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at /usr/share/julia/base/math.jl:33
 [2] sqrt(::Float64) at /usr/share/julia/base/math.jl:574
 [3] top-level scope at REPL[15]:1

julia> (-0.1)^(-0.5)
ERROR: DomainError with -0.1:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
 [1] throw_exp_domainerror(::Float64) at /usr/share/julia/base/math.jl:37
 [2] ^(::Float64, ::Float64) at /usr/share/julia/base/math.jl:889
 [3] top-level scope at REPL[18]:1

If the check is removed with @fastmath, it’s much closer to optimizing everything out: the loop is optimized out for pi_calcT, and it is not for pi_calc only because of that safeguard from the compiler. Making it b0 = 0.0 fixes the remaining error branch.
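
For reference, a minimal sketch of the untyped pi_calc with both of those changes applied (pi_calc_fast is just an illustrative name):

# pi_calc with the two changes described above: @fastmath drops the
# domain-error branches from ^, and b0 = 0.0 (instead of the integer
# literal 0) keeps b0 a Float64 throughout the loop.
function pi_calc_fast(n)
    a0 = 2^0.5
    b0 = 0.0
    p0 = 2 + 2^0.5
    @fastmath for i in 1:n
        a1 = (a0 + a0^-0.5)/2.0
        b1 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
end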

2 Likes

Maybe the title of the post should be modified accordingly: “Julia is 500% faster than SAS; we need a slower Julia please (it gets confusing)”.

I hope the OP will forgive me this bad joke, but I wonder how many threads claiming that “Julia is slower than XXX” I have seen so far on Discourse. Almost always with the same conclusion…

Disclaimer: I have already written a similarly wrong thread, Sum performance for Array{Float64,2} elements :blush:

10 Likes

It passes in an integer, it doesn’t modify the integer, and it doesn’t access any global variables, nor does it call any side-effecting functions that I can see.

So, in general a function may have a side effect, but this function doesn’t as far as I can tell, and the compiler could probably determine that.

Of course, perhaps there are some tricky side effects I haven’t considered, like triggering a numerical trap by producing a NaN, or throwing an exception or something… but naively this function seems equivalent to a no-op.

Throwing an error is a side effect, and right below the part you cut off in your quote, @yuyichao showed how those powers can introduce branches that are allowed to throw errors, meaning the compiler doesn’t know the function is pure (even though it is possible in principle for the compiler to prove it won’t throw by looking at the literals).

Here’s a minimal example:

julia> function f(x)
          (0.1)^0.5
       end
f (generic function with 1 method)

julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 0.1, 0.5, 0.5, 0.1))::Float64
│   %2 = Base.ne_float(%1, %1)::Bool
│   %3 = Base.and_int(%2, true)::Bool
└──      goto #3 if not %3
2 ─      invoke Base.Math.throw_exp_domainerror(0.1::Float64)::Union{}
└──      $(Expr(:unreachable))::Union{}
3 ┄      goto #4
4 ─      return %1
) => Float64

This is after the optimizer has run, and Julia still has not figured out that the function won’t throw.

1 Like

Right, at some point in the future the compiler could in theory prove that this function doesn’t throw.

a0 starts as a positive literal… at every iteration a0 becomes a1, and a1 is calculated from positive functions of positive numbers. Similarly, you can prove that the b values remain nonnegative, so b1 divides by the sum of a positive number and a nonnegative number and is therefore never dividing by zero, etc.

So, this function returns nothing and can’t have any side effects. The compiler can’t figure that out today, but in principle at some point in time it could :wink:
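
For what it’s worth, here is a quick numerical spot check of that invariant (not a proof, just the recurrence instrumented with assertions; check_invariants is an illustrative name):

# Spot check: the recurrence keeps a > 0 and b >= 0 at every step,
# so none of the fractional powers can hit the domain error.
function check_invariants(n)
    a0, b0, p0 = 2^0.5, 0.0, 2 + 2^0.5
    for i in 1:n
        @assert a0 > 0 && b0 >= 0
        a1 = (a0 + a0^-0.5)/2.0
        b1 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1 = ((1 + a1)*p0*b1)/(1 + b1)
        a0, b0, p0 = a1, b1, p1
    end
    return p0
end

check_invariants(20)  # runs without tripping an assertion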

It will probably never be worth doing that much work during compilation. But it might be a good idea to emit a warning: “warning: function pi_calcT has no return statements and does not call any functions with intentional side effects; did you forget to include a return value?”

$ julia --warnlevel=clippy

Link for those who are too young to remember the dark old days

5 Likes

:rofl:

1 Like

See Yu Yichao’s comment. LLVM figures it out:

julia> @code_llvm f(2)
define double @julia_f_1539(i64 %0) {
top:
  ret double 0x3FD43D136248490F
}
1 Like

Aha, good point. Importantly though, even though LLVM figures out the exact result to return as a constant, the function is still not hoisted to be evaluated at compile time.

julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @code_llvm f()
define double @julia_f_395() {
top:
  ret double 0x3FD43D136248490F
}

julia> @code_llvm g()
define double @julia_g_397() {
top:
  ret double 0x3FD43D136248490F
}
julia> @btime f()
  1.290 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
  0.020 ns (0 allocations: 0 bytes)
0.31622776601683794

I believe this is because Julia bases things like inlining and hoisting decisions on the typed IR, not the LLVM code.
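
A minimal way to see those two levels side by side (h is just an illustrative name; the outputs are the ones already shown earlier in this thread):

# The typed IR still contains the pow foreigncall and its error branch,
# while the LLVM IR has already been folded to a constant, so heuristics
# operating on the typed IR don't see how cheap the compiled result is.
h() = 0.1^0.5

@code_typed h()
@code_llvm h()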

Note that building logic of this kind into the compiler is not a panacea: it increases the complexity of the compiler and the compilation time.

Julia’s compiler is already doing a lot of work and currently the emphasis is on making compilation faster, in part by not pursuing certain things that the compiler — in theory — could figure out.

2 Likes

Reflecting this fact, I fixed the title. Let me know if there are any concerns.

8 Likes

I’m not sure exactly what the rules are, or how they differ between benchmarking the functions directly:

julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @btime f()
  1.268 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
  0.016 ns (0 allocations: 0 bytes)
0.31622776601683794

versus using them somewhere:

julia> function sumfunc(f, N)
           s = zero(f())
           @simd for n in 1:N
               s += f()
           end
           s
       end
sumfunc (generic function with 1 method)

julia> @btime sumfunc(f, 200)
  9.327 ns (0 allocations: 0 bytes)
63.245553203367585

julia> @btime sumfunc(g, 200)
  9.814 ns (0 allocations: 0 bytes)
63.245553203367585

The LLVM IR looks identical for both versions at a glance (only f is posted below), showing that the literal has been inlined:

; julia> @code_llvm sumfunc(f, 200)
define double @julia_sumfunc_1268(i64 %0) {
top:
  %.inv = icmp sgt i64 %0, 0
  %1 = select i1 %.inv, i64 %0, i64 0
  br i1 %.inv, label %L63.preheader, label %L68

L63.preheader:                                    ; preds = %top
  %min.iters.check = icmp ult i64 %1, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L63.preheader
  %n.mod.vf = urem i64 %1, 32
  %n.vec = sub i64 %1, %n.mod.vf
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]
  %vec.phi21 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
  %vec.phi22 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %4, %vector.body ]
  %vec.phi23 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %2 = fadd fast <8 x double> %vec.phi, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %3 = fadd fast <8 x double> %vec.phi21, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %4 = fadd fast <8 x double> %vec.phi22, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %5 = fadd fast <8 x double> %vec.phi23, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %index.next = add i64 %index, 32
  %6 = icmp eq i64 %index.next, %n.vec
  br i1 %6, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <8 x double> %3, %2
  %bin.rdx24 = fadd fast <8 x double> %4, %bin.rdx
  %bin.rdx25 = fadd fast <8 x double> %5, %bin.rdx24
  %rdx.shuf = shufflevector <8 x double> %bin.rdx25, <8 x double> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx26 = fadd fast <8 x double> %bin.rdx25, %rdx.shuf
  %rdx.shuf27 = shufflevector <8 x double> %bin.rdx26, <8 x double> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx28 = fadd fast <8 x double> %bin.rdx26, %rdx.shuf27
  %rdx.shuf29 = shufflevector <8 x double> %bin.rdx28, <8 x double> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx30 = fadd fast <8 x double> %bin.rdx28, %rdx.shuf29
  %7 = extractelement <8 x double> %bin.rdx30, i32 0
  %cmp.n = icmp eq i64 %1, %n.vec
  br i1 %cmp.n, label %L68, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L63.preheader
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L63.preheader ]
  %bc.merge.rdx = phi double [ 0.000000e+00, %L63.preheader ], [ %7, %middle.block ]
  br label %L63

L63:                                              ; preds = %L63, %scalar.ph
  %value_phi317 = phi i64 [ %9, %L63 ], [ %bc.resume.val, %scalar.ph ]
  %value_phi16 = phi double [ %8, %L63 ], [ %bc.merge.rdx, %scalar.ph ]
  %8 = fadd fast double %value_phi16, 0x3FD43D136248490F
  %9 = add nuw nsw i64 %value_phi317, 1
  %10 = icmp ult i64 %9, %1
  br i1 %10, label %L63, label %L68

L68:                                              ; preds = %L63, %middle.block, %top
  %value_phi7 = phi double [ 0.000000e+00, %top ], [ %7, %middle.block ], [ %8, %L63 ]
  ret double %value_phi7
}

This means that Julia can still decide to inline the result on its own.

2 Likes

Unreachable only means that control flow does not continue after the throw. It is not the reason the error branch will not be reached.

2 Likes

Thanks for the correction. I really need to read/learn more about Julia’s typed code, SSA IR, etc.
One of the reasons I still do almost everything with Julia Expr is familiarity…

So I take it that it’s LLVM that figures out those checks fail, and we’re counting on Julia deciding to inline the function while unaware of how small it is.

1 Like

I couldn’t believe my eyes when I saw the title of the thread. Then I read the thread…

2 Likes

Where do you get those numbers from? The only numbers I can find are from the linked PDF:

LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82

where it says 0.82 seconds, so 820 ms. And the one from the OP:

LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82 (0.71)
SAS        0.30 in function (0.39 in DATA step)

where it says 0.82 (0.71) seconds for Julia.