Julia (with compile time) about 5.6 times faster than SAS

ajsagit · June 27, 2020, 3:26pm

[Original title: Julia benchmark against SAS only 50% We need a faster Julia please]

I have read the article :BENCHMARKING JULIA AGAINST PYTHON https://juliacomputing.com/assets/pdf/JuliaVsPython.pdf
So Amazing!
Then I added a benchmark by SAS with Julia, using the sample code in the article. Here are the details:
In Julia

function pi_calcT(n::Int64)
		a0::Float64 = 2^0.5
		b0::Float64 = 0
		p0::Float64 = 2 + 2^0.5
	        for i in 1:n
			a1::Float64 = (a0 + a0^-0.5)/2.0
			b1::Float64 = ((1 + b0)*(a0^0.5))/(a0 + b0)
			p1::Float64 = ((1 + a1)*p0*b1)/(1 + b1)
			a0 = a1
			b0 = b1
		        p0 = p1
		end
end

function pi_calc(n)
		a0 = 2^0.5
		b0 = 0
		p0 = 2 + 2^0.5
	        for i in 1:n
			a1 = (a0 + a0^-0.5)/2.0
			b1 = ((1 + b0)*(a0^0.5))/(a0 + b0)
			p1 = ((1 + a1)*p0*b1)/(1	+ b1)
			a0 =	a1
			b0 = b1
		        p0 = p1
		end
end
@timev	pi_calc(10^7)
#0.071000 seconds
#elapsed time (ns): 70999817
@timev	pi_calcT(10^7)
# 0.071063 seconds
#elapsed time (ns): 71063303

In SAS

1    option fullstimer;
2    /*fcmp function*/
3    proc fcmp outlib=sasuser.funcs.julia;
4       subroutine pi_calc(n);
5         a0=2**0.5;
6         b0=0;
7         p0=2+2**0.5;
8         do i=1 to n;
9            a1=(a0+a0**-0.5)/2.0;
10           b1=((1+b0)*(a0**0.5))/(a0+b0);
11           p1=((1+a1)*p0*b1)/(1+b1);
12           a0=a1;
13           b0=b1;
14           p0=p1;
15        end;
16      endsub;
17   run;

NOTE: Function pi_calc saved to sasuser.funcs.julia.
NOTE: PROCEDURE FCMP used (Total process time):
      real time           0.08 seconds
      user cpu time       0.04 seconds
      system cpu time     0.04 seconds
      memory              6810.78k
      OS Memory           15848.00k
      Step Count                        1  Switch Count  0


18   options cmplib=sasuser.funcs;
19   /*run the pi_calc*/
20   data _null_;
21   call pi_calc(10**7);
22   run;

NOTE: DATA statement used (Total process time):
      real time           0.30 seconds
      user cpu time       0.29 seconds
      system cpu time     0.00 seconds
      memory              1746.15k
      OS Memory           17136.00k
      Step Count                        2  Switch Count  0


23   /*DATA step for pi_calc*/
24   data _null_;
25        n=10**7;
26        a0=2**0.5;
27        b0=0;
28        p0=2+2**0.5;
29        do i=1 to n;
30           a1=(a0+a0**-0.5)/2.0;
31           b1=((1+b0)*(a0**0.5))/(a0+b0);
32           p1=((1+a1)*p0*b1)/(1+b1);
33           a0=a1;
34           b0=b1;
35           p0=p1;
36        end;
37   run;

NOTE: DATA statement used (Total process time):
      real time           0.39 seconds
      user cpu time       0.37 seconds
      system cpu time     0.00 seconds
      memory              782.93k
      OS Memory           17136.00k
      Step Count                        3  Switch Count  0

Whatever I have tried, the performance of Julia is always slower than SAS.
Compare with the data disclosed in the article:

LANGUAGE TIME(SECONDS)
PYTHON_____4.27
JULIA________ 0.82(0.71)
SAS__________ 0.30 in function (0.39 in Data Step)

Julia is So wonderful, such an elegant computer language.
We need it can be Faster and Faster.
Thank you very much for all your efforts! Cheers!

mbaz · June 27, 2020, 3:51pm

It looks like you’re missing one order of magnitude?

If I’m reading the numbers correctly, it looks like Julia takes 71 ms, and SAS 390 ms.

yuyichao · June 27, 2020, 3:52pm

Use sqrt instead of ^0.5.

dlakelan · June 27, 2020, 3:53pm

In addition to @mbaz noticing that Julia is actually about 5.6 times FASTER, a smart compiler would optimize away all calls to pi_calcT entirely as it returns nothing and has no side effects

yuyichao · June 27, 2020, 4:10pm

No it may have sideeffect.

julia> sqrt(-0.1)
ERROR: DomainError with -0.1:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at /usr/share/julia/base/math.jl:33
 [2] sqrt(::Float64) at /usr/share/julia/base/math.jl:574
 [3] top-level scope at REPL[15]:1

julia> (-0.1)^(-0.5)
ERROR: DomainError with -0.1:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
 [1] throw_exp_domainerror(::Float64) at /usr/share/julia/base/math.jl:37
 [2] ^(::Float64, ::Float64) at /usr/share/julia/base/math.jl:889
 [3] top-level scope at REPL[18]:1

If the check is removed with @fastmath ~~it’s much closer to optimizing everything out~~ it is optimized out for pi_calcT and it is not for pi_calc only because of the safeguard from the compiler. Making it b0 = 0.0 fixes the remaining error branch.

LaurentPlagne · June 27, 2020, 4:32pm

Maybe the title of the post should be modified accordingly : “Julia is 500 % faster than SAS We need slower Julia please (it gets confusing)”.

I hope the OP will forgive me this bad joke, but I wonder how many threads claiming that “Julia was slower than XXX” I have seen so far on discourse. Almost always with the same conclusion…

Disclaimer : “I have already wrote a similarly wrong thread Sum performance for Array{Float64,2} elements”

dlakelan · June 27, 2020, 5:24pm

It passes in an integer, it doesn’t modify the integer, and it doesn’t access any global variables, nor does it call any side-effecting functions that I can see.

So, in general a function may have a side effect, but this function doesn’t as far as I can tell, and the compiler could probably determine that.

Of course perhaps there are some tricky side effects it could have I haven’t considered, like causing some numerical traps to occur by calculating a NaN or throwing an exception or something… but naively it seems like this function is equivalent to a no-op.

Mason · June 27, 2020, 5:54pm

Throwing an error is a sideeffect and right below the part you cut off in your quote, @yuyichao showed how those powers can introduce branches that are allowed to throw errors, meaning the compiler doesn’t know it’s pure (even though it’s possible in principal for the compiler to be able to prove it won’t throw by looking at the literals).

Here’s a minimal example:

julia> function f(x)
          (0.1)^0.5
       end
f (generic function with 1 method)

julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 0.1, 0.5, 0.5, 0.1))::Float64
│   %2 = Base.ne_float(%1, %1)::Bool
│   %3 = Base.and_int(%2, true)::Bool
└──      goto #3 if not %3
2 ─      invoke Base.Math.throw_exp_domainerror(0.1::Float64)::Union{}
└──      $(Expr(:unreachable))::Union{}
3 ┄      goto #4
4 ─      return %1
) => Float64

This is after the optimizer is run and Julia has not figured out that the function won’t throw.

dlakelan · June 27, 2020, 6:20pm

Right, the compiler could in theory prove that this function doesn’t throw at some point in the future.

a0 starts as a positive literal… at every iteration a0 becomes a1, and a1 is calculated by positive functions of positive numbers. Similarly you can prove that the b values remain nonnegative and so b1 dividing by the sum of a positive number and a nonnegative number is never dividing by zero, etc.

So, this function returns nothing and can’t have any side effects. The compiler can’t figure that out today, but in principle at some point in time it could

It probably will never be worth it to do that much work during compiles. But it might be a good idea to throw off a warning: “warning function pi_calcT does not have any return statements and does not call any functions that have intentional side effects, did you forget to include a return value?”

mbaz · June 27, 2020, 6:27pm

$ julia --warnlevel=clippy

Link for those who are too young to remember the dark old days

dlakelan · June 27, 2020, 6:27pm

Elrod · June 27, 2020, 7:01pm

See Yu Yichao’s comment. LLVM figures it out:

julia> @code_llvm f(2)
define double @julia_f_1539(i64 %0) {
top:
  ret double 0x3FD43D136248490F
}

Mason · June 27, 2020, 7:34pm

Aha, good point. Importantly though, even though LLVM figures out the exact result to return as a constant, the function is still not hoisted to be evaluated at compile time.

julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @code_llvm f()
define double @julia_f_395() {
top:
  ret double 0x3FD43D136248490F
}

julia> @code_llvm g()
define double @julia_g_397() {
top:
  ret double 0x3FD43D136248490F
}

julia> @btime f()
  1.290 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
  0.020 ns (0 allocations: 0 bytes)
0.31622776601683794

I beleive this is because Julia bases things like inlining and hoisting decisions on the typed IR, not the LLVM code.

Tamas_Papp · June 27, 2020, 7:39pm

Note that building logic of this kind into the compiler is not a panacea: it increases the complexity of the compiler and the compilation time.

Julia’s compiler is already doing a lot of work and currently the emphasis is on making compilation faster, in part by not pursuing certain things that the compiler — in theory — could figure out.

ChrisRackauckas · June 27, 2020, 7:53pm

Reflecting this fact, I fixed the title. Let me know if there’s any concerns.

Elrod · June 27, 2020, 7:54pm

I’m not sure what exactly the rules are, or how they differ when benchmarking the functions:

julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @btime f()
  1.268 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
  0.016 ns (0 allocations: 0 bytes)
0.31622776601683794

versus using them somewhere:

julia> function sumfunc(f, N)
           s = zero(f())
           @simd for n in 1:N
               s += f()
           end
           s
       end
sumfunc (generic function with 1 method)

julia> @btime sumfunc(f, 200)
  9.327 ns (0 allocations: 0 bytes)
63.245553203367585

julia> @btime sumfunc(g, 200)
  9.814 ns (0 allocations: 0 bytes)
63.245553203367585

The llvm looks identical between both versions at a glance, only f posted below, showing the literal has been inlined:

; julia> @code_llvm sumfunc(f, 200)
define double @julia_sumfunc_1268(i64 %0) {
top:
  %.inv = icmp sgt i64 %0, 0
  %1 = select i1 %.inv, i64 %0, i64 0
  br i1 %.inv, label %L63.preheader, label %L68

L63.preheader:                                    ; preds = %top
  %min.iters.check = icmp ult i64 %1, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L63.preheader
  %n.mod.vf = urem i64 %1, 32
  %n.vec = sub i64 %1, %n.mod.vf
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]
  %vec.phi21 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
  %vec.phi22 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %4, %vector.body ]
  %vec.phi23 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %2 = fadd fast <8 x double> %vec.phi, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %3 = fadd fast <8 x double> %vec.phi21, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %4 = fadd fast <8 x double> %vec.phi22, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %5 = fadd fast <8 x double> %vec.phi23, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
  %index.next = add i64 %index, 32
  %6 = icmp eq i64 %index.next, %n.vec
  br i1 %6, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <8 x double> %3, %2
  %bin.rdx24 = fadd fast <8 x double> %4, %bin.rdx
  %bin.rdx25 = fadd fast <8 x double> %5, %bin.rdx24
  %rdx.shuf = shufflevector <8 x double> %bin.rdx25, <8 x double> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx26 = fadd fast <8 x double> %bin.rdx25, %rdx.shuf
  %rdx.shuf27 = shufflevector <8 x double> %bin.rdx26, <8 x double> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx28 = fadd fast <8 x double> %bin.rdx26, %rdx.shuf27
  %rdx.shuf29 = shufflevector <8 x double> %bin.rdx28, <8 x double> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx30 = fadd fast <8 x double> %bin.rdx28, %rdx.shuf29
  %7 = extractelement <8 x double> %bin.rdx30, i32 0
  %cmp.n = icmp eq i64 %1, %n.vec
  br i1 %cmp.n, label %L68, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L63.preheader
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L63.preheader ]
  %bc.merge.rdx = phi double [ 0.000000e+00, %L63.preheader ], [ %7, %middle.block ]
  br label %L63

L63:                                              ; preds = %L63, %scalar.ph
  %value_phi317 = phi i64 [ %9, %L63 ], [ %bc.resume.val, %scalar.ph ]
  %value_phi16 = phi double [ %8, %L63 ], [ %bc.merge.rdx, %scalar.ph ]
  %8 = fadd fast double %value_phi16, 0x3FD43D136248490F
  %9 = add nuw nsw i64 %value_phi317, 1
  %10 = icmp ult i64 %9, %1
  br i1 %10, label %L63, label %L68

L68:                                              ; preds = %L63, %middle.block, %top
  %value_phi7 = phi double [ 0.000000e+00, %top ], [ %7, %middle.block ], [ %8, %L63 ]
  ret double %value_phi7
}

This means that Julia can still decide to inline the result on its own.

yuyichao · June 27, 2020, 8:02pm

Unreachable means the code will not reach AFTER the throw. It is not the reason the error branch will not be reached.

Elrod · June 27, 2020, 8:10pm

Thanks for the correction. I really need to read/learn more about Julia’s code typed, ssa, etc.
One of the reasons I still do almost everything with Julia Expr is familiarity…

So I take it that it’s LLVM that figures out those checks fail, and we’re counting on Julia deciding to inline the function while unaware of how small it is.

xiaodai · June 27, 2020, 10:18pm

I couldn’t believe my eyes when I saw the title of the thread. Then I read the thread…

karlroyen · June 28, 2020, 4:14am

Where do you get those numbers from. The only numbers I can find are from the linked pdf:

LANGUAGE	 TIME	(SECONDS)
PYTHON          4.27
JULIA                0.82

where it says 0.82 seconds, so 820 ms. And the one from the TO:

LANGUAGE TIME(SECONDS)
PYTHON_____4.27
JULIA________ 0.82(0.71)
SAS__________ 0.30 in function (0.39 in Data Step)

where it says 0.82 (0.71) seconds for Julia.

Topic		Replies	Views
Julia speed Performance question	28	2249	March 14, 2024
Small benchmark Performance benchmark	14	2755	November 21, 2018
Benchmark for latest julia? Community question	126	14800	April 1, 2019
Quite bad performance of Julia 0.6.4 vs Python+Numpy General Usage	26	5202	November 13, 2018
Question on simple performance comparison between Python and Julia General Usage question	23	1040	June 13, 2023

Julia (with compile time) about 5.6 times faster than SAS

Related topics