# Julia (with compile time) about 5.6 times faster than SAS

[Original title: Julia benchmark against SAS only 50%. We need a faster Julia please]

I have read the article "Benchmarking Julia against Python" (https://juliacomputing.com/assets/pdf/JuliaVsPython.pdf).
So amazing!
Then I benchmarked SAS against Julia, using the sample code from the article. Here are the details:
In Julia

```julia
function pi_calcT(n::Int64)
    a0::Float64 = 2^0.5
    b0::Float64 = 0
    p0::Float64 = 2 + 2^0.5
    for i in 1:n
        a1::Float64 = (a0 + a0^-0.5)/2.0
        b1::Float64 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1::Float64 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
end

function pi_calc(n)
    a0 = 2^0.5
    b0 = 0
    p0 = 2 + 2^0.5
    for i in 1:n
        a1 = (a0 + a0^-0.5)/2.0
        b1 = ((1 + b0)*(a0^0.5))/(a0 + b0)
        p1 = ((1 + a1)*p0*b1)/(1 + b1)
        a0 = a1
        b0 = b1
        p0 = p1
    end
end

@timev pi_calc(10^7)
# 0.071000 seconds
# elapsed time (ns): 70999817
@timev pi_calcT(10^7)
# 0.071063 seconds
# elapsed time (ns): 71063303
```
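A side note on methodology: a single `@timev` call measures one run, and the very first call of a function also pays compilation time. A common Base-only pattern is to warm the function up once and then time it. This is a sketch with a hypothetical variant `pi_calc_ret` that returns the estimate, so the loop cannot be optimized away entirely:

```julia
function pi_calc_ret(n)
    a0 = 2^0.5
    b0 = 0.0          # start as Float64 to keep the loop type-stable
    p0 = 2 + 2^0.5
    for i in 1:n
        a1 = (a0 + a0^-0.5) / 2.0
        b1 = ((1 + b0) * a0^0.5) / (a0 + b0)
        p1 = ((1 + a1) * p0 * b1) / (1 + b1)
        a0, b0, p0 = a1, b1, p1
    end
    return p0         # returning a value prevents the whole loop from being elided
end

pi_calc_ret(1)                    # warm-up call: compile before timing
t = @elapsed pi_calc_ret(10^7)    # seconds for the timed run only
```

For more careful measurements, the BenchmarkTools.jl package's `@btime` takes many samples and reports the minimum.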

In SAS

```
1    option fullstimer;
2    /*fcmp function*/
3    proc fcmp outlib=sasuser.funcs.julia;
4       subroutine pi_calc(n);
5         a0=2**0.5;
6         b0=0;
7         p0=2+2**0.5;
8         do i=1 to n;
9            a1=(a0+a0**-0.5)/2.0;
10           b1=((1+b0)*(a0**0.5))/(a0+b0);
11           p1=((1+a1)*p0*b1)/(1+b1);
12           a0=a1;
13           b0=b1;
14           p0=p1;
15        end;
16      endsub;
17   run;

NOTE: Function pi_calc saved to sasuser.funcs.julia.
NOTE: PROCEDURE FCMP used (Total process time):
real time           0.08 seconds
user cpu time       0.04 seconds
system cpu time     0.04 seconds
memory              6810.78k
OS Memory           15848.00k
Step Count                        1  Switch Count  0

18   options cmplib=sasuser.funcs;
19   /*run the pi_calc*/
20   data _null_;
21   call pi_calc(10**7);
22   run;

NOTE: DATA statement used (Total process time):
real time           0.30 seconds
user cpu time       0.29 seconds
system cpu time     0.00 seconds
memory              1746.15k
OS Memory           17136.00k
Step Count                        2  Switch Count  0

23   /*DATA step for pi_calc*/
24   data _null_;
25        n=10**7;
26        a0=2**0.5;
27        b0=0;
28        p0=2+2**0.5;
29        do i=1 to n;
30           a1=(a0+a0**-0.5)/2.0;
31           b1=((1+b0)*(a0**0.5))/(a0+b0);
32           p1=((1+a1)*p0*b1)/(1+b1);
33           a0=a1;
34           b0=b1;
35           p0=p1;
36        end;
37   run;

NOTE: DATA statement used (Total process time):
real time           0.39 seconds
user cpu time       0.37 seconds
system cpu time     0.00 seconds
memory              782.93k
OS Memory           17136.00k
Step Count                        3  Switch Count  0
```

Whatever I tried, Julia always came out slower than SAS.
Compared with the data disclosed in the article:

```
LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82 (0.71)
SAS        0.30 in function (0.39 in DATA step)
```

Julia is so wonderful, such an elegant computer language.
We need it to get faster and faster.
Thank you very much for all your efforts! Cheers!

It looks like you're missing one order of magnitude?

If I'm reading the numbers correctly, it looks like Julia takes 71 ms, and SAS 390 ms.

10 Likes

Use `sqrt` instead of `^0.5`.
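For concreteness, here is a sketch of the loop written with `sqrt` (hypothetical name `pi_calc_sqrt`; it also starts `b0` at `0.0` and reuses one `sqrt(a0)` for both the `a0^0.5` and `a0^-0.5` terms):

```julia
function pi_calc_sqrt(n)
    a0 = sqrt(2.0)
    b0 = 0.0
    p0 = 2.0 + sqrt(2.0)
    for i in 1:n
        s  = sqrt(a0)              # computed once per iteration
        a1 = (a0 + inv(s)) / 2.0   # a0^-0.5 == 1/sqrt(a0)
        b1 = ((1 + b0) * s) / (a0 + b0)
        p1 = ((1 + a1) * p0 * b1) / (1 + b1)
        a0, b0, p0 = a1, b1, p1
    end
    return p0
end
```

`sqrt` is cheaper than the general `^` with a float exponent, which lowers to `llvm.pow`.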

In addition to @mbaz noticing that Julia is actually about 5.6 times FASTER, a smart compiler would optimize away all calls to `pi_calcT` entirely, since it returns nothing and has no side effects.

1 Like

No, it may have side effects.

```julia
julia> sqrt(-0.1)
ERROR: DomainError with -0.1:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(::Symbol, ::Float64) at /usr/share/julia/base/math.jl:33
[2] sqrt(::Float64) at /usr/share/julia/base/math.jl:574
[3] top-level scope at REPL[15]:1

julia> (-0.1)^(-0.5)
ERROR: DomainError with -0.1:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
[1] throw_exp_domainerror(::Float64) at /usr/share/julia/base/math.jl:37
[2] ^(::Float64, ::Float64) at /usr/share/julia/base/math.jl:889
[3] top-level scope at REPL[18]:1
```

If the check is removed with `@fastmath`, it's much closer to optimizing everything out: everything is optimized out for `pi_calcT`, and it is not for `pi_calc` only because of this safeguard from the compiler. Making it `b0 = 0.0` fixes the remaining error branch.
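A sketch of that variant (hypothetical name `pi_calc_fm`): `@fastmath` swaps the checked `^` for the unchecked fast version, so the domain-error branches disappear and a dead loop becomes removable:

```julia
function pi_calc_fm(n)
    a0 = 2^0.5
    b0 = 0.0               # Float64 start, as suggested above
    p0 = 2 + 2^0.5
    @fastmath for i in 1:n
        a1 = (a0 + a0^-0.5) / 2.0
        b1 = ((1 + b0) * (a0^0.5)) / (a0 + b0)
        p1 = ((1 + a1) * p0 * b1) / (1 + b1)
        a0 = a1; b0 = b1; p0 = p1
    end
    return nothing         # same shape as pi_calcT: no result, no visible effect
end

# @code_llvm pi_calc_fm(10^7)   # inspect whether the loop body survives
```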

2 Likes

Maybe the title of the post should be modified accordingly: "Julia is 500% faster than SAS. We need a slower Julia please (it gets confusing)".

I hope the OP will forgive me this bad joke, but I wonder how many threads claiming that "Julia is slower than XXX" I have seen so far on Discourse. Almost always with the same conclusion...

Disclaimer: I have already written a similarly wrong thread, "Sum performance for Array{Float64,2} elements".

10 Likes

It passes in an integer, it doesn't modify the integer, and it doesn't access any global variables, nor does it call any side-effecting functions that I can see.

So, in general a function may have a side effect, but this function doesn't as far as I can tell, and the compiler could probably determine that.

Of course perhaps there are some tricky side effects it could have that I haven't considered, like causing some numerical traps to occur by calculating a NaN, or throwing an exception, or something... but naively it seems like this function is equivalent to a no-op.

Throwing an error is a side effect, and right below the part you cut off in your quote, @yuyichao showed how those powers can introduce branches that are allowed to throw errors, meaning the compiler doesn't know the function is pure (even though it is possible in principle for the compiler to prove it won't throw by looking at the literals).

Here's a minimal example:

```julia
julia> function f(x)
           (0.1)^0.5
       end
f (generic function with 1 method)

julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), 0, :(:llvmcall), 0.1, 0.5, 0.5, 0.1))::Float64
│   %2 = Base.ne_float(%1, %1)::Bool
│   %3 = Base.and_int(%2, true)::Bool
└──      goto #3 if not %3
2 ─      invoke Base.Math.throw_exp_domainerror(0.1::Float64)::Union{}
└──      $(Expr(:unreachable))::Union{}
3 ┄      goto #4
4 ─      return %1
) => Float64
```

This is after the optimizer is run, and Julia has not figured out that the function won't throw.

1 Like

Right, the compiler could in theory prove that this function doesn't throw at some point in the future.

`a0` starts as a positive literal... at every iteration `a0` becomes `a1`, and `a1` is calculated by positive functions of positive numbers. Similarly you can prove that the `b` values remain nonnegative, and so the division in `b1` by the sum of a positive number and a nonnegative number is never a division by zero, etc.

So, this function returns nothing and can't have any side effects. The compiler can't figure that out today, but in principle at some point in time it could.
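That invariant argument is easy to spot-check numerically. A quick sketch (hypothetical helper `check_invariants`) asserting positivity at every step of the same iteration:

```julia
function check_invariants(n)
    a0 = 2^0.5
    b0 = 0.0
    p0 = 2 + 2^0.5
    for i in 1:n
        @assert a0 > 0              # a stays positive...
        @assert b0 >= 0             # ...b stays nonnegative...
        @assert a0 + b0 > 0         # ...so no division by zero
        a1 = (a0 + a0^-0.5) / 2.0
        b1 = ((1 + b0) * (a0^0.5)) / (a0 + b0)
        p1 = ((1 + a1) * p0 * b1) / (1 + b1)
        a0, b0, p0 = a1, b1, p1
    end
    return true
end

check_invariants(10_000)
```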

It probably will never be worth doing that much work during compilation. But it might be a good idea to emit a warning: "warning: function pi_calcT has no return statements and does not call any functions with intentional side effects; did you forget to include a return value?"

`$ julia --warnlevel=clippy`

Link for those who are too young to remember the dark old days

5 Likes


See Yu Yichao's comment. LLVM figures it out:

```julia
julia> @code_llvm f(2)
define double @julia_f_1539(i64 %0) {
top:
ret double 0x3FD43D136248490F
}
```
1 Like

Aha, good point. Importantly though, even though LLVM figures out the exact result to return as a constant, the function is still not hoisted to be evaluated at compile time.

```julia
julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @code_llvm f()
define double @julia_f_395() {
top:
ret double 0x3FD43D136248490F
}

julia> @code_llvm g()
define double @julia_g_397() {
top:
ret double 0x3FD43D136248490F
}
```

```julia
julia> @btime f()
1.290 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
0.020 ns (0 allocations: 0 bytes)
0.31622776601683794
```

I believe this is because Julia bases things like inlining and hoisting decisions on the typed IR, not the LLVM code.
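Related to the 0.020 ns readings above: timings far below a nanosecond usually mean the benchmark measured a folded constant rather than the computation. The usual BenchmarkTools workaround is to hide the input behind an interpolated `Ref` so the compiler cannot see the literal. A sketch, assuming BenchmarkTools.jl is installed:

```julia
using BenchmarkTools

h(x) = x^0.5

# With a literal argument, the call can constant-fold away:
# @btime h(0.1)
# Interpolating a Ref and unwrapping it per evaluation defeats the folding:
@btime h($(Ref(0.1))[])
```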

Note that building logic of this kind into the compiler is not a panacea: it increases the complexity of the compiler and the compilation time.

Julia's compiler is already doing a lot of work, and currently the emphasis is on making compilation faster, in part by not pursuing certain things that the compiler, in theory, could figure out.

2 Likes

Reflecting this fact, I fixed the title. Let me know if there are any concerns.

8 Likes

I'm not sure what exactly the rules are, or how they differ when benchmarking the functions:

```julia
julia> f() = 0.1^0.5
f (generic function with 1 method)

julia> g() = 0.31622776601683794
g (generic function with 1 method)

julia> @btime f()
1.268 ns (0 allocations: 0 bytes)
0.31622776601683794

julia> @btime g()
0.016 ns (0 allocations: 0 bytes)
0.31622776601683794
```

versus using them somewhere:

```julia
julia> function sumfunc(f, N)
           s = zero(f())
           @simd for n in 1:N
               s += f()
           end
           s
       end
sumfunc (generic function with 1 method)

julia> @btime sumfunc(f, 200)
9.327 ns (0 allocations: 0 bytes)
63.245553203367585

julia> @btime sumfunc(g, 200)
9.814 ns (0 allocations: 0 bytes)
63.245553203367585
```

The LLVM IR looks identical between both versions at a glance; only `f`'s is posted below, showing that the literal has been inlined:

```llvm
; julia> @code_llvm sumfunc(f, 200)
define double @julia_sumfunc_1268(i64 %0) {
top:
%.inv = icmp sgt i64 %0, 0
%1 = select i1 %.inv, i64 %0, i64 0
br i1 %.inv, label %L63.preheader, label %L68

%min.iters.check = icmp ult i64 %1, 32
br i1 %min.iters.check, label %scalar.ph, label %vector.ph

%n.mod.vf = urem i64 %1, 32
%n.vec = sub i64 %1, %n.mod.vf
br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]
%vec.phi21 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%vec.phi22 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %4, %vector.body ]
%vec.phi23 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
%2 = fadd fast <8 x double> %vec.phi, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
%3 = fadd fast <8 x double> %vec.phi21, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
%4 = fadd fast <8 x double> %vec.phi22, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
%5 = fadd fast <8 x double> %vec.phi23, <double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F, double 0x3FD43D136248490F>
%index.next = add i64 %index, 32
%6 = icmp eq i64 %index.next, %n.vec
br i1 %6, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
%bin.rdx = fadd fast <8 x double> %3, %2
%bin.rdx24 = fadd fast <8 x double> %4, %bin.rdx
%bin.rdx25 = fadd fast <8 x double> %5, %bin.rdx24
%rdx.shuf = shufflevector <8 x double> %bin.rdx25, <8 x double> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx26 = fadd fast <8 x double> %bin.rdx25, %rdx.shuf
%rdx.shuf27 = shufflevector <8 x double> %bin.rdx26, <8 x double> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx28 = fadd fast <8 x double> %bin.rdx26, %rdx.shuf27
%rdx.shuf29 = shufflevector <8 x double> %bin.rdx28, <8 x double> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx30 = fadd fast <8 x double> %bin.rdx28, %rdx.shuf29
%7 = extractelement <8 x double> %bin.rdx30, i32 0
%cmp.n = icmp eq i64 %1, %n.vec
br i1 %cmp.n, label %L68, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L63.preheader
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L63.preheader ]
%bc.merge.rdx = phi double [ 0.000000e+00, %L63.preheader ], [ %7, %middle.block ]
br label %L63

L63:                                              ; preds = %L63, %scalar.ph
%value_phi317 = phi i64 [ %9, %L63 ], [ %bc.resume.val, %scalar.ph ]
%value_phi16 = phi double [ %8, %L63 ], [ %bc.merge.rdx, %scalar.ph ]
%8 = fadd fast double %value_phi16, 0x3FD43D136248490F
%9 = add nuw nsw i64 %value_phi317, 1
%10 = icmp ult i64 %9, %1
br i1 %10, label %L63, label %L68

L68:                                              ; preds = %L63, %middle.block, %top
%value_phi7 = phi double [ 0.000000e+00, %top ], [ %7, %middle.block ], [ %8, %L63 ]
ret double %value_phi7
}
```

This means that Julia can still decide to inline the result on its own.

2 Likes

`unreachable` means the code AFTER the throw will not be reached. It is not the reason the error branch itself will not be reached.

2 Likes

One of the reasons I still do almost everything with Julia `Expr` is familiarity...

So I take it that it's LLVM that figures out those checks fail, and we're counting on Julia deciding to inline the function while unaware of how small it is.

1 Like

I couldn't believe my eyes when I saw the title of the thread. Then I read the thread...

2 Likes

Where do you get those numbers from? The only numbers I can find are from the linked PDF:

```
LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82
```

where it says 0.82 seconds, so 820 ms. And the ones from the OP:

```
LANGUAGE   TIME (SECONDS)
PYTHON     4.27
JULIA      0.82 (0.71)
SAS        0.30 in function (0.39 in DATA step)
```

where it says 0.82 (0.71) seconds for Julia.