CAS benchmarks (Symbolics.jl and Maxima)

Here is the equivalent code in Julia:

function evalpoly(x, p::Tuple)
    if @generated
        # Generated branch: the tuple length N is known from the type,
        # so Horner's rule is unrolled into straight-line muladd calls.
        N = length(p.parameters)
        ex = :(p[end])
        for i in N-1:-1:1
            ex = :(muladd(x, $ex, p[$i]))
        end
        ex
    else
        # Fallback used if the compiler chooses not to generate specialized code.
        _evalpoly(x, p)
    end
end

This code is very similar to the Lisp code you wrote, but there is one major difference: nowhere will you find the equivalent of double-float. In fact, evalpoly will generate optimized code for whatever coefficient types or type of x you use.

When everything is an integer:

julia> @code_native evalpoly(1, (1,2,3))
	.text
; ┌ @ math.jl:131 within `evalpoly'
	movq	16(%rsi), %rax
; │┌ @ math.jl:132 within `macro expansion'
; ││┌ @ promotion.jl:404 within `muladd'
; │││┌ @ int.jl:88 within `*'
	imulq	%rdi, %rax
; │││└
; │││┌ @ int.jl:87 within `+'
	addq	8(%rsi), %rax
; │││└
; │││┌ @ int.jl:88 within `*'
	imulq	%rdi, %rax
; │││└
; │││┌ @ int.jl:87 within `+'
	addq	(%rsi), %rax
; ││└└
	retq
	nopw	%cs:(%rax,%rax)
; └└

If x is a double:

julia> @code_native evalpoly(1.0, (1,2,3))
	.text
; ┌ @ math.jl:131 within `evalpoly'
; │┌ @ math.jl:132 within `macro expansion'
; ││┌ @ promotion.jl:358 within `muladd'
; │││┌ @ promotion.jl:298 within `promote'
; ││││┌ @ promotion.jl:276 within `_promote'
; │││││┌ @ number.jl:7 within `convert'
; ││││││┌ @ float.jl:94 within `Float64'
	vcvtsi2sdq	16(%rdi), %xmm1, %xmm1
	vcvtsi2sdq	8(%rdi), %xmm2, %xmm2
; │││└└└└
; │││ @ promotion.jl:358 within `muladd' @ float.jl:339
	vfmadd231sd	%xmm1, %xmm0, %xmm2     # xmm2 = (xmm0 * xmm1) + xmm2
; │││ @ promotion.jl:358 within `muladd'
; │││┌ @ promotion.jl:298 within `promote'
; ││││┌ @ promotion.jl:276 within `_promote'
; │││││┌ @ number.jl:7 within `convert'
; ││││││┌ @ float.jl:94 within `Float64'
	vcvtsi2sdq	(%rdi), %xmm3, %xmm1
; │││└└└└
; │││ @ promotion.jl:358 within `muladd' @ float.jl:339
	vfmadd213sd	%xmm1, %xmm2, %xmm0     # xmm0 = (xmm2 * xmm0) + xmm1
; ││└
	retq
	nopl	(%rax)
; └└

And so on. What if there is a better algorithm for some types? Just define another method and multiple dispatch will do its thing. That is, in my opinion, the beauty of Julia.
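For illustration, here is a minimal sketch (hypothetical names, not from the post above) of what "define another method" looks like: a generic Horner fallback plus one specialized method, with dispatch selecting between them purely from the argument types.

horner(x, p) = foldr((c, acc) -> muladd(x, acc, c), p)   # Horner's rule, constant term first

myevalpoly(x, p::Tuple) = horner(x, p)                   # generic method
myevalpoly(x::BigFloat, p::Tuple) =                      # specialized method:
    horner(x, map(BigFloat, p))                          # widen the coefficients first

myevalpoly(1.5, (1, 2, 3))        # uses the generic method
myevalpoly(big"1.5", (1, 2, 3))   # uses the BigFloat method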
8 Likes

There is some ambiguity in your post. If the line in the program is presented to the compiler as 'evalpoly(1.0, (1,2,3))', then the only code that needs to be generated is 'return 6.0'.

If the line in the program is 'evalpoly(1.0, <a tuple not known at compile time>)', then it can't be unrolled at compile time. Are you saying that the program is unrolled at run time? That seems wasteful.

If the line in the program is 'evalpoly(D, (1,2,3))', then it seems to me this information is quite similar to Lisp's (declare (type double-float D)). That is, somewhere there is a specification of the type of the variable D, or, if it is an expression, its type can be deduced at compile time.
Generating a particular version of evalpoly for this call-location in the source code based on the types of the arguments right here seems to be the Julia claim to fame. Is this right?
If so, I think it has some similarities to the Lisp compiler directive “inline”. It may be done by macro-expansion, compiler macro-expansion or mostly by the compiler unraveling some code.
Possibly Julia is neater on this, but really, everyone is generating code to be executed by a machine, right?

It seems to me the vector of coefficients has to be coerced (sometime) to known types, presumably float, prior to using machine float ops. Using integers and the prospect of needing bigfloats means that integer machine ops may not do the job.

In the context of a CAS, where the types of the coefficients and of the point of evaluation are dynamically determined, none of this type information can be assured in general, and the arguments are all of "type" arbitrary expression. Perhaps then the same evalpoly could evaluate a polynomial at the symbolic point z+4, with coefficients that are symbols, say (a, (z+1), 4.0). I expect a sufficiently general muladd might be enough to make the program work, though that leaves a number of questions up in the air, such as whether (z+1)*(z+4) is multiplied through or left factored.
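To make that concrete, here is a minimal sketch (a hypothetical Sym wrapper type, not any package mentioned here): once * and + (and hence muladd) are defined for a symbolic type, the very same evalpoly runs on it and returns an unsimplified nested expression, leaving exactly the simplification questions above open.

struct Sym
    ex::Any                      # a symbol, number, or Expr
end
Base.:*(a::Sym, b::Sym) = Sym(Expr(:call, :*, a.ex, b.ex))
Base.:+(a::Sym, b::Sym) = Sym(Expr(:call, :+, a.ex, b.ex))
Base.muladd(x::Sym, y::Sym, z::Sym) = x * y + z

x = Sym(:(z + 4))                            # symbolic evaluation point z+4
coeffs = (Sym(:a), Sym(:(z + 1)), Sym(4.0))  # symbolic coefficients
evalpoly(x, coeffs).ex
# => :((z + 4) * ((z + 4) * 4.0 + (z + 1)) + a), left nested, neither expanded nor factored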

Faster evaluation of polynomials of sufficiently high degree to make the various theoretical hacks appropriate generally requires that the arithmetic be unit-cost and fully accurate. Since neither of these conditions may be met with arbitrary-precision integers, rationals, or floats, I'd be cautious of this. Note also that polynomials of high degree (50 or more? depends on what you are doing …) are generally rather nasty functions. Try plotting product(x-i,i,1,50). So they are not so useful in applications. There are exceptions, with various orthonormal polynomials on a restricted range, but even there, high degree is unlikely.

Evaluating a single polynomial at a single point by SIMD to gain speed sounds like a stretch, but then computing hardware marches on.

GitHub - augustt198/SIMDPoly.jl does this in a pretty simple manner. Basically, if you precompute (x, x^2, x^3, x^4), you can evaluate 4 terms of the polynomial per instruction (with AVX2). Empirically, this is faster once you get above degree 10 or so.
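A rough sketch of the idea (not SIMDPoly.jl's actual API, just the arithmetic): split the coefficients into chunks of four, evaluate each chunk against the precomputed powers, and combine the chunks with Horner's rule in x^4.

# Degree-7 example, 8 coefficients with the constant term first; the real
# package packs these into SIMD lanes, this scalar version only shows the math.
function chunked_eval(x, c::NTuple{8})
    xs = (one(x), x, x * x, x * x * x)   # precomputed powers x^0 .. x^3
    lo = sum(c[1:4] .* xs)               # c1 + c2*x + c3*x^2 + c4*x^3
    hi = sum(c[5:8] .* xs)               # c5 + c6*x + c7*x^2 + c8*x^3
    return muladd(x^4, hi, lo)           # lo + x^4 * hi
end

chunked_eval(1.2, (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0))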

1 Like

Not explicitly SIMD for a single evaluation (because the idea is still that you'd SIMD across multiple evaluations), but SLEEFPirates.estrin uses Estrin's method, which scales much better for multiple evaluations:

julia> coef = ntuple(_ -> randn(), Val(30));

julia> packed = packpoly1x4r(coef);

julia> @btime evalpoly1x4r($(Ref(1.2))[], $packed)
  4.644 ns (0 allocations: 0 bytes)
-204.5524827611573

julia> @btime SLEEFPirates.estrin($(Ref(1.2))[], $coef)
  5.437 ns (0 allocations: 0 bytes)
-204.5524827611573

julia> @btime evalpoly($(Ref(1.2))[], $coef)
  12.540 ns (0 allocations: 0 bytes)
-204.5524827611572

But SIMDPoly won for all the numbers of coefficients I tried when it comes to speeding up a single evaluation.
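For reference, here is a minimal sketch of Estrin's scheme for four coefficients (not the SLEEFPirates implementation): terms are paired so that the muladds at each level are independent of each other and can execute in parallel.

function estrin4(x, p::NTuple{4})
    a = muladd(x, p[2], p[1])    # p1 + p2*x
    b = muladd(x, p[4], p[3])    # p3 + p4*x
    return muladd(x * x, b, a)   # (p1 + p2*x) + (p3 + p4*x)*x^2
end

estrin4(1.2, (1.0, 2.0, 3.0, 4.0))   # same value as evalpoly(1.2, (1.0, 2.0, 3.0, 4.0)), up to rounding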

Types are known at compile time, so in that example what was known is evalpoly(::Int, ::NTuple{3,Int}).
If we define a function f such that everything is known:

julia> f() = evalpoly(1.0, (1,2,3))
f (generic function with 1 method)

julia> @code_typed f()
CodeInfo(
1 ─     return 6.0
) => Float64

If we make the coefficients known:

julia> f(x) = evalpoly(x, (1,2,3))
f (generic function with 2 methods)

julia> @code_typed f(2)
CodeInfo(
1 ─ %1 = Base.mul_int(x, 3)::Int64
│   %2 = Base.add_int(%1, 2)::Int64
│   %3 = Base.mul_int(x, %2)::Int64
│   %4 = Base.add_int(%3, 1)::Int64
└──      return %4
) => Int64

Etc.

If the number of coefficients is unknown but large, the compiler would unroll by some constant amount (dependent on the particular machine it’s compiling for) to make good use of the CPU’s execution resources.
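One quick way to see this (output omitted): with a Vector the length is not encoded in the type, so the code cannot be fully unrolled the way the tuple version is.

@code_typed evalpoly(2, [1, 2, 3])   # the typed IR contains a loop, not an unrolled chain of muladds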

6 Likes

Yes. If we don’t know the argument types in advance, the compiler will generate specialized code at run time. However, it is cached, so if another call happens with the same types of coefficients (which is likely) that code is reused. This is per argument signature, not per call site, so we can reuse the code no matter which call site caused the code to be generated, or when it was generated.

Also, the if @generated construct gives the compiler the option not to generate code. If it doesn’t want to generate specialized code, it can use the specified run-time-only implementation instead. In practice we always do code generation, but you can pass e.g. --compile=min to disable it.

16 Likes

This is cool to see because it reminds me of a moment that was probably a key influence on julia. I learned the lisp lore that you could, say, write a macro to add type declarations to every subexpression, and then a good lisp implementation would generate fast code. I immediately thought two things:

  1. Why not just declare the types of variables? You can, but common lisp is generally too “type-unstable” (as we would call it) to be sure you’ll get the intended code. (May not be true for floats but could happen with other types.)
  2. Why not automate it completely?

Maybe julia stands for “Jeff’s uncommon lisp is automated”?

75 Likes

That should be the official answer whenever the name is asked in the future.

19 Likes

type-unstable --uh-- I think “dynamically typed” is the euphemism.

There are at least two CAS circumstances to consider:

  1. You are generating code for numerical evaluation. This would be “gentran” with a target of Fortran or C or – maybe Julia. Frankly, a small part of CAS coding, even if heavily used by some people.

  2. You are just following your nose in doing symbolic stuff everywhere. That is, if you can say sin(3.0) you have to be prepared to do something that is reasonable for sin(3000000000000000000000000) or sin(n*pi) or sin(1/2*x^2+8*y^2).
    Doing a just-in-time compilation of the program to simplify sin(something) doesn't seem like a winner if the implied type of the argument is "Could Be Anything" and the actual type is "Algebraic Expression of some sort".

I expect that code with (optimize (speed 3)(safety 0)) that expects floats and gets something else is either going to get a wrong answer or error or crash. Code with safety > 0 may be more graceful. Depends on the lisp implementation.

I suppose common lisp programmers can use typecase to insert declarations (in effect)

(typecase x
(float … [x is declared a float here…])
(list … [x is declared a list here …]))

This would be an old-fashioned way of constructing

(defmethod foo ((x float) …) …)
(defmethod foo ((x list) …) …)
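For comparison, the Julia analogue of that sketch (illustrative types and bodies only) is just two methods: the branch on the runtime type is written as dispatch, and it is resolved statically whenever the argument type is known at compile time.

foo(x::Float64) = x + 1.0            # "x is known to be a float here"
foo(x::AbstractVector) = length(x)   # "x is a list here"

foo(2.5)        # dispatches to the Float64 method
foo([1, 2, 3])  # dispatches to the vector method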

I don't know if the Common Lisp version of object-oriented programming makes significant sacrifices in efficiency to allow for flexibility (say, in re-defining methods) whether that flexibility is needed or not, or whether Julia, which apparently makes a different cut through the possible choices, is better for building a CAS. But if we use typecase instead of defmethod, I suppose we are clearly expressing what we intend the program to do.

The point that has been mentioned but maybe lost in the discussion is that the people who wrote Macsyma/Maxima, supposedly in Lisp, really have at least 3 languages. The top-level Algol-like language for users (not so much used by system implementers, but sometimes). The lowest-level language, which was Maclisp, then other dialects, then Common Lisp – but maybe with a view to the C or assembler beneath that Lisp. And the middle language, which is really a collection of symbolic math primitives written in Lisp, available to the person writing the guts of some symbolic algorithm, like indefinite integration, series computations, whatever. For instance meval, which is like Lisp's eval but for math; or add, which is like + except it works for any math or Lisp type; or several different simplifiers, depending on what you hope to accomplish and what form you want the results in: ratsimp, factor, ratexpand, …

Typically these programs are affected by global settings, and so you might think the Julia version of a*b is: if a and b are numbers, do the usual Julia multiply; otherwise do something symbolic, like Maxima's (mul a b). In Maxima, what happens in (mul a b) depends on far more than the types of a and b; it depends on those global settings, so directing a piece of computation to the name "mul" and the types of a and b is not attacking the nub of the problem. For example, is (mul '(x+1) '(x-1)) supposed to be (x^2-1)? Or left in factored form? Or converted to a special polynomial form, an exponent-coefficient list with guts that look like … (1 2 0 -1)?

(I am certainly repeating myself here — the expressiveness of the underlying language model in Maxima has not, in my experience, been a hindrance to writing programs. Students or staff can learn Lisp and read/write code and contribute new programs after a modest introductory period… Of course they have to know what they're doing algorithmically and mathematically. They are free to construct yet another language level if they wish, and Lisp supports this kind of activity. Has it been done? Hardly ever. I can think of only one area using CLOS (note, CLOS was implemented maybe 20 years after the initial Macsyma – the history involves Lisp Machine Flavors. Would it have been used back then? Maybe.)… the area is supporting a variant of bigfloat arithmetic. It lives happily with the rest of Maxima. Writing a CAS in Julia, it seems to me, will succeed or fail not as a CAS per se, but as it might be used as a tool for people already committed to writing Julia programs.)


2 Likes

Julia is dynamically typed too, of course; what I mean in this case is that many of Common Lisp's standard library functions are designed to have less-predictable types than they could have. For example, sqrt will return a complex number if the argument is negative. While it's possible to write such a function in julia too, we picked a different tradeoff for most standard functions.
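Concretely, in current Julia (the exact wording of the error message varies a bit across versions):

julia> sqrt(-1.0)
ERROR: DomainError with -1.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

julia> sqrt(complex(-1.0))
0.0 + 1.0im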

As Henry Baker puts it in his classic paper on type inference for common lisp, “…the polymorphic type complexity of the Common Lisp library functions is mostly gratuitous, and both the efficiency of compiled code and the efficiency of the programmer could be increased by rationalizing this complexity.”

7 Likes

I agree there — if the structure of an expression is reflected in julia types to a large extent, then yes I think you’d end up specializing to a counter-productive degree. (Note I’m not fully up to speed on the CAS discussion; just trying to make general points.) It’s no secret that the tradeoffs we picked in julia are aimed more at the numeric case than the symbolic. So I wouldn’t necessarily try to defend the point “julia is the best language for writing a CAS”, but having a generally-fast high-level language with a lot of built-in specialization machinery doesn’t hurt.

14 Likes

The complexity of mixing types in generic built-in Common Lisp code is an issue if you are using Lisp itself to do numerical computation. You have to nail down the types and trust the compiler to essentially open-code the only operations that matter, which are almost always on double-precision IEEE floats or 32-bit integers.

One way around this, historically pursued several times, is to generate FORTRAN or C code, compile it, and load it back into Lisp.

See this 1981 paper of a student of mine, Doug Lanam, which shows we could do that 40 years ago.
The theory was, all you needed was to embroider the front end to make users really comfortable.

https://dl.acm.org/doi/10.1145/800206.806400

Since the CAS part of the environment was inherently, as you say, type-unstable, that uncertainty/generality was entirely OK. The (at least traditional) approach in programming languages, where you could say 3+4 or 3.0+4.0 and they would both "work OK" but you couldn't say symbolic_x + symbolic_y, …
was a loser. You could say 3+4 but had to say symbolic_add(symbolic_x, symbolic_y).

In lisp, you could do (add 3 4) (add x y) … same syntax. So what if (add 3 4) was a little slower?
It works, and also works for bignums.

As for sqrt, my recollection is that the Unix library for real sqrt gave an error if it was applied to a negative number. The choices seem to include:
return a complex number — wrong type, trouble.
crash the program – rude
print some message and continue with 0 – huh? Or maybe continue with sqrt(abs())

Why might 0 be a good answer?

Say that r = f(z) is supposed to be zero or tiny positive, but because of roundoff it comes out as -1.0e-98.
The right mathematical answer for sqrt(r) should be tiny real, or zero. So maybe zero is an ok default.

On the other hand, if you were careful, you would check the sign of the argument before calling sqrt.
The evidence is strong that most people can’t be bothered to be careful.
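A one-line sketch of the "be careful" option (hypothetical helper name, and only appropriate when a negative argument can only come from roundoff):

clamped_sqrt(r) = sqrt(max(r, zero(r)))   # treat tiny negative roundoff as zero

clamped_sqrt(-1.0e-98)   # 0.0, instead of an error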

RJF

1 Like

Scraping my brain for other possibly relevant suggestions regarding lisp vs Julia. …
There was a facility called something like block compiling in Franz Lisp, first developed on the VAX-11 to make it possible to run Macsyma as fast as possible on this new computer, which had a much larger address space than the DEC-10 on which Macsyma was initially developed. The idea was sort of like packages, but with a big difference.

Consider that you put together a bunch of programs that call each other, but only a few that are called by external functions. Internally you know exactly how each function is called – number of arguments for instance. Instead of the full-fledged rather expensive call instruction of the architecture (say VAX or Motorola 68000) you can do stuff like putting arguments in registers, use pushj instead of call, etc.

For externally called functions you would have an extra, more elaborate, way of calling.
So inside the block you would not be able to redefine functions at runtime, or trace functions, etc. But they would call each other with very little overhead.

A hack with a similar intent but rather different in implementation was available in Maclisp on DEC-10 computers: the setting of the variable NOUUO. For a discussion, see
http://www.maclisp.info/pitmanual/system.html

The nub of this is to recognize that – if most of what you are doing is one function calling another calling another and ultimately doing small potatoes, then what to do? You can make calling as fast as possible.

Does the Julia design have something like this?

A lot of people still write domain-specific compilers like this, either by generating C/C++/Fortran or by directly interfacing LLVM. (You see it pretty often in the scientific Python world, for example PyDSTool, FEniCS and UFL, loopy, and JiTCODE, not to mention SymPy's code generation and Python-subset compilers like Numba and Pythran.) A basic issue is that the composability tends to be limited — if I want to plug my own type (e.g. a new number type for quaternions, dual numbers, or double-double arithmetic) into the generated code (and have it be fast), I need to teach the code generator / DSL about that type first. Whereas if the compiler is Julia itself, then implementing my type in Julia is sufficient to teach the compiler how to compose it with other Julia code. (Obviously, the same is true for any general-purpose language that compiles well and has sufficient support for polymorphism, but the combination of a dynamic language and fast multiple dispatch is pretty useful to enable composability.)
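As a small illustration of the composability point, here is a sketch with a deliberately minimal dual-number type (hypothetical, not a real package): defining its arithmetic is enough for generic code like evalpoly to accept it, with no changes to any code generator.

struct Dual{T} <: Number
    val::T   # value
    der::T   # derivative
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val * b.der + a.der * b.val)
Base.convert(::Type{Dual{T}}, x::Real) where {T} = Dual(T(x), zero(T))
Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T,S<:Real} = Dual{promote_type(T, S)}

# p(x) = 1 + 2x + 3x^2 and its derivative, both at x = 1.5, via the stock evalpoly:
evalpoly(Dual(1.5, 1.0), (1, 2, 3))   # Dual{Float64}(10.75, 11.0)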

15 Likes

Julia automatically does heuristic inlining to avoid function call overhead - basically copy-pasting the called function into the calling scope during compilation, but only if the cost of a call is within some threshold of the estimated cost of the function itself. There’s also been a lot of work to enhance constant-propagation to eliminate unused variables & branches within closures & inlined functions. See, for example,

f(x, y) = 2x + y
g(x, y) = y < 0 ? f(x, 0) : f(3x, 4y)
h(x) = g(x, 1)
julia> @code_native h(1)
        .text
; ┌ @ REPL[22]:1 within `h'
        pushq   %rbp
        movq    %rsp, %rbp
; │┌ @ REPL[21]:1 within `g'
; ││┌ @ REPL[11]:1 within `f'
; │││┌ @ int.jl:88 within `*'
        leaq    (%rcx,%rcx,2), %rax
; │││└
; │││┌ @ int.jl:87 within `+'
        leaq    4(%rax,%rax), %rax
; │└└└
        popq    %rbp
        retq
        nop

No callq instructions!

f and g were inlined into h, the branch was eliminated, and the integer constants got compiled into the h(::Int) method.

5 Likes

More readable if using Intel syntax and removing debug info:

julia> @cn h(1)
        .text
        lea     rax, [rdi + 2*rdi]
        add     rax, rax
        add     rax, 4
        ret
        nop     dword ptr [rax]

We can see that it first calculates rdi + 2*rdi (i.e., 3*rdi), assigning it to rax, then adds rax to itself (the same as multiplying by 2), and finally adds 4.
This is exactly what we would have gotten from f(x) = 6x+4 (i.e., the compiler splits the *6 into lea and add).
This is different from what you reported, which seems to have replaced the last two adds with

        lea     rax, [rax + rax + 4]

Which does the same thing, but uses one less instruction.

I tried a couple different Julia+LLVM versions and got the two adds each time, so I don’t think this difference is because of LLVM version.
I assume this was on your Zen2 computer?

Checking Agner Fog's instruction tables, lea is faster on Zen2 than it is on Skylake(-X), so LLVM picked the version fastest for our specific CPUs.

Reciprocal throughputs (lower is better):

Instruction   Zen2   Skylake-X
lea-2         1/4    1/2
lea-3         1/4    1
add           1/3    1/4

The N in lea-N means how many arguments. So lea rax, [rax + rax + 4] would be lea-3, which would be much faster on Zen2 than on Skylake-X.

5 Likes

This topic is turning out to be an interesting (if not very focused) discussion.

Nevertheless, I am still not sure who the “you” is you are addressing here. Julia is not generally advertised for its CAS capabilities, and it is not clear what claims you are addressing.

While I have a lot of respect for your contributions to Macsyma and CL, I find this tone somewhat condescending and arrogant, given that you initiated this discussion.

9 Likes

I think any time SymPy is mentioned as too slow, I take it as an implicit claim that Julia would be faster and (more or less) eventually doing the same thing. Of course if it is not doing the same thing, comparisons become less useful. :slight_smile:

Regarding providing code (etc.): I am in favor of test results in published papers being reproducible. In this forum I'm not sure that this is often provided. Regardless, it strikes me as somewhat improbable that readers here would wish to run Lisp benchmarks on their own computers. If I am mistaken and someone has an interest in such tests, I can try to nudge him/her past possible difficulties. Putting together a full instruction manual starting from ground zero is a lot of work if it is not going to be used. Apologies for expressing this notion in a brief piece of snark.

In posts by others, I see impressive results in compiler optimizations on Julia code, including aggressive use of unrolling and constant propagation. It might make it harder to debug code, but that's a common trade-off.
RJF

2 Likes

I am sorry that my knowledge of Lisp is limited to some shallow knowledge of Clojure and minimal editing of emacs configs. But one thing is for sure: there are definitely people here who wouldn't mind running Lisp benchmarks (myself included). I think what the Julia community likes most is reproducibility and benchmarks, so if you can provide some assistance, you would definitely find volunteers who would be more than happy to write and run said benchmarks. Not because it is some sort of competition, but because the existence of such a benchmark suite can highlight weaknesses and unexplored corners of the current implementation and provide a foundation for future development and improvement of Julia itself.

This is a big task no doubt, but a journey of a thousand miles begins with a single step, so if you can help with at least some of the steps, that would be of immense value.

10 Likes

OK, fair enough, if you want to try lisp! Here’s a test you can run on your favorite lisp on your favorite computer, and compare with any other language or system. My findings are in a comment at the end.
;;;;;;;;;;;;;; cut here;;;;;;;;;;;;;;;;;
;

;; Lisp  Horner's rule.  provided here for possible Julia benchmark comparison
;; I believe that this file could be loaded in any ANSI Common Lisp and will
;; print out timing results.
;; I used Allegro CL 10.0 Sept 2015  (old!)
;; and SBCL 1.3.12 (recent)

(defun gevalf(k x)			; k is a list of coefficients in double-float. Horner's rule.
  (let((ans 0.0d0))
    (declare (optimize (speed 3)(safety 0)) (type double-float x ans))     
    (do ((i k (cdr i)) )
	((null i)ans)
      (setf ans (+ (the double-float (car i)) (* x ans))))))
(compile 'gevalf)			; in case it wasn't already. SBCL does this automatically

;; Here the same evaluation but generic + and *, no declarations, no specified optimization
;; This code is not used here, just for reference
(defun geval(k x)			; k is a list of coefficients
  (let((ans 0))   		      
(do ((i k (cdr i)) )
    ((null i)ans)
  (setf ans (+ (car i) (* x ans))))))

;; sample test, a kind-of sparse polynomial of degree 24.
;; x^24+32x^21+9x^18+34x^12+34x^10+45*x^3 

(defvar k '(1 0 0 32 0 0 9 0 0 0 0 0 34 0 34 0 0 0 0 0 0 45 0 0 0)) 
(defvar kf (mapcar #'(lambda(r)(coerce r 'double-float)) k))

;; It is possible to generate a straight-line program if A is a list of explicit numbers
(defun hornerm(A name)			; name is the function being defined
  (let ((q (car A))
	(x (gensym)))			; q will be arithmetic expression
  (loop for i in (cdr A) do 
	(setf q `(+ ,(coerce i 'double-float) (* ,x ,q))))
 ; (print q)
 (eval  `(defun ,name(,x)(declare (double-float  ,x)(optimize(speed 3)(safety 0)))
	       ,q))
 (compile name)))

;; A cleverer version for sparse polynomials
;;This is same as hornerm but avoids the addition of coefs 0.0,
;; making it faster for sparse polys

(defun hornerm0(A name)	
  (let ((q (car A))
	(x (gensym)))	
    (loop for i in (cdr A) do 
	  (setq q  (if (= i 0.0) 
		       `(* ,x ,q)
		     `(+ ,(coerce i 'double-float) (* ,x ,q)))))
;  (print q)
  
 (eval  `(defun ,name(,x)(declare (double-float  ,x)(optimize(speed 3)(safety 0)))
	       ,q))
 (compile name)))

(defun $sparseptest()   ; set up a loop to test evaluation of sparse poly.
  (declare (optimize (speed 3)(safety 0)))
  (time(loop for i fixnum from 1 to 2000000 do  
	     (sparsep  0.5d0))))
(compile '$sparseptest) ; must be compiled otherwise we would be timing the "loop" here

(defun $denseptest()  ;; same polynomial but not checking for 0.0s.
  (declare (optimize (speed 3)(safety 0)))
  (time(loop for i fixnum from 1 to 2000000 do 
	     (densep  0.5d0))))
(compile '$denseptest)

(defun $denseloop()   ;; same polynomial but use the compact loop.
  (declare (optimize (speed 3)(safety 0)))
  (time(loop for i fixnum from 1 to 2000000 do  
	     (gevalf kf 0.5d0))))
(compile '$denseloop)

(defun $emptyloop()    ;; just see how long the (almost) empty loop takes
  (declare (optimize (speed 3)(safety 0)))
  (time(loop for i fixnum from 1 to 2000000 do (null-fun i))))
(defun null-fun(h) (+ 1 h))		; to keep empty loop from being optimized away

(compile 'null-fun)
(compile '$emptyloop)

(hornerm0 kf 'sparsep)			; disassemble.. 208 bytes on i86 sbcl, 249 in Allegro CL
(hornerm  kf 'densep)			; disassemble... 379 bytes sbcl, 722 in Allegro CL

(print 'sparse)
($sparseptest)				;0.062 sec non-gc cpu in Allegro CL
					;0.062 sec in SBCL
(print 'dense)	
($denseptest)				;0.16 sec non-gc cpu in Allegro CL
					;0.109 sec in SBCL
(print 'dense_in_loop) 
($denseloop)				;0.672 sec non-gc cpu in Allegro CL
					;0.214 sec in SBCL
(print'empty_loop) 
($emptyloop)				; 0.015 non-gc cpu in Allegro CL
					; 0.015 in SBCL

#| My conclusion after fiddling with this stuff.

1. open-coding of sparse horner's rule is worth it for this example. Removing
all those add of 0.0 gives a speedup of  1.7 to 2.6 !! (we could do this always.)

2. Open coding (compared to a tight loop) is worth it for this sparse and degree 24 poly,
for a speedup of 10X.  

3. There's some difference between SBCL and Allegro, though I have an old copy of Allegro. SBCL has the edge on these tests,

These timing tests were repeating the same computation 2,000,000 times. Try it!
|#
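For anyone who wants to compare, a rough Julia counterpart to the loop-based and unrolled Lisp versions above might look like the following (a sketch, not from the original post; note that Julia's evalpoly takes the constant term first, so the coefficient list is reversed):

using BenchmarkTools

# Same kind-of-sparse degree-24 polynomial:
# x^24 + 32x^21 + 9x^18 + 34x^12 + 34x^10 + 45x^3
const k  = (1, 0, 0, 32, 0, 0, 9, 0, 0, 0, 0, 0, 34, 0, 34, 0, 0, 0, 0, 0, 0, 45, 0, 0, 0)
const kf = reverse(Float64.(k))        # constant term first for evalpoly

@btime evalpoly(0.5, $kf)              # tuple: unrolled at compile time
@btime evalpoly(0.5, $(collect(kf)))   # vector: run-time Horner loop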
3 Likes

More on evaluating polynomials …
I could also post code in Lisp that could easily be copied into Julia, since it was probably copied from Fortran or C or Algol … though some of it would be specifically for double-float real and complex coefficients or evaluation points.

Computing p(x) approximately is part of the problem, but so is getting a bound on the error in the evaluation, and optionally the derivative dp of p at x, as well as the error in dp.

Sometimes people need to be reminded of the existence of this kind of "professional" version of programs that one might think of as "easy for anyone to code up in any programming language".

Everyone doing scientific programming should be acquainted with the examples that show that just writing out the quadratic formula as you learned it in high school is numerically unstable and there are simple ways to do much better. That’s hardly the only pitfall.
Even adding up a list of numbers can be done better (see Kahan summation).
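Here is a minimal sketch of compensated (Kahan) summation as a concrete instance of that last point; the comparison is against a naive left-to-right sum, since Julia's built-in sum already uses pairwise summation:

function kahan_sum(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))     # running compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y      # recover what was rounded away in s + y
        s = t
    end
    return s
end

xs = fill(0.1, 10^6)
abs(kahan_sum(xs) - 1.0e5), abs(foldl(+, xs) - 1.0e5)   # the compensated error is far smaller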

RJF

2 Likes