CAS benchmarks (Symbolics.jl and Maxima)

I still don’t understand why you are touting the speed of Julia symbolic code (compiled) compared to the speed of SymPy (interpreted).

I looked for a benchmark that could be run in (say) Maxima, but finally came up with a readable (sort of) description of what, I think, is a benchmark being re-used … with a derivative of a Jacobian that is defined beginning …

MutableDenseMatrix([[Add(Mul(Symbol('l0'), Symbol('m1'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m2'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m3'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m4'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m5'), ....

Needless to say, this is not so great a form to transport around to other systems.

Now as a benchmark, this has some peculiar characteristics. I don’t know if the timing includes input and/or output; I would try to exclude that, generally. Taking the derivative of a large tree-like functional form is a memory-cache-intensive task, possibly benefiting big-time from recognizing and exploiting common sub-expressions, maybe hashcoding. Does this critically depend on generic multiple dispatch, either for efficiency or readability by humans? I doubt it. Could it be that writing the (probably simple) derivative program in C++ (symengine) or in Julia and compiling it accounts for the total speedup? Are the derivatives simplified along the way? Or is this algorithmic differentiation?
Maybe the SymPy algorithm is relatively naive.
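
For concreteness, the shape of that computation, rendered in Symbolics.jl, would be roughly the following. This is only a sketch with made-up variable names, not the actual benchmark: it builds one pendulum-like term, takes a small Jacobian, and differentiates the entries with respect to t.

using Symbolics

@variables t l0 m1 q1(t) u1(t)
expr = l0 * m1 * u1^2 * cos(q1)            # one term of the kind shown above
J = Symbolics.jacobian([expr], [q1, u1])   # symbolic Jacobian w.r.t. the state variables
D = Differential(t)
dJ = expand_derivatives.(D.(J))            # time derivative of each Jacobian entry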

I kept looking to see if I could find a benchmark with a small description, to easily compare Julia against (say) Maxima compiled and run in a decent Lisp (e.g. SBCL), to see if it would be faster. I also expect that the Jacobian task can be programmed in a page or two of Lisp (no need for Maxima), but I may just not be untangling enough of the git chatter.

Eventually I found another benchmark… it seems to take an expression and substitute floating-point numbers into it, in a loop. The expression appears to be

 f = add(pow(x, integer(24)),
                             add(mul(integer(34), pow(x, integer(12))),
                                 add(mul(integer(45), pow(x, integer(3))),
                                     add(mul(integer(9), pow(x, integer(18))),
                                         add(mul(integer(34), pow(x, integer(10))),
                                             mul(integer(32), pow(x, integer(21))))))));

I read this into Maxima, with appropriate definitions for mul, add, etc, and I found that the expression is f: x^24+32*x^21+9*x^18+34*x^12+34*x^10+45*x^3

Now this can be factored to take out an x^3, and/or rearranged via Horner’s rule, or – to do the obvious – we can just compare it to some other program by timing it in a loop:

for i:0.0 step 0.5 thru 10000.0 do subst(i,x,f);

which takes Maxima 0.358 seconds and 44 megabytes on my 2009 macbook.

Another way of stating this in Maxima, and asserting that only floating point values matter, is

kkf(x):=block([],mode_declare(x,float),x^24+32.0*x^21+9.0*x^18+34.0*x^12+34.0*x^10+45.0*x^3);
compile(kkf);

for i:0.0 step 0.5 thru 10000.0 do kkf(i);

This takes 0.066 seconds and 5.6 megabytes: a fairly easy factor-of-5 speedup, still in Lisp.

It occurred to me that the loop itself was interpreted and that was an unnecessary burden.
So subtract off the empty (interpreted) loop…

for i:0.0 step 0.5 thru 10000.0 do nothing;

takes 0.07 seconds. Uh oh. The loop is more expensive when it is empty. What to do?

Define a testing program and compile THAT.

tester(max):=
block([],mode_declare(max,float),for i from 0.0 step 0.5 thru max do (mode_declare(i,float),kkf(i)))

and also run that same loop without the call to kkf…

nulltester(max):=
block([],mode_declare(max,float),for i from 0.0 step 0.5 thru max do (mode_declare(i,float),i))

tester(10000.0) takes 0.022 seconds
nulltester(10000.0) takes 0.001 seconds.

So the repeated calls to kkf now take 0.021 seconds.
Compare that to the interpreted version, and there is a speedup of about a factor of 17.

Warning: I did insert … (optimize (speed 3) (safety 0)) … into the Lisp code. This might be a global flag to the compile() command, but I did not spot it.

Now you might think, oh, I’m sure we can do better with LLVM or something.
Lisp has a command called disassemble.

Running this on kkf gives a bunch of opcodes, some unfamiliar to me, but I think they are a sort of prologue and some storage of the results. There in the middle of the code is a loop with calls to a power routine, but also this: open-coded multiply-scalar double-precision float instructions.

..

; AF7:       660F28CB         MOVAPD XMM1, XMM3
; AFB:       F20F580D9DFEFFFF ADDSD XMM1, [RIP-355]           ; [#x231299A0]
; B03:       F20F59CB         MULSD XMM1, XMM3
; B07:       F20F580D99FEFFFF ADDSD XMM1, [RIP-359]           ; [#x231299A8]
; B0F:       F20F59CA         MULSD XMM1, XMM2
; B13:       F20F580D95FEFFFF ADDSD XMM1, [RIP-363]           ; [#x231299B0]
; B1B:       F20F59E1         MULSD XMM4, XMM1
; B1F:       F20F582589FEFFFF ADDSD XMM4, [RIP-375]           ; [#x231299B0]
; B27:       F20F59EC         MULSD XMM5, XMM4
; B2B:       F20F582D85FEFFFF ADDSD XMM5, [RIP-379]           ; [#x231299B8]
; B33:       F20F59EB         MULSD XMM5, XMM3

....

So can you get a major speedup by doing something in Julia? Maybe get 10,000 processors and run in parallel? I don’t know how fast this would run in some system (like GiNaC) that is less general but is supposed to be fast. I have not used GiNaC, but certainly respect the people who built it.

Can this little benchmark be made faster in lisp?
I think that open-coding the calls to expt would make it faster yet.
An even cleverer compiler might notice that we are discarding the result of the computation for all iterations but the last, and just skip them. A numerical analyst might note that it is possible to fill an array with the values of a polynomial at evenly spaced points much faster than by evaluating the full polynomial at each point (using divided differences). There are other hacks for sparse polynomials, etc. None of these “improvements” come out of “let’s use multi-generic programming”, but might, in actuality, come from Maxima, which has commands that factor, use Horner’s rule, and/or picks out common subexpressions.
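
On the divided-differences remark: here is a minimal sketch (in Julia, purely illustrative) of tabulating a degree-d polynomial on an evenly spaced grid. After seeding a forward-difference table from d+1 evaluations, each further value costs only d additions, no multiplications.

# Forward-difference tabulation of p(x0), p(x0+h), ..., p(x0+(n-1)h) for a degree-d polynomial p.
function tabulate_forward_diff(p, x0, h, n, d)
    diffs = [float(p(x0 + i*h)) for i in 0:d]   # seed: p(x0), p(x0+h), ..., p(x0+d*h)
    for k in 1:d, i in d:-1:k                   # build the forward differences in place
        diffs[i+1] -= diffs[i]
    end
    vals = Vector{Float64}(undef, n)
    for j in 1:n
        vals[j] = diffs[1]                      # value at x0 + (j-1)*h
        for i in 1:d                            # d additions per new point
            diffs[i] += diffs[i+1]
        end
    end
    return vals
end

tabulate_forward_diff(x -> x^3, 0.0, 0.5, 11, 3)   # values of x^3 at 0, 0.5, ..., 5.0

(In floating point the differences accumulate rounding error for high degree over a wide range, which connects to the accuracy caveat raised later in the thread.)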

To be totally legit here, I should supply the entire code for anyone to run this benchmark on any machine, but frankly, I’m not paid enough to do that. There are plenty of hints embedded above, I think.

RJF

2 Likes

Let’s try this in Julia using Symbolics.

julia> using Symbolics, BenchmarkTools

julia> @syms x
(x,)

julia> f = x^24+32x^21+9x^18+34x^12+34x^10+45*x^3
x^24 + 45(x^3) + 34(x^10) + 34(x^12) + 9(x^18) + 32(x^21)

julia> @btime map(xᵢ -> substitute($f, $x => xᵢ), 0:0.5:10000); # symbolic substitution
  246.387 ms (3840197 allocations: 156.41 MiB)

julia> kkf = eval(build_function(f, x)) # translate the symbolic expression into a generic function
#15 (generic function with 1 method)

julia> @btime map($kkf, 0:0.5:10000);
  2.410 ms (2 allocations: 156.39 KiB)

Not too shabby! Let’s see if we can do better by feeding the generic function to LoopVectorization.jl.

julia> using LoopVectorization

julia> xvals = collect(0:0.5:10000);

julia> @btime vmap($kkf, $xvals); # single-threaded
  43.600 μs (2 allocations: 156.39 KiB)

julia> @btime vmapt($kkf, $xvals); # multithreaded
  6.600 μs (2 allocations: 156.39 KiB)

I have a 2020 laptop and I’m not familiar with Maxima, so I can’t do a head-to-head comparison, but six microseconds seems pretty healthy.

10 Likes

So your first timing number, 2.4 ms, is about the same as Maxima. And frankly, that’s the benchmark.

Loop vectorization assumes that we actually want to compute 20,000 values. We don’t, at least for this benchmark.

The benchmark that we are timing is the evaluation of a single polynomial.
Since the time is too short to reliably measure the duration, we do it 20,000 times. If we did it, say 100 trillion times, you’d have a problem first producing the vector. A different benchmark.

As I previously mentioned, a really clever compiler could deduce that all but one of the values are discarded, and so only the final one, f(10000) needs to be computed.
That wouldn’t be running the benchmark either.

I suppose a plausible optimization is to open-code x^3 to xxx, which might be slightly faster, but before doing that it might be worth profiling to see how much time is spent in the “power” routine.

I am curious to know if Julia can display the assembly language for this feat? It would possibly tell us how it computes x^3.

The loop vectorization idea seems neat, even if irrelevant here. You could propose that as another benchmark, I suppose. Maxima supports what amounts to the same computation, e.g.
subst([0.0, 0.5, 1.0, …], x, f) or kkf([0.0, 0.5, …]) (not if kkf is declared/compiled to machine floats, though).

Maxima does not, as far as I know, have specific optimization for this.

subst(matrix([ 0.0, 0.5, 1.0, …]),x,f) also works, but matrices are also by default lists of lists.
I expect that in either of these commands, just as in Julia, the simple version is inferior to a declared and specialized vector or comprehension loop optimization. I suppose this could be added to Maxima (loop vectorization for array arguments). One way to do this would be to expand the notion of “a number” in Maxima from “a lisp number or a bigfloat” to “an array whose elements are of type Maxima number”, and try to make that fast. (Getting a 50X speedup via loop vectorization strikes me as impressive.)

While this is something a CAS may be asked to do, evaluating a polynomial provided as an expression, it is not really a core capability in the sense that compiling the expression to a function is almost the first step in Maxima and Julia; the capability is “generation of a function that can be compiled”. The rest is numbers, and not really CAS.

This could be made more of a symbolic challenge if it were not being done in floats. f(10000) is about 100 decimal digits long, and thus bignum arithmetic is needed, and the arithmetic cannot be so specialized to one float method. Can the same kkf function in Julia be used for bignums or bigfloats without any change?

RJF

We can inspect the generated assembly with @code_native:

julia> g(x) = x^3
g (generic function with 1 method)

julia> @code_native g(2.0)
        .text
; ┌ @ REPL[16]:1 within `g'
        pushq   %rbp
        movq    %rsp, %rbp
; │┌ @ intfuncs.jl:313 within `literal_pow'
; ││┌ @ operators.jl:560 within `*' @ float.jl:332
        vmulsd  %xmm0, %xmm0, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
; │└└
        popq    %rbp
        retq
        nop
; └

So yes, Julia is computing x^3 as x*x*x.

kkf also works without modification for BigInt and BigFloat:

julia> kkf(BigInt(10000))
1000000000032000000000009000000000000000000000034000000340000000000000000000000000045000000000000

julia> kkf(BigFloat(10000))
1000000000032000000000009000000000000000000000034000000340000000000000000000000000045000000000000

Both take about 5 μs to evaluate one bignum on my machine.

6 Likes

The number you posted was “so the repeated calls to kkf now take 0.021 seconds”, which is 21ms, about 10x slower; am I misreading?

He’s running a 2009 Macbook (which uses something like a 2.2 GHz Core 2 Duo). I think a fair handicap puts those timings at close to what I’m seeing on my 2020 Ryzen 3 chip.

@StefanKarpinski Agreed re: the wandering focus of the thread. Maybe a moderator can split the timing/benchmarking discussion to a separate topic?

2 Likes

Really? I wouldn’t have thought it would be that different for a purely compute-bound, single-threaded, non-SIMD task like this one.

5 Likes

Just for fun, here is calling the kkf function on a quaternion of intervals of rational bigints without modifying anything:

julia> using Quaternions

julia> q = Quaternion((big(10000)//1)..(big(10001)//1))
Quaternion{Interval{Rational{BigInt}}}([10000//1, 10001//1], [0//1, 0//1], [0//1, 0//1], [0//1, 0//1], false)

julia> kkf(q)
Quaternion{Interval{Rational{BigInt}}}([1000000000032000000000009000000000000000000000034000000340000000000000000000000000045000000000000//1, 1002402762057130292417218792483211867066461638433124453839429977146956756630536623534228217410155//1], [0//1, 0//1], [0//1, 0//1], [0//1, 0//1], false)

Benchmarking it gives

julia> @benchmark kkf($q)
BenchmarkTools.Trial:
  memory estimate:  870.60 KiB
  allocs estimate:  20897
  --------------
  minimum time:     799.400 μs (0.00% GC)
  median time:      848.400 μs (0.00% GC)
  mean time:        1.123 ms (15.12% GC)
  maximum time:     20.949 ms (61.05% GC)
  --------------
  samples:          4448
  evals/sample:     1

I don’t have any frame of reference for whether 800 μs is fast or slow for this type of number. I would be curious to see how a Maxima generated function performs on a quaternion of intervals of rational bigints.

8 Likes

oops, my bad. Moved a decimal point.

It’s almost always possible to hack a benchmark, and sometimes fun, but that’s just an error on my part. I could find a faster computer, too…

Still, the CAS component of this benchmark is how effectively one can take an expression (here, a polynomial P in x) and render it in a form that can be evaluated at a numerical point, e.g. 0.5, either by substitution into the expression P or by “compiling” P into a numerical function F. The CAS does a good job when F runs fast.
If this is an important problem (that is, one has lots of different polynomials, and they must be evaluated many times), it may be worth considering how to do it using the fact that it is a polynomial and does not involve, say, cosine or square root.

A really, really fast F need not “compile” P at all, but just extract the coefficients. Then a standard Horner’s-rule loop can be used to evaluate it. There are, I suppose, remaining issues about different types of constants (integers, floats, …) as well as overflow. There is also a potential issue if P is extremely sparse: you don’t want to use Horner’s rule for x^1000+1…
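
A small illustration of that sparsity caveat, in Julia (a sketch; Base.evalpoly runs the Horner loop over a coefficient vector given constant term first):

dense_coeffs = [i == 1 || i == 1001 ? 1.0 : 0.0 for i in 1:1001]   # x^1000 + 1 in dense form
dense_eval(x)  = evalpoly(x, dense_coeffs)   # ~1000 multiply-adds, almost all with zero coefficients
sparse_eval(x) = x^1000 + 1                  # a single powering, far less work
dense_eval(1.001) ≈ sparse_eval(1.001)       # same value, very different effort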

A lisp version, where the coefficients are provided in a list.

(defun poly_eval (k x)                ; k is a list of coefficients, highest degree first
  (let ((ans 0))
    (do ((i k (cdr i)))
        ((null i) ans)
      (setf ans (+ (car i) (* x ans))))))   ;;; generic +,* could be specialized by declares


Compiling the inner loop … the (setf …) … is not exactly rocket science. Providing the coefficients in a list might seem inefficient, but if everything is in cache, which is faster: an indirect address (following a pointer), or incrementing a count and getting the next field in an array?
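
A direct Julia analogue of that Lisp loop might look like the sketch below; + and * stay generic, so the same code works for floats, bignums, intervals, and so on, with the compiler specializing it per element type.

function poly_eval(k, x)            # k: coefficients, highest degree first
    ans = zero(x)
    for c in k
        ans = muladd(x, ans, c)     # ans = c + x*ans
    end
    return ans
end

poly_eval((10, 20, 30), 2.0)        # 10*2^2 + 20*2 + 30 = 110.0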

This has strayed pretty far from a CAS, if you think that what a CAS should be doing is your calculus homework, or huge masses of tedious algebra.

In any case, thanks for the info regarding particular benchmarks here.
RJF

For evaluating polynomials by Horner’s rule, I’ve found it’s much better to inline and unroll the whole calculation so that there is no loop, just a sequence of fma’s. I’ve gotten big speedups (2-3x) in this way for the evaluation of special functions in Julia compared to traditional Fortran implementations (CEPHES etc) that used loops over arrays of coefficients.

Julia’s evalpoly function does this unrolling for you given a tuple of coefficients. It even uses a fancier non-Horner recurrence for complex arguments that saves a factor of ~2 in the number of multiplications (and is significantly faster than Horner in practice).
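
A minimal usage sketch (coefficients ascending, constant term first):

evalpoly(2.0, (30.0, 20.0, 10.0))   # 30 + 20*2 + 10*2^2 = 110.0
evalpoly(1.0 + 2.0im, (1, 2, 3))    # complex argument: uses the non-Horner recurrence mentioned above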

And there is even a package that generates inlined SIMD-based polynomial evaluation, which beats the scalar algorithm for degrees > 16 or so: GitHub - augustt198/SIMDPoly.jl

10 Likes

I did some experiments with the lisp Horner code (in a loop).

Initially it seemed that the code for the generic case was just a little slower than the optimized-for-double-precision code. The reason was that the testing framework was taking too much time, so I had to compile and optimize the testing loop. It seems that comparing the two programs (generic and floating-point only), each compiled with (optimize (speed 3) (safety 0)) on SBCL Lisp, there is approximately a 10x speedup for the floating-only version. The test data is the same in both cases.

These are still loops rather than straight-line open coded.

Here’s a program I wrote some years ago that takes a list of coefficients and defines a straight-line program by unrolling the Horner form of the polynomial.


(defun hornerm(A name)			;A is list of numbers. name is what to call the program e.g. 'f
  (let ((q (car A))
	(x (gensym)))			; q will be arithmetic expression
  (loop for i double-float in (cdr A) do 
	(setf q `(+ ,(coerce i 'double-float) (* ,x ,q))))
  
 (eval  (setf q `(defun ,name(,x)(declare (double-float  ,x)(optimize(speed 3)(safety 0)))
	       ,q)))  
;;  (print q) ;; remove if you don't need this for debugging
 (compile name)))


for example,
(hornerm '(10 20 30) 'foo)

defines and compiles this program


(DEFUN FOO (#:G1274)
  (DECLARE (DOUBLE-FLOAT #:G1274) (OPTIMIZE (SPEED 3) (SAFETY 0)))
  (+ 30.0d0 (* #:G1274 (+ 20.0d0 (* #:G1274 10)))))

So the question then is: does the straight-line program run faster? About 7% faster.
The assembler code looks like…

; 2A4: DDDA FSTPD FR2
; 2A6: D9EE FLDZ
; 2A8: D9CA FXCH FR2
; 2AA: D8C2 FADDD FR2
; 2AC: 9B WAIT
; 2AD: D8C9 FMULD FR1
; 2AF: 9B WAIT
repeated many times. Maybe those WAITs are bad news?
I don’t really know intel assembly code much but it seems this is doing loads, stores, adds, multiplies. Could it be faster, practically speaking [ignoring some theoretical hacks]? Well, some of those adds are adding 0.0. Maybe a “sparse” version would be faster.

At this point it seems that it is pretty much automated: one can start from a polynomial expression in one variable with float coefficients and convert it to straight line assembly code.
I don’t see how Julia can make a much more efficient program, at least for a dense polynomial of modest degree, but maybe I’m missing something in this analysis.

Of course if you want to evaluate foo at N different points and have q processors, you might do stuff in parallel. The ANSI Common Lisp standard doesn’t address this, but particular implementations tend to support some version of parallel processing.

Cheers

Here is the equivalent code in Julia:

function evalpoly(x, p::Tuple)
    if @generated
        N = length(p.parameters)
        ex = :(p[end])
        for i in N-1:-1:1
            ex = :(muladd(x, $ex, p[$i]))
        end
        ex
    else
        _evalpoly(x, p)
    end
end

This code is very similar to the Lisp code you wrote. But there is one major difference: nowhere can you find the equivalent of double-float. In fact, evalpoly will generate optimized code for whatever coefficient types or x type you use.

For everything an integer:

julia> @code_native evalpoly(1, (1,2,3))
	.text
; ┌ @ math.jl:131 within `evalpoly'
	movq	16(%rsi), %rax
; │┌ @ math.jl:132 within `macro expansion'
; ││┌ @ promotion.jl:404 within `muladd'
; │││┌ @ int.jl:88 within `*'
	imulq	%rdi, %rax
; │││└
; │││┌ @ int.jl:87 within `+'
	addq	8(%rsi), %rax
; │││└
; │││┌ @ int.jl:88 within `*'
	imulq	%rdi, %rax
; │││└
; │││┌ @ int.jl:87 within `+'
	addq	(%rsi), %rax
; ││└└
	retq
	nopw	%cs:(%rax,%rax)
; └└

If x is a double:

julia> @code_native evalpoly(1.0, (1,2,3))
	.text
; ┌ @ math.jl:131 within `evalpoly'
; │┌ @ math.jl:132 within `macro expansion'
; ││┌ @ promotion.jl:358 within `muladd'
; │││┌ @ promotion.jl:298 within `promote'
; ││││┌ @ promotion.jl:276 within `_promote'
; │││││┌ @ number.jl:7 within `convert'
; ││││││┌ @ float.jl:94 within `Float64'
	vcvtsi2sdq	16(%rdi), %xmm1, %xmm1
	vcvtsi2sdq	8(%rdi), %xmm2, %xmm2
; │││└└└└
; │││ @ promotion.jl:358 within `muladd' @ float.jl:339
	vfmadd231sd	%xmm1, %xmm0, %xmm2     # xmm2 = (xmm0 * xmm1) + xmm2
; │││ @ promotion.jl:358 within `muladd'
; │││┌ @ promotion.jl:298 within `promote'
; ││││┌ @ promotion.jl:276 within `_promote'
; │││││┌ @ number.jl:7 within `convert'
; ││││││┌ @ float.jl:94 within `Float64'
	vcvtsi2sdq	(%rdi), %xmm3, %xmm1
; │││└└└└
; │││ @ promotion.jl:358 within `muladd' @ float.jl:339
	vfmadd213sd	%xmm1, %xmm2, %xmm0     # xmm0 = (xmm2 * xmm0) + xmm1
; ││└
	retq
	nopl	(%rax)
; └└

And so on. What if, for some types, there is a better algorithm? Just specify it as another method and multiple dispatch will do its thing. That is, in my opinion, the beauty of Julia.
9 Likes

There is some ambiguity in your post. If the line in the program is presented to a compiler as evalpoly(1.0, (1,2,3)), then the only code that needs to be generated is “return 6.0”.

If, on the other hand, the coefficient tuple in the call is not known until run time, then it can’t be unrolled at compile time. Are you saying that at run time the program is unrolled? That seems wasteful.

If the line in the program is evalpoly(D, (1,2,3)), where D is a variable,

then it seems to me this information is quite similar to Lisp’s (declare (type double-float D)). That is, somewhere there is a specification of the type of the variable D, or, if it is an expression, its type can be deduced at compile time.
Generating a particular version of evalpoly for this call-location in the source code based on the types of the arguments right here seems to be the Julia claim to fame. Is this right?
If so, I think it has some similarities to the Lisp compiler directive “inline”. It may be done by macro-expansion, compiler macro-expansion or mostly by the compiler unraveling some code.
Possibly Julia is neater on this, but really, everyone is generating code to be executed by a machine, right?

It seems to me the vector of coefficients has to be coerced (sometime) to known types, presumably float, prior to using machine float ops. Using integers and the prospect of needing bigfloats means that integer machine ops may not do the job.

The context of a CAS, where the types of coefficients and the point of evaluation are dynamically determined, means that, in general, none of this type information can be assured, and the arguments are all of “type” arbitrary expression. Perhaps then the same evalpoly could evaluate a polynomial at the symbolic point z+4, with coefficients that are symbols, say (a, (z+1), 4.0). I expect a sufficiently general muladd might be enough to make the program work, though that leaves up in the air a number of questions, such as whether (z+1)*(z+4) is multiplied through or left factored.
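
That conjecture is easy to check with Symbolics.jl; the following is just a sketch, and the exact displayed form depends on the symbolic layer rather than on evalpoly itself.

using Symbolics
@variables z a

evalpoly(z + 4, (a, z + 1, 4.0))
# returns a symbolic expression equal to a + (z + 1)*(z + 4) + 4.0*(z + 4)^2, built as the
# nested Horner/muladd form; whether the products are multiplied through or left factored
# is up to the simplifier, not evalpoly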

Faster evaluation of polynomials of sufficiently high degree to make the various theoretical hacks appropriate, generally requires that the arithmetic be unit cost and fully accurate. Since neither of these conditions may be met with arbitrary-precision integers, rationals, or floats, I’d be cautious of this. Note also that polynomials of high degree (50 or more? depends on what you are doing …) are generally rather nasty functions. Try plotting product(x-i,i,1,50). So they are not so useful in applications. There are exceptions, with various orthonormal polynomials on a restricted range, but even there, high degree is unlikely.

Evaluating a single polynomial at a single point by SIMD to gain speed sounds like a stretch, but then computing hardware marches on.

GitHub - augustt198/SIMDPoly.jl does this in a pretty simple manner. Basically, if you precompute (x, x^2, x^3, x^4), you can evaluate 4 terms of the polynomial per instruction (with AVX2). Empirically, this is faster once you get above degree 10 or so.
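
To illustrate the blocking idea in plain Julia (a sketch only, not the package’s actual code): split the coefficients into groups of four, evaluate each group against (1, x, x^2, x^3), and combine the groups with a Horner step in x^4.

function poly_by_4(x, c::NTuple{N,Float64}) where {N}
    @assert N % 4 == 0                                 # coefficients ascending, length a multiple of 4
    xs = (1.0, x, x^2, x^3)
    x4 = x^4
    acc = 0.0
    for g in (N - 4):-4:0                              # highest block first
        block = sum(ntuple(i -> c[g + i] * xs[i], 4))  # one 4-wide block: a SIMD multiply + reduce
        acc = muladd(acc, x4, block)                   # Horner step in x^4 across blocks
    end
    return acc
end

poly_by_4(2.0, (30.0, 20.0, 10.0, 0.0))   # 30 + 20*2 + 10*2^2 = 110.0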

1 Like

Not explicitly SIMD for a single evaluation (because the idea is still that you’d SIMD across multiple evaluations), but SLEEFPirates.estrin uses the Estrin method, scaling much better for multiple evaluations:

julia> coef = ntuple(_ -> randn(), Val(30));

julia> packed = packpoly1x4r(coef);

julia> @btime evalpoly1x4r($(Ref(1.2))[], $packed)
  4.644 ns (0 allocations: 0 bytes)
-204.5524827611573

julia> @btime SLEEFPirates.estrin($(Ref(1.2))[], $coef)
  5.437 ns (0 allocations: 0 bytes)
-204.5524827611573

julia> @btime evalpoly($(Ref(1.2))[], $coef)
  12.540 ns (0 allocations: 0 bytes)
-204.5524827611572

But SIMDPoly won for all the numbers of coefficients I tried when it comes to speeding up a single evaluation.
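
For reference, here is what the Estrin scheme looks like written out by hand for eight coefficients (ascending order); this is a sketch rather than the SLEEFPirates implementation. The muladds within each level are independent of one another, unlike Horner’s strictly serial chain, which is where the extra instruction-level parallelism comes from.

function estrin8(x, c::NTuple{8,Float64})
    x2 = x * x
    x4 = x2 * x2
    a1 = muladd(c[2], x, c[1]);  a2 = muladd(c[4], x, c[3])   # level 1: four independent fmas
    a3 = muladd(c[6], x, c[5]);  a4 = muladd(c[8], x, c[7])
    b1 = muladd(a2, x2, a1);     b2 = muladd(a4, x2, a3)      # level 2: two independent fmas
    return muladd(b2, x4, b1)                                  # level 3
end

estrin8(2.0, (30.0, 20.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0))   # 110.0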

Types are known at compile time, so in that example what was known is evalpoly(::Int, ::NTuple{3,Int}).
If we define a function f such that everything is known:

julia> f() = evalpoly(1.0, (1,2,3))
f (generic function with 1 method)

julia> @code_typed f()
CodeInfo(
1 ─     return 6.0
) => Float64

If we make the coefficients known:

julia> f(x) = evalpoly(x, (1,2,3))
f (generic function with 2 methods)

julia> @code_typed f(2)
CodeInfo(
1 ─ %1 = Base.mul_int(x, 3)::Int64
│   %2 = Base.add_int(%1, 2)::Int64
│   %3 = Base.mul_int(x, %2)::Int64
│   %4 = Base.add_int(%3, 1)::Int64
└──      return %4
) => Int64

Etc.

If the number of coefficients is unknown but large, the compiler would unroll by some constant amount (dependent on the particular machine it’s compiling for) to make good use of the CPU’s execution resources.

6 Likes

Yes. If we don’t know the argument types in advance, the compiler will generate specialized code at run time. However, it is cached, so if another call happens with the same types of coefficients (which is likely) that code is reused. This is per argument signature, not per call site, so we can reuse the code no matter which call site caused the code to be generated, or when it was generated.

Also, the if @generated construct gives the compiler the option not to generate code. If it doesn’t want to generate specialized code, it can use the specified run-time-only implementation instead. In practice we always do code generation, but you can pass e.g. --compile=min to disable it.
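
A small sketch of that caching behavior (timings omitted, since they vary by machine):

g(x, p) = evalpoly(x, p)
@time g(2.0, (1, 2, 3))        # first call with (Float64, NTuple{3,Int}): pays the compilation cost
@time g(5.0, (4, 5, 6))        # same argument signature, from any call site: cached code is reused
@time g(big"2.0", (1, 2, 3))   # new signature (BigFloat, NTuple{3,Int}): compiles again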

16 Likes

This is cool to see because it reminds me of a moment that was probably a key influence on julia. I learned the lisp lore that you could, say, write a macro to add type declarations to every subexpression, and then a good lisp implementation would generate fast code. I immediately thought two things:

  1. Why not just declare the types of variables? You can, but common lisp is generally too “type-unstable” (as we would call it) to be sure you’ll get the intended code. (May not be true for floats but could happen with other types.)
  2. Why not automate it completely?

Maybe julia stands for “Jeff’s uncommon lisp is automated”?

91 Likes

That should be the official answer whenever the name is asked in the future.

23 Likes

type-unstable --uh-- I think “dynamically typed” is the euphemism.

There are at least two CAS circumstances to consider:

  1. You are generating code for numerical evaluation. This would be “gentran” with a target of Fortran or C or – maybe Julia. Frankly, a small part of CAS coding, even if heavily used by some people.

  2. You are just following your nose in doing symbolic stuff everywhere. That is, if you can say sin(3.0) you have to be prepared to do something that is reasonable for sin(3000000000000000000000000) or sin(n*pi) or sin(1/2*x^2+8*y^2).
    Doing a just-in-time compilation of the program to simplify sin(something) doesn’t seem like a winner if the implied type of the argument is “Could Be Anything” and the actual type is “Algebraic Expression of some sort”.

I expect that code with (optimize (speed 3)(safety 0)) that expects floats and gets something else is either going to get a wrong answer or error or crash. Code with safety > 0 may be more graceful. Depends on the lisp implementation.

I suppose common lisp programmers can use typecase to insert declarations (in effect)

(typecase x
(float … [x is declared a float here…])
(list … [x is declared a list here …]))

This would be an old-fashioned way of constructing

(defmethod foo ((x float) …) …)
(defmethod foo ((x list) …) …)
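
The Julia spelling of that pair is one method per argument type, selected by dispatch (a sketch):

foo(x::AbstractFloat)  = "handle x as a float here"
foo(x::AbstractVector) = "handle x as a list here"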

I don’t know if the Common Lisp version of object-oriented programming makes significant sacrifices in efficiency to allow for flexibility (say, in re-defining methods) whether this flexibility is needed or not, and if Julia, which apparently makes a different cut through the choices possible, is better for building a CAS. But if we use typecase instead of generic functions, I suppose we are clearly expressing what we intend for the program to do.

The point that has been mentioned but maybe lost in the discussion is that the people who wrote Macsyma/Maxima, supposedly in Lisp, really have at least 3 languages. The top-level Algol-like language for users (not so much used by system implementers, but sometimes). The lowest-level language, which was Maclisp, then other dialects, then Common Lisp – but maybe with a view to C or the assembler underneath that. And the middle language, which is really a collection of symbolic math primitives written in Lisp, available to the person writing the guts of some symbolic algorithm, like indefinite integration, series computations, whatever. For instance meval, which is like Lisp’s eval but for math; or add, which is like + except for any math or Lisp type. Or several different simplifiers, depending on what you hope to accomplish and what form you want the results in: ratsimp, factor, ratexpand, …

Typically these programs are affected by global settings, and so you might think the Julia version of a*b is: if a, b are numbers, do the usual Julia multiply; otherwise do something symbolic, like in Maxima you would do (mul a b). In Maxima, what happens in (mul a b) depends on far, far more than the types of a and b (it depends on those global settings), so somehow directing a piece of computation to the name “mul” and the types of a, b is not attacking the nub of the problem. Like, is (mul '(x+1) '(x-1)) supposed to be (x^2-1)? Or left in factored form? Or converted to a special polynomial form, an exponent-coefficient list with guts that look like … (1 2 0 -1)?
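
For what it’s worth, the same question shows up with Symbolics.jl (a sketch, assuming its default simplification): the product is left unexpanded, and multiplying through is a separate explicit step rather than a global setting.

using Symbolics
@variables x

p = (x + 1) * (x - 1)    # kept in product form by default
Symbolics.expand(p)      # multiplied through: x^2 - 1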

(I am certainly repeating myself here — the expressiveness of the underlying language model in Maxima has not, in my experience, been a hindrance to writing programs. Students or staff can learn lisp and read/write code and contribute new programs after a modest introductory period… Of course they have to know what they’re doing algorithmically and mathematically. They are free to construct yet another language level if they wish, and Lisp supports this kind of activity. Has it been done? Hardly ever. I can think of only one area using CLOS (note, CLOS was implemented maybe 20 years after initial Macsyma – history involves Lisp Machine Flavors. Would it have been used back then, maybe.)… the area is supporting a variant of bigfloat arithmetic. It lives happily with the rest of Maxima. Writing a CAS in Julia, it seems to me, will succeed or fail not as a CAS per se, but as it might be used as a tool for people committed to writing Julia programs already.)


3 Likes

Julia is dynamically typed too of course; what I mean in this case is that many of Common Lisp’s standard library functions are designed to have less-predictable types than they could. For example, its sqrt will return a complex number if the argument is negative. While it’s possible to write such a function in Julia too, we picked a different tradeoff for most standard functions.

As Henry Baker puts it in his classic paper on type inference for common lisp, “…the polymorphic type complexity of the Common Lisp library functions is mostly gratuitous, and both the efficiency of compiled code and the efficiency of the programmer could be increased by rationalizing this complexity.”

8 Likes