I still don’t understand why you are touting the speed of Julia symbolic code (compiled) compared to the speed of SymPy (interpreted).
I looked for a benchmark that could be run in (say) Maxima, and finally came up with a readable (sort of) description of what I think is a benchmark being re-used, with a derivative of a Jacobian whose definition begins …
MutableDenseMatrix([[Add(Mul(Symbol('l0'), Symbol('m1'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m2'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m3'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m4'), Pow(Function('u1')(Symbol('t')), Integer(2)), cos(Function('q1')(Symbol('t')))), Mul(Symbol('l0'), Symbol('m5'), ....
Needless to say, this (SymPy's internal constructor form) is not so great a form to transport around to other systems.
Now, as a benchmark, this has some peculiar characteristics. I don't know if the timing includes input and/or output; I would generally try to exclude that. Taking the derivative of a large tree-like functional form is a memory-cache-intensive task, possibly benefitting big-time from recognizing and exploiting common subexpressions, maybe by hashcoding. Does this critically depend on generic multiple dispatch, either for efficiency or for readability by humans? I doubt it. Could it be that writing the (probably simple) derivative program in C++ (SymEngine) or in Julia and compiling it accounts for the total speedup? Are the derivatives simplified along the way? Or is this algorithmic differentiation?
Maybe the SymPy algorithm is relatively naive.
I kept looking to see if I could find a benchmark with a small description, to easily compare Julia against (say) Maxima compiled and run in a decent Lisp (e.g. SBCL). I also expect that the Jacobian task can be programmed in a page or two of Lisp (no need for Maxima), but I may just not be untangling enough of the git chatter.
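To give a sense of scale, the core of such a differentiator really is tiny. Here is my own untested sketch, written in Maxima's user language rather than Lisp (a Lisp version is about the same size); it handles only atoms, "+", "*", "^", sin, and cos, and ducks all of the simplification questions raised above:

/* Untested sketch of a minimal recursive differentiator.
   Handles atoms, "+", "*", "^" (exponent free of x), sin, cos;
   anything else is returned as an unevaluated call. */
d(e, x) :=
  if atom(e) then (if e = x then 1 else 0)
  elseif op(e) = "+" then map(lambda([u], d(u, x)), e)
  elseif op(e) = "*" then
    d(first(e), x) * rest(e) + first(e) * d(rest(e), x)
  elseif op(e) = "^" then
    /* power rule; assumes the exponent does not involve x */
    second(e) * first(e)^(second(e) - 1) * d(first(e), x)
  elseif op(e) = sin then  cos(first(e)) * d(first(e), x)
  elseif op(e) = cos then -sin(first(e)) * d(first(e), x)
  else 'd(e, x)$

/* e.g. d(x^24 + 45*x^3, x)  returns  24*x^23 + 135*x^2 */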
Eventually I found another benchmark… it seems to be taking an expression and substituting floating-point numbers into it, in a loop. The expression appears to be
f = add(pow(x, integer(24)),
add(mul(integer(34), pow(x, integer(12))),
add(mul(integer(45), pow(x, integer(3))),
add(mul(integer(9), pow(x, integer(18))),
add(mul(integer(34), pow(x, integer(10))),
mul(integer(32), pow(x, integer(21))))))));
I read this into Maxima, with appropriate definitions for mul, add, etc., and found that the expression is f: x^24+32*x^21+9*x^18+34*x^12+34*x^10+45*x^3
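For the record, those glue definitions are one-liners; this is my reconstruction, not the benchmark's own code:

/* Map the constructor calls onto Maxima's operators, so the
   expression above can simply be read in and evaluated. */
add(a, b) := a + b;
mul(a, b) := a * b;
pow(a, b) := a ^ b;
integer(n) := n;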
Now this can be factored to take out an x^3, and/or rearranged by Horner's rule; or, to do the obvious "let's compare it to some other program" thing, it can be timed in a loop:
for i:0.0 step 0.5 thru 10000.0 do subst(i,x,f);
which takes Maxima 0.358 seconds and 44 megabytes on my 2009 MacBook.
Another way of stating this in Maxima, and asserting that only floating point values matter, is
kkf(x):=block([],mode_declare(x,float),x^24+32.0*x^21+9.0*x^18+34.0*x^12+34.0*x^10+45.0*x^3);
compile(kkf);
for i:0.0 step 0.5 thru 10000.0 do kkf(i);
This takes 0.066 seconds and 5.6 megabytes.
A fairly easy factor of 5 speedup. Still in Lisp.
It occurred to me that the LOOP was interpreted and that was an unnecessary burden.
So subtract off the empty (interpreted) loop:
for i:0.0 step 0.5 thru 10000.0 do nothing;
takes 0.07 seconds. Uh oh. The empty interpreted loop costs more than the entire compiled computation. What to do?
Define a testing program and compile THAT.
tester(max):=
block([],mode_declare(max,float),for i from 0.0 step 0.5 thru max do (mode_declare(i,float),kkf(i)))$
and also run that same loop without the call to kkf…
nulltester(max):=
block([],mode_declare(max,float),for i from 0.0 step 0.5 thru max do (mode_declare(i,float),i))$
compile(tester, nulltester)$
tester(10000.0)
takes 0.022 seconds
nulltester(10000.0)
takes 0.001 seconds.
So the repeated calls to kkf now take 0.021 seconds. Compare that to the interpreted version (0.358 seconds) and there is a speedup of about a factor of 17, i.e. roughly a microsecond per call over the 20001 iterations.
Warning: I did insert (optimize (speed 3) (safety 0)) into the Lisp code. This might be settable as a global flag to the compile() command, but I did not spot it.
Now you might think: oh, I'm sure we can do better with LLVM or something.
Lisp has a command called disassemble. Running this on kkf gives a bunch of opcodes, some unfamiliar to me, but I think they are a sort of prologue and some storage of the results. There in the middle of the code is a loop with calls to a power routine, but also this: open-coded multiply-scalar-double-precision-float (MULSD) instructions.
..
; AF7: 660F28CB MOVAPD XMM1, XMM3
; AFB: F20F580D9DFEFFFF ADDSD XMM1, [RIP-355] ; [#x231299A0]
; B03: F20F59CB MULSD XMM1, XMM3
; B07: F20F580D99FEFFFF ADDSD XMM1, [RIP-359] ; [#x231299A8]
; B0F: F20F59CA MULSD XMM1, XMM2
; B13: F20F580D95FEFFFF ADDSD XMM1, [RIP-363] ; [#x231299B0]
; B1B: F20F59E1 MULSD XMM4, XMM1
; B1F: F20F582589FEFFFF ADDSD XMM4, [RIP-375] ; [#x231299B0]
; B27: F20F59EC MULSD XMM5, XMM4
; B2B: F20F582D85FEFFFF ADDSD XMM5, [RIP-379] ; [#x231299B8]
; B33: F20F59EB MULSD XMM5, XMM3
....
So can you get a major speedup by doing something in Julia? Maybe get 10,000 processors and run in parallel? I don't know how fast this would run in some system (like GiNaC) that is less general but is supposed to be fast. I have not used GiNaC, but I certainly respect the people who built it.
Can this little benchmark be made faster in lisp?
I think that open-coding the calls to expt would make it faster yet.
An even cleverer compiler might notice that we are discarding the result of the computation for all iterations but the last, and just skip them. A numerical analyst might note that it is possible to fill an array with the values of a polynomial at evenly spaced points much faster than by evaluating the full polynomial at each point (using divided differences). There are other hacks for sparse polynomials, etc. None of these "improvements" comes out of "let's use multi-generic programming", but they might, in actuality, come from Maxima, which has commands that factor, apply Horner's rule, and/or pick out common subexpressions; see the sketches below.
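The Maxima commands, applied to the f above, are these (the command names are genuine; I leave running them as an exercise):

/* Maxima's built-in tools for the rewrites just mentioned. */
factor(f);     /* pulls out the common x^3 factor */
horner(f, x);  /* nested multiply-add (Horner) form */
optimize(f);   /* extracts repeated subexpressions into a block */

And the divided-difference trick is only a few lines itself. This is my own untested sketch (ffd_tabulate is my name, not a Maxima builtin): after d+1 honest evaluations of a degree-d polynomial, each further point on an evenly spaced grid costs only d additions.

/* Tabulate expr at 0, h, 2h, ..., (n-1)h by forward differences.
   Rounding errors accumulate over a long run, so a careful
   implementation would re-seed periodically. */
ffd_tabulate(expr, var, h, n) := block(
  [d : hipow(expand(expr), var), D, out],
  /* seed: d+1 direct evaluations at 0, h, ..., d*h */
  D : makelist(float(subst(k*h, var, expr)), k, 0, d),
  /* turn the seed values into leading forward differences */
  for lev : 1 thru d do
    for j : d+1 step -1 thru lev+1 do
      D[j] : D[j] - D[j-1],
  out : makelist(0.0, k, 1, n),
  out[1] : D[1],
  /* advance one grid step: d additions per point */
  for k : 2 thru n do (
    for i : 1 thru d do D[i] : D[i] + D[i+1],
    out[k] : D[1]),
  out)$

Called as ffd_tabulate(f, x, 0.5, 20001), this covers the same 0 through 10000 grid as the loops above at 24 additions per point, instead of a full polynomial evaluation.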
To be totally legit here, I should supply the entire code for anyone to run this benchmark on any machine, but frankly, I’m not paid enough to do that. There are plenty of hints embedded above, I think.
RJF