I agree there — if the structure of an expression is reflected in julia types to a large extent, then yes I think you’d end up specializing to a counter-productive degree. (Note I’m not fully up to speed on the CAS discussion; just trying to make general points.) It’s no secret that the tradeoffs we picked in julia are aimed more at the numeric case than the symbolic. So I wouldn’t necessarily try to defend the point “julia is the best language for writing a CAS”, but having a generally-fast high-level language with a lot of built-in specialization machinery doesn’t hurt.
The complexity of mixing types in generic built-in Common Lisp code is an issue if you are using Lisp itself to do numerical computation. You have to nail down the types and trust the compiler to essentially open-code the only operations that matter, which are almost always double-precision IEEE floats or 32-bit integers.
One way around this, historically pursued several times, is to generate FORTRAN or C code, compile it, and load it back into Lisp.
See this 1981 paper of a student of mine, Doug Lanam, which shows we could do that 40 years ago.
The theory was, all you needed was to embroider the front end to make users really comfortable.
https://dl.acm.org/doi/10.1145/800206.806400
Since the CAS part of the environment was inherently, as you say, type-unstable, that uncertainty/generality was entirely OK. The (at least traditional) approach in programming languages, in which you could say 3+4 or 3.0+4.0 and both would “work ok” but you couldn’t say symbolic_x + symbolic_y, was a loser. You could say 3+4 but had to say symbolic_add(symbolic_x, symbolic_y).
In lisp, you could do (add 3 4) (add x y) … same syntax. So what if (add 3 4) was a little slower?
It works, and also works for bignums.
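For comparison, the same point in Julia (a minimal sketch; the Sym type below is invented purely for illustration): one generic + works for machine integers, bignums, and a user-defined symbolic type, with identical syntax, via dispatch.

```julia
# Hypothetical minimal symbolic type, just to illustrate the dispatch point.
struct Sym
    ex::Any            # a Symbol or an expression tree
end
Base.:+(a::Sym, b::Sym) = Sym(Expr(:call, :+, a.ex, b.ex))

3 + 4                  # 7, specialized machine-integer add
big(2)^100 + 1         # bignum add, same syntax
Sym(:x) + Sym(:y)      # Sym(:(x + y)), same syntax again
```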
As for sqrt, my recollection is that the Unix library routine for real sqrt gave an error if it was applied to a negative number. The choices seem to include:
return a complex number — wrong type, trouble.
crash the program – rude
print some message, continue with 0. Huh? Or maybe continue with sqrt(abs()).
Why might 0 be a good answer?
Say that r = f(z) is supposed to be zero or tiny positive, but because of roundoff it comes out as -1.0e-98.
The right mathematical answer for sqrt(r) should be a tiny real, or zero. So maybe zero is an OK default.
On the other hand, if you were careful, you would check the sign of the argument before calling sqrt.
The evidence is strong that most people can’t be bothered to be careful.
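For concreteness, a small Julia sketch of that situation (safesqrt and its tolerance are invented here for illustration; Julia's own sqrt throws a DomainError for negative floats):

```julia
r = -1.0e-98                  # "should be" zero or tiny positive, but roundoff made it negative
# sqrt(r)                     # would throw DomainError(-1.0e-98)

# Clamp only arguments that are negative by less than a tolerance; the tolerance is arbitrary.
safesqrt(r; tol = 1e-12) = sqrt(-tol < r < 0 ? 0.0 : r)

safesqrt(r)                   # 0.0, the "right mathematical answer" in this case
# safesqrt(-1.0)              # still throws: genuinely negative arguments are not masked
```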
RJF
Scraping my brain for other possibly relevant suggestions regarding Lisp vs. Julia …
There was a facility called something like block compiling in Franz Lisp, first developed on the VAX-11 to make it possible to run Macsyma as fast as possible on this new computer, which had a much larger address space than the DEC-10 on which Macsyma was initially developed. The idea was sort of like packages, but with a big difference.
Consider that you put together a bunch of programs that call each other, with only a few that are called by external functions. Internally you know exactly how each function is called: the number of arguments, for instance. Instead of the full-fledged, rather expensive call instruction of the architecture (say VAX or Motorola 68000), you can do things like putting arguments in registers, using pushj instead of call, etc.
For externally called functions you would have an extra, more elaborate, way of calling.
So inside the block you would not be able to redefine functions at runtime, trace them, etc., but they would call each other with very little overhead.
A hack with a similar intent, but rather different in implementation, was available in Maclisp on DEC-10 computers: setting the variable NOUUO. For a discussion, see
http://www.maclisp.info/pitmanual/system.html
The nub of this is to recognize that if most of what you are doing is one function calling another calling another, each ultimately doing small potatoes, then what do you do? You make calling as fast as possible.
Does the Julia design have something like this?
A lot of people still write domain-specific compilers like this, either by generating C/C++/Fortran or by directly interfacing LLVM. (You see it pretty often in the scientific Python world, for example: PyDSTool, FEniCS and UFL, loopy, and JiTCODE, not to mention SymPy’s code generation and Python-subset compilers like Numba and Pythran.) A basic issue is that the composability tends to be limited — if I want to plug my own type (e.g. a new number type for quaternions, dual numbers, or double-double arithmetic) into the generated code (and have it be fast), I need to teach the code generator / DSL about that type first. Whereas if the compiler is Julia itself, then implementing my type in Julia is sufficient to teach the compiler how to compose it with other Julia code. (Obviously, the same is true for any general-purpose language that compiles well and has sufficient support for polymorphism, but the combination of a dynamic language and fast multiple dispatch is pretty useful to enable composability.)
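As a concrete sketch of that composability (the Dual type below is a toy written for illustration, not a real package): define a new number type with nothing but its arithmetic, and generic code such as Base.evalpoly gets specialized for it automatically.

```julia
# A toy forward-mode dual number; it knows nothing about any particular algorithm.
struct Dual{T<:Real} <: Number
    val::T    # function value
    der::T    # derivative
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val * b.der + a.der * b.val)
Base.convert(::Type{Dual{T}}, x::Real) where {T} = Dual(T(x), zero(T))
Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T,S<:Real} = Dual{promote_type(T, S)}

# Base.evalpoly (Horner's rule) was written for ordinary numbers, yet the compiler
# now generates a Dual-specialized version of it on demand:
p(x) = evalpoly(x, (1, 2, 3))    # 1 + 2x + 3x^2
p(Dual(0.5, 1.0))                # Dual(2.75, 5.0): p(0.5) and p'(0.5) together
```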
Julia automatically does heuristic inlining to avoid function call overhead - basically copy-pasting the called function into the calling scope during compilation, but only if the cost of a call is within some threshold of the estimated cost of the function itself. There’s also been a lot of work to enhance constant-propagation to eliminate unused variables & branches within closures & inlined functions. See, for example,
f(x, y) = 2x + y
g(x, y) = y < 0 ? f(x, 0) : f(3x, 4y)
h(x) = g(x, 1)
julia> @code_native h(1)
.text
; ┌ @ REPL[22]:1 within `h'
pushq %rbp
movq %rsp, %rbp
; │┌ @ REPL[21]:1 within `g'
; ││┌ @ REPL[11]:1 within `f'
; │││┌ @ int.jl:88 within `*'
leaq (%rcx,%rcx,2), %rax
; │││└
; │││┌ @ int.jl:87 within `+'
leaq 4(%rax,%rax), %rax
; │└└└
popq %rbp
retq
nop
No callq instructions! f and g were inlined into h, the branch was eliminated, and the integer constants got compiled into the h(::Int) method.
More readable if using Intel syntax and removing debug info:
julia> @cn h(1)
.text
lea rax, [rdi + 2*rdi]
add rax, rax
add rax, 4
ret
nop dword ptr [rax]
We can see that it first calculates rdi + 2*rdi (== 3*rdi), assigning it to rax, which it then adds to itself (same as *2), and finally adds 4.
This is exactly what we would have gotten from f(x) = 6x+4 (i.e., the compiler splits the *6 into lea and add).
This is different from what you reported, which seems to have replaced the last two adds with
lea rax, [rax + rax + 4]
which does the same thing, but uses one less instruction. I tried a couple of different Julia+LLVM versions and got the two adds each time, so I don’t think this difference is due to the LLVM version.
I assume this was on your Zen2 computer?
Checking Agner Fog’s instruction tables, lea is faster on Zen2 than it is on Skylake(-X), so LLVM picked the version fastest for our specific CPUs.
Reciprocal throughputs (lower is better):
| Instruction | Zen2 | Skylake-X |
|---|---|---|
| lea-2 | 1/4 | 1/2 |
| lea-3 | 1/4 | 1 |
| add | 1/3 | 1/4 |
The N in lea-N means the number of arguments. So lea rax, [rax + rax + 4] would be lea-3, which would be much faster on Zen2 than on Skylake-X.
This topic is turning out to be an interesting (if not very focused) discussion.
Nevertheless, I am still not sure who the “you” is that you are addressing here. Julia is not generally advertised for its CAS capabilities, and it is not clear what claims you are addressing.
While I have a lot of respect for your contributions to Macsyma and CL, I find this tone somewhat condescending and arrogant, given that you initiated this discussion.
I think any time SymPy is mentioned as too slow, I take it as an implicit claim that Julia would be faster while (more or less) doing the same thing. Of course, if it is not doing the same thing, comparisons become less useful.
Regarding providing code, etc.: I am in favor of test results in published papers being reproducible, though in this forum I’m not sure that is often provided. Regardless, it strikes me as somewhat improbable that readers here would wish to run Lisp benchmarks on their own computers. If I am mistaken and someone has an interest in such tests, I can try to nudge him/her past possible difficulties. Putting together a full instruction manual starting from ground zero is a lot of work if it is not going to be used. Apologies for expressing this notion in a brief piece of snark.
In posts by others, I see impressive results from compiler optimizations on Julia code, including aggressive use of unrolling and constant propagation. That might make code harder to debug, but that’s a common trade-off.
RJF
I am so sorry that my knowledge of Lisp is limited to some shallow knowledge of Clojure and minimal editing of Emacs configs. But one thing is for sure: there are definitely people here who wouldn’t mind running Lisp benchmarks (myself included). I think what the Julia community likes most is reproducibility and benchmarks, so if you can provide some assistance, you will definitely find volunteers who would be more than happy to write and run said benchmarks. Not because it is some sort of competition, but because the existence of such a benchmark suite can highlight weaknesses and unexplored corners of the current implementation and provide a foundation for future development and improvement of Julia itself.
This is a big task, no doubt, but a journey of a thousand miles begins with a single step, so if you can help with at least some of the steps, that would have immense value.
OK, fair enough, if you want to try lisp! Here’s a test you can run on your favorite lisp on your favorite computer, and compare with any other language or system. My findings are in a comment at the end.
;;;;;;;;;;;;;; cut here;;;;;;;;;;;;;;;;;
;
;; Lisp Horner's rule. provided here for possible Julia benchmark comparison
;; I believe that this file could be loaded in any ANSI Common Lisp and will
;; print out timing results.
;; I used Allegro CL 10.0 Sept 2015 (old!)
;; and SBCL 1.3.12 (recent)
(defun gevalf(k x) ; k is a list of coefficients in double-float. Horner's rule.
(let((ans 0.0d0))
(declare (optimize (speed 3)(safety 0)) (type double-float x ans))
(do ((i k (cdr i)) )
((null i)ans)
(setf ans (+ (the double-float (car i)) (* x ans))))))
(compile 'gevalf) ; in case it wasn't already. SBCL does this automatically
;; Here the same evaluation but generic + and *, no declarations, no specified optimization
;; This code is not used here, just for reference
(defun geval(k x) ; k is a list of coefficients
(let((ans 0))
(do ((i k (cdr i)) )
((null i)ans)
(setf ans (+ (car i) (* x ans))))))
;; sample test, a kind-of sparse polynomial of degree 24.
;; x^24+32x^21+9x^18+34x^12+34x^10+45*x^3
(defvar k '(1 0 0 32 0 0 9 0 0 0 0 0 34 0 34 0 0 0 0 0 0 45 0 0 0))
(defvar kf (mapcar #'(lambda(r)(coerce r 'double-float)) k))
;; It is possible to generate a straight-line program if A is a list of explicit numbers
(defun hornerm(A name) ; name is the function being defined
(let ((q (car A))
(x (gensym))) ; q will be arithmetic expression
(loop for i in (cdr A) do
(setf q `(+ ,(coerce i 'double-float) (* ,x ,q))))
; (print q)
(eval `(defun ,name(,x)(declare (double-float ,x)(optimize(speed 3)(safety 0)))
,q))
(compile name)))
;; A cleverer version for sparse polynomials
;;This is same as hornerm but avoids the addition of coefs 0.0,
;; making it faster for sparse polys
(defun hornerm0(A name)
(let ((q (car A))
(x (gensym)))
(loop for i in (cdr A) do
(setq q (if (= i 0.0)
`(* ,x ,q)
`(+ ,(coerce i 'double-float) (* ,x ,q)))))
; (print q)
(eval `(defun ,name(,x)(declare (double-float ,x)(optimize(speed 3)(safety 0)))
,q))
(compile name)))
(defun $sparseptest() ; set up a loop to test evaluation of sparse poly.
(declare (optimize (speed 3)(safety 0)))
(time(loop for i fixnum from 1 to 2000000 do
(sparsep 0.5d0))))
(compile '$sparseptest) ; must be compiled otherwise we would be timing the "loop" here
(defun $denseptest() ;; same polynomial but not checking for 0.0s.
(declare (optimize (speed 3)(safety 0)))
(time(loop for i fixnum from 1 to 2000000 do
(densep 0.5d0))))
(compile '$denseptest)
(defun $denseloop() ;; same polynomial but use the compact loop.
(declare (optimize (speed 3)(safety 0)))
(time(loop for i fixnum from 1 to 2000000 do
(gevalf kf 0.5d0))))
(compile '$denseloop)
(defun $emptyloop() ;; just see how long the (almost) empty loop takes
(declare (optimize (speed 3)(safety 0)))
(time(loop for i fixnum from 1 to 2000000 do (null-fun i))))
(defun null-fun(h) (+ 1 h)) ; to keep empty loop from being optimized away
(compile 'null-fun)
(compile '$emptyloop)
(hornerm0 kf 'sparsep) ; disassemble.. 208 bytes on i86 sbcl, 249 in Allegro CL
(hornerm kf 'densep) ; disassemble... 379 bytes sbcl, 722 in Allegro CL
(print 'sparse)
($sparseptest) ;0.062 sec non-gc cpu in Allegro CL
;0.062 sec in SBCL
(print 'dense)
($denseptest) ;0.16 sec non-gc cpu in Allegro CL
;0.109 sec in SBCL
(print 'dense_in_loop)
($denseloop) ;0.672 sec non-gc cpu in Allegro CL
;0.214 sec in SBCL
(print'empty_loop)
($emptyloop) ; 0.015 non-gc cpu in Allegro CL
; 0.015 in SBCL
#| My conclusion after fiddling with this stuff.
1. Open-coding of sparse Horner's rule is worth it for this example. Removing
   all those additions of 0.0 gives a speedup of 1.7 to 2.6x!! (We could do this always.)
2. Open coding (compared to a tight loop) is worth it for this sparse and degree 24 poly,
for a speedup of 10X.
3. There's some difference between SBCL and Allegro, though I have an old copy of Allegro. SBCL has the edge on these tests.
These timing tests were repeating the same computation 2,000,000 times. Try it!
|#
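For anyone who wants a rough Julia counterpart to the benchmark above, here is a sketch (not part of the original post; it mirrors gevalf and hornerm0, assumes the BenchmarkTools package is installed, and timings will of course depend on hardware):

```julia
using BenchmarkTools

# Same coefficients as the Lisp kf, highest degree first:
# x^24 + 32x^21 + 9x^18 + 34x^12 + 34x^10 + 45x^3
const kf = Float64[1,0,0,32,0,0,9,0,0,0,0,0,34,0,34,0,0,0,0,0,0,45,0,0,0]

# Analogue of gevalf: Horner's rule as a loop over the coefficient vector.
function gevalf(k::Vector{Float64}, x::Float64)
    ans = 0.0
    for c in k
        ans = muladd(x, ans, c)
    end
    return ans
end

# Analogue of hornerm0: build a Horner expression that skips the 0.0 coefficients,
# then eval it into a named, specialized straight-line function.
function hornerm0(A::Vector{Float64}, name::Symbol)
    q = :($(A[1]))
    for c in A[2:end]
        q = c == 0.0 ? :(x * $q) : :(muladd(x, $q, $c))
    end
    @eval $name(x::Float64) = $q
end
hornerm0(kf, :sparsep)

@btime gevalf($kf, 0.5)   # generic loop over the vector
@btime sparsep(0.5)       # open-coded sparse version
```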
More on evaluating polynomials …
I could also post code in Lisp that could easily be copied into Julia, since it was probably copied from Fortran or C or Algol … though some of it would be specifically for double-float real and complex coefficients or evaluation points.
Computing p(x) approximately is part of the problem, but so is getting a bound on the error in the evaluation, and optionally the derivative dp of p at x, as well as the error in dp.
Sometimes people need to be reminded of the existence of this kind of “professional” version of programs that one might think of as “easy for anyone to code up in any programming language”.
Everyone doing scientific programming should be acquainted with the examples showing that just writing out the quadratic formula as you learned it in high school is numerically unstable, and that there are simple ways to do much better. That’s hardly the only pitfall.
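For instance, here is the standard illustration in Julia (a sketch, not code from the post): when b^2 is much larger than 4ac, the textbook formula cancels catastrophically for the small root, while computing the large-magnitude root first and recovering the other from the product of the roots does not.

```julia
# Textbook quadratic formula: loses the small root's digits when b^2 >> 4ac.
function roots_naive(a, b, c)
    d = sqrt(b^2 - 4a*c)
    ((-b + d) / (2a), (-b - d) / (2a))
end

# Stable variant: no cancellation, since -b and -copysign(d, b) have the same sign.
function roots_stable(a, b, c)
    d = sqrt(b^2 - 4a*c)
    r1 = (-b - copysign(d, b)) / (2a)
    (c / (a * r1), r1)          # the other root via r1*r2 == c/a
end

roots_naive(1.0, 1e8, 1.0)   # small root comes out ≈ -7.45e-9 (true value ≈ -1.0e-8)
roots_stable(1.0, 1e8, 1.0)  # small root ≈ -1.0e-8, correct to full precision
```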
Even adding up a list of numbers can be done better (see Kahan summation).
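And a sketch of compensated (Kahan) summation, in its standard textbook form, for comparison with a naive left-to-right sum:

```julia
# Kahan summation: carry a running compensation for the low-order bits lost in each add.
function kahan_sum(xs)
    s = 0.0
    c = 0.0
    for x in xs
        y = x - c        # apply the previous compensation
        t = s + y
        c = (t - s) - y  # what was lost in computing s + y
        s = t
    end
    return s
end

xs = fill(0.1, 10^7)
foldl(+, xs) - 1e6    # naive left-to-right sum: visible roundoff error
kahan_sum(xs) - 1e6   # compensated sum: error down near the last bit of the result
```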
RJF
Well, you are answering the man who took it upon himself to write this PSA:
which is probably mostly cited because of its first and third points, the third one being: “Do your best to make your example self-contained (‘minimal working example’, MWE), […]”
Sorry, but it’s literally rocket science. The speed of the generated code is exactly what powers ModelingToolkit, and is how Jonnie, as described in this video, achieved a 15,000x speedup over Simulink:
I think this discussion has run its course. I believe there is a large group of scientists in need of a symbolic-numerics system which scales and generates optimized (and parallelized) code for models like differential equations. The response I keep seeing is that you don’t think this is important to a CAS, that it’s not the true goal of a real CAS, etc. If you don’t think that’s a CAS, sure, whatever. What that really says is that there is a major open opportunity for someone to build a CAS specifically catering to this audience, and it is clear that a large audience for it exists in the Julia programming language. To me, that is a very clear indicator that we are building something that will be impactful for a good group of individuals, which is reason enough for such a project.
Yes, Julia does all of this and more. What you’re describing is covered by devirtualization, custom calling conventions (the C ABI is unnecessarily slow unless a function has to be callable from C), inlining and constant propagation, all of which are fairly standard modern compiler techniques, employed extensively in Julia. Julia in particular takes devirtualization further than any other system I can think of, aided by late code generation (JIT) — we don’t generate code until we have actual values to apply it to — and an absolutely reckless willingness to generate many different specializations of the same source code. Seriously, the amount of code we generate would have gotten us shot in the 70s.
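A tiny sketch of that specialization in action (illustrative only): the same one-line method body is compiled, and fully devirtualized, separately for each concrete argument type it is called with.

```julia
double(x) = x + x          # one generic definition

# Each concrete argument type gets its own specialized, devirtualized native method:
@code_typed double(1)      # lowers to Base.add_int(x, x)::Int64
@code_typed double(1.0)    # lowers to Base.add_float(x, x)::Float64
@code_typed double(1//3)   # a third specialization, for Rational{Int}
```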
I’m really excited about this project, to say the least. It’s something I was hoping would emerge, and I’m glad that you are a driving force behind it. However, I’m also glad to hear some opposing voices on Discourse. I think these discussions will prove to be beneficial in ways that may not appear obvious at first, even if it’s just further clarifying your own purposes to yourself and/or the target audience.
Interestingly, we actually faced such a decision, probably in the 1970s.
Say that you want to do arithmetic with polynomials whose coefficients are elements in a finite field.
You also want to do arithmetic with polynomials whose coefficients are arbitrary precision integers.
or … double-floats, arbitrary-precision floats, rational numbers, … [say, 5 different domains. though maybe more…]
Choice one: “macro-expand” the package 5 times, so you have, in effect,
polynomial_add_finite_field
polynomial_mul_finite_field
etc.
Choice two: have only one polynomial_add, but deep inside there is a call to coeff_add, which has to decide, at run time, repeatedly, whether to do a finite_field add, a float add, or … etc.
The choice (in Macsyma) was choice 2. The DEC-10 had an 18-bit address space of 36-bit words, about 1.2 megabytes.
Should it be revisited? Over the years some version of the fast, specialized choice 1 has probably been explored, but not in the last 15 years, to my knowledge.
It would be pretty easy to do: define a new package to do arithmetic mod 13, say,
defining cplus and ctimes as doing exactly that,
copy over a few thousand lines of code without any change whatsoever, and make a list of the externally visible items (I guess we would say APIs?) …
say finite_field-modulo_13::polynomial_add, etc.
This would not be done JIT or automatically, but certainly, if we had code generation mod 13, it would be rather straightforward to generate mod 7. Or, more likely, we would specialize finite fields to just two versions:
finite-field with modulus less than 2^31 [or some fixnum/ bignum boundary]
finite-field with arbitrary precision modulus.
I suspect there’s actually not a lot of payoff in practice, since most arithmetic for Macsyma typically has to be done with (potentially) arbitrary-precision integers, so the brief “is the modulus set” check or type check is not a big overhead in that situation. On the other hand, the (small) finite-field stuff is all done to get speed in certain algorithms, so some pieces are probably hacked by hand.
Julia may make some of this easier or even automatic or possibly reckless, as you say.
RJF
If only 7 and 13 were the only integers!
(If only there weren’t an unbounded number of possible user-definable types that someone might want to apply generic algorithms to! If only those users didn’t have the gall to want fast, specialized versions of generic algorithms compiled for the types they happen to be using. Alas, we’re a bit beyond manual copy-pasta of code these days.)
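A minimal sketch of how this looks in Julia (the Mod type below is invented for illustration): the modulus becomes a type parameter, so “choice one” happens automatically and lazily; one generic definition is specialized by the compiler per modulus, and per coefficient type, the first time it is used.

```julia
# Integers mod P, with the modulus carried as a compile-time type parameter.
struct Mod{P} <: Number
    val::Int
    Mod{P}(x::Integer) where {P} = new{P}(mod(x, P))
end
Base.:+(a::Mod{P}, b::Mod{P}) where {P} = Mod{P}(a.val + b.val)
Base.:*(a::Mod{P}, b::Mod{P}) where {P} = Mod{P}(a.val * b.val)

# One generic Horner evaluation, written once for any coefficient domain:
horner(coeffs, x) = foldl((acc, c) -> acc * x + c, coeffs)

horner(Mod{13}.([1, 0, 5, 7]), Mod{13}(2))  # compiled specialization for Mod{13}
horner(Mod{7}.([1, 0, 5, 7]), Mod{7}(2))    # a separate specialization for Mod{7}
horner(big.([1, 0, 5, 7]), big(2))          # and another for BigInt, same source
```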