Here is a micro-benchmark of Julia, Nim, sbcl, Gambit-Scheme, and Racket for summing a vector of floating point values. Not really any feedback for Julia: this doesn’t warrant finding any extra performance tricks for Julia. Note that Racket discourse community was extremely helpful. I don’t think the final recommended code is super-obvious, but it is documented in the Racket Guide performance hints section. Gambit-Scheme developer Marc Feeley was very helpful, too.
Nothing surprising here. Startling how much simpler Julia makes this. Also, my bumpy Lisp/Scheme voyage makes clear how many benefits Julia derived from coming out of MIT’s PL:Scheme heritage–along with Julia’s many wise departures from Scheme.
Micro-Benchmark of Summing a Floating Point Vector
Take this with a grain of salt. All micro-benchmarks are fraught; it’s easier to do wrong than to do correctly. Please correct anything I have done incorrectly or suggest a better approach, as long as it’s not particularly intricate code: it does matter that good performance can be achieved with straight forward code. For this trivial example, it is in all the languages.
This benchmark compares summing a floating point array in the following languages:
Julia: Design around matrix arithmetic; relatively easy to implement strong typing (though not required)
Nim: a nice c/c++ substitute; requires typing every variable; static compilation
SBCL: a performant lisp which makes typing possible; REPL statements compiled before execution. Setting explicit types resulted in a nearly 5x speedup.
Gambit: a scheme that compiles via c to object code; typing possible but optional
Racket: a scheme with extensive libraries; typing possible but optional; REPL statements compiled before execution; functions in the flonum library already require floating point arguments so explicit type hints didn’t further improve performance.
The code for each language appears after the benchmarks. Benchmarks were run on my 2021 MacBook Pro M1. Benchmark timing varies quite a bit depending on other process running on the machine, even if the benchmarks are run in quick succession–yet another reason for healthy dose of salt.
Language | Timing in seconds for 1000 executions |
---|---|
Julia | .07 (using BenchmarkTools package) |
Nim | .04 (second run, compile as unsafe) |
Nim | .07 (compile as release) |
Gambit compiled | .51 |
SBCL in REPL | 1.2 (no type declarations) |
SBCL in REPL | .255 (type declarations) |
Racket in REPL | .17 (for/fold comprehension) |
Racket executable | Same as in REPL |
Comments
Using somewhat non-obvious tricks, Racket is the undisputed Lisp/Scheme champion. AOT compilation in the REPL is faster than compiled Gambit-Scheme for both sbcl and Racket. Of course, it’s not even fair to compare to Julia and Nim, both of which are strongly typed languges. Julia can also be dynamically typed, but it is usually easy to get “type stability” with Julia’s excellent type inference.
I got typed/racket to work, but it did not reduce execution time. My theory–awaiting some comments from experts is that the flonum library functions, fl+ and in-flvector, are already specialized on specific types, so explicit type hinting won’t enable the compiler to optimize them.
I received some suggestions from Gambit’s developer, which provided a 17% improvement.
Look at the code below. All the performance results are reasonable, but which language would you choose for numerical work?
Code:
Julia
function bench_vec_sum(len, fval)
v = fill(fval, len)
return sum(v)
end
Nim
# compile with nim c -d:danger --app:console --opt:speed --cc:clang --mm:orc benchmark_vector.nim
import std/algorithm # for fill
import os # for cmdline parameters
import std/strutils # to parse cmdline parameters
proc sum_vec_f64(vec: seq[float64]): float64 = # returns float64
var sum: float64 = 0.0
for x in vec: # iterator is faster than calculating index and referencing each slot
sum += x
return sum
proc vec_bench(vlen: int, fval: float64) =
var vec: seq[float64]
newseq(vec, vlen) # create and allocate memory for it
vec.fill(fval)
discard sum_vec_f64(vec) # ignore the return value
proc dobench(n: int, vlen: int, fval: float64) =
for i in 1 .. n:
vec_bench(vlen, fval)
# run it and ignore the return value
var fval: float64 # value for element of vector
var n: int # number of runs
var vlen: int # length of vector
if paramCount() == 3:
n = parseInt(paramStr(1))
vlen = parseInt(paramStr(2))
fval = parseFloat(paramStr(3))
dobench(n, vlen, fval)
else:
quit("Requires exactly 3 integer parameters. Exiting.")
Gambit-Scheme
;; #!/usr/bin/env gsi-script -:d0
;; compile as gsc -exe <source filename>
; this version intended to be used with Gambit Scheme
; not guaranteed to work with other versions of Scheme
(declare (block) ; mysterious incantation to enable inlining of procedures
(standard-bindings) ; per reco. for improved perf.
(extended-bindings) ; ditto
(not safe)) ; ditto
; sum the vector
(define (f64vector-sum vec)
(let ((len (f64vector-length vec)))
(do ((i (fx- len 1) (fx- i 1))
(sum 0.0 (fl+ sum (f64vector-ref vec i))))
((fx< i 0) sum))))
(define (f64vec-bench len fval)
(let ((vec (make-f64vector len fval)))
(f64vector-sum vec)))
(define (repeat n len fval)
(do ((i 0 (fx+ i 1)))
((fx= i n))
(f64vec-bench len fval)))
(time (repeat 1000 100000 0.5)) ; do a sum 1000 times
sbcl
; the following make no difference on execution time
; (declaim (optimize (speed 3) (debug 0) (safety 0)))
; (declaim (inline vec-bench)) ; 30 to 40% performance improvement
; (declaim (inline vector-sum))
; sum the vector
(defun f64vector-sum-do-typed (vec)
(declare (type (simple-array double-float (*)) vec))
(let ((len (length vec))
(sum 0.0d0))
(declare (type double-float sum))
(do ((i 0 (+ i 1)))
((= i len) sum)
(setf sum (+ sum (aref vec i))))))
; same performance as preceding do loop style
(defun f64vector-sum-dotimes-typed (vec)
(declare (type (simple-array double-float (*)) vec))
(let ((sum 0.0d0))
(declare (type double-float sum))
(dotimes (i (length vec) sum)
(incf sum (aref vec i)))))
(defun f64vector-sum-reduce (vec) ; type declaration won't improve performance
(reduce #'+ vec))
; same performance as do loop, but nicer
(defun f64vector-sum-for (vec) ; loop macro makes type declaration ineffective
(loop for i across vec
sum i))
(defun f64vector-sum-loop-typed (vec) ; slower than do-style loops above
(declare (type (simple-array double-float (*)) vec))
(let ((sum 0.0d0))
(declare (type double-float sum))
(loop for i across vec
do (setq sum (+ sum i)))
sum))
(defun f64vec-bench (len fval)
(let ((vec (make-array len :element-type 'double-float :initial-element fval)))
(f64vector-sum-do-typed vec)))
(defun repeat (n len fval)
(do ((i 0 (+ i 1)))
((= i n))
(f64vec-bench len fval)))
(repeat 1000 100000 0.5d0) ; create vector and sum 1000 times
racket
#lang racket/base
; (require ffi/vector) not being used
(require racket/flonum)
; 4 different ways to sum a vector
; pass the function name as input to repeat
; 1. sum the vector in a do loop
(define (vector-sum-do vec)
(let ((len (flvector-length vec))
(sum 0.0))
(do ((i 0 (add1 i)))
((= i len) sum)
(set! sum (fl+ sum (flvector-ref vec i))))))
; 2. sum the vector with a naive for/sum iterator:
(define (vector-sum-for/sum vec) ; much slower
(for/sum ((sum vec)) sum))
; 3. sum the vector in a for/sum iterator iterating the vec with in-flvector
(define (vector-sum-for/sum-infl vec)
(for/sum ([sum (in-flvector vec)])
sum))
; 4. sum the vector using for/fold: concise, idiomatic, fast
(define (vector-sum-for/fold vec)
(fl+ ;; hint to compiler to unbox variable sum
(for/fold ([sum 0.0])
([v (in-flvector vec)])
(fl+ sum v))
))
; 4.(a) alternative for/fold, with more idiomatic way to provide hint to unbox sum
(define (vector-sum-for/fold-result vec)
(for/fold ([sum 0.0] #:result (fl+ sum)) ;; hints: sum is float; for/fold is float
([v (in-flvector vec)])
(fl+ sum v)))
(define (vector-bench func len fval)
(let ((vec (make-flvector len fval)))
(func vec)))
(define (repeat func n len fval)
(do ((i 0 (add1 i)))
((= i n))
(vector-bench func len fval)))
; nicer way to express the repeat loop: no additional performance gain
; (define (repeat func n len fval)
; (for ([_ (in-range 0 n)])
; (vector-bench func len fval)))
; use as follows:
(time (repeat vector-sum-for/fold 1000 100000 0.5)) ; create vector 1000 times and sum
(#%declare #:unsafe)