Micro-benchmark

Here is a micro-benchmark of Julia, Nim, SBCL, Gambit-Scheme, and Racket for summing a vector of floating-point values. I’m not really looking for feedback on the Julia code: this example doesn’t warrant finding any extra performance tricks for Julia. Note that the Racket Discourse community was extremely helpful. I don’t think the final recommended code is super-obvious, but it is documented in the performance hints section of the Racket Guide. Gambit-Scheme developer Marc Feeley was very helpful, too.

Nothing surprising here. It’s startling how much simpler Julia makes this. Also, my bumpy Lisp/Scheme voyage makes clear how much Julia gained from MIT’s Scheme heritage, along with Julia’s many wise departures from Scheme.

Micro-Benchmark of Summing a Floating Point Vector

Take this with a grain of salt. All micro-benchmarks are fraught; they are easier to do wrong than to do correctly. Please correct anything I have done incorrectly or suggest a better approach, as long as it’s not particularly intricate code: it matters that good performance can be achieved with straightforward code. For this trivial example, the code is straightforward in all of the languages.

This benchmark compares summing a floating point array in the following languages:

Julia: designed around matrix arithmetic; relatively easy to use strong typing (though not required)

Nim: a nice C/C++ substitute; requires typing every variable; statically compiled

SBCL: a performant Lisp that makes typing possible; REPL statements are compiled before execution. Setting explicit types resulted in a nearly 5x speedup.

Gambit: a Scheme that compiles via C to object code; typing possible but optional

Racket: a Scheme with extensive libraries; typing possible but optional; REPL statements are compiled before execution; functions in the flonum library already require floating-point arguments, so explicit type hints didn’t further improve performance.

The code for each language appears after the benchmarks. Benchmarks were run on my 2021 MacBook Pro M1. Benchmark timing varies quite a bit depending on the other processes running on the machine, even if the benchmarks are run in quick succession, which is yet another reason for a healthy dose of salt.

Language               Timing in seconds for 1000 executions
Julia                  0.07   (using the BenchmarkTools package)
Nim                    0.04   (second run, compiled as unsafe: -d:danger)
Nim                    0.07   (compiled as release)
Gambit (compiled)      0.51
SBCL (in REPL)         1.2    (no type declarations)
SBCL (in REPL)         0.255  (with type declarations)
Racket (in REPL)       0.17   (for/fold comprehension)
Racket (executable)    same as in REPL

Comments

Using somewhat non-obvious tricks, Racket is the undisputed Lisp/Scheme champion. For both SBCL and Racket, AOT compilation in the REPL is faster than compiled Gambit-Scheme. Of course, it’s not even fair to compare against Julia and Nim, both of which are strongly typed languages. Julia is dynamically typed, but it is usually easy to get “type stability” with Julia’s excellent type inference.
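
(As an illustrative aside, not from the original post: one way to confirm type stability is to inspect the inferred types in the REPL.)

# If every binding in the output has a concrete inferred type (no Any),
# the function is type stable:
julia> @code_warntype bench_vec_sum(100_000, 0.5)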

I got typed/racket to work, but it did not reduce execution time. My theory, awaiting comments from experts, is that the flonum library functions, fl+ and in-flvector, are already specialized to flonums, so explicit type hints won’t enable the compiler to optimize them further.

I received some suggestions from Gambit’s developer, which provided a 17% improvement.

Look at the code below. All the performance results are reasonable, but which language would you choose for numerical work?

Code:

Julia

function bench_vec_sum(len, fval)
    v = fill(fval, len)  # allocate a length-len vector filled with fval
    return sum(v)        # allocation time is deliberately part of the benchmark
end
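
The table’s Julia timing came from BenchmarkTools; the exact invocation isn’t shown here, but a minimal harness would look something like this (the 100,000-element, 0.5-valued workload matches the other languages):

using BenchmarkTools
# Collect many timed samples of the full benchmark, including the allocation:
@benchmark bench_vec_sum(100_000, 0.5)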

Nim

# compile with nim c -d:danger --app:console --opt:speed --cc:clang --mm:orc  benchmark_vector.nim

import std/algorithm  # for fill
import std/os          # for cmdline parameters
import std/strutils   # to parse cmdline parameters

proc sum_vec_f64(vec: seq[float64]): float64 =    # returns float64
  var sum: float64 = 0.0
  for x in vec:   # iterator is faster than calculating index and referencing each slot
    sum += x
  return sum


proc vec_bench(vlen: int, fval: float64) =
  var vec: seq[float64]
  newSeq(vec, vlen) # create and allocate memory for it
  vec.fill(fval)
  discard sum_vec_f64(vec) # ignore the return value

    
proc dobench(n: int, vlen: int, fval: float64) =
  for i in 1 .. n:
    vec_bench(vlen, fval)
    

# run it and ignore the return value
var fval: float64 # value for element of vector
var n: int        # number of runs
var vlen: int     # length of vector
if paramCount() == 3:
  n = parseInt(paramStr(1))
  vlen = parseInt(paramStr(2))
  fval = parseFloat(paramStr(3))
  dobench(n, vlen, fval)
else:
  quit("Requires exactly 3 integer parameters. Exiting.")

Gambit-Scheme

;; #!/usr/bin/env gsi-script -:d0
;; compile as gsc -exe <source filename>

; this version intended to be used with Gambit Scheme
;   not guaranteed to work with other versions of Scheme


(declare (block)    ; mysterious incantation to enable inlining of procedures
  (standard-bindings)   ; per reco. for improved perf.
  (extended-bindings)   ; ditto
  (not safe))            ; ditto

; sum the vector
(define (f64vector-sum vec)
  (let ((len (f64vector-length vec)))
    (do ((i (fx- len 1) (fx- i 1))
         (sum 0.0 (fl+ sum (f64vector-ref vec i))))
        ((fx< i 0) sum))))

(define (f64vec-bench len fval)
  (let ((vec (make-f64vector len fval)))
        (f64vector-sum vec)))

(define (repeat n len fval)
  (do ((i 0 (fx+ i 1)))
    ((fx= i n))
    (f64vec-bench len fval)))


(time (repeat 1000 100000 0.5)) ; do a sum 1000 times

SBCL

; this declaim made no measurable difference on execution time:
; (declaim (optimize (speed 3) (debug 0) (safety 0)))
; these gave a 30 to 40% performance improvement:
; (declaim (inline f64vec-bench))
; (declaim (inline f64vector-sum-do-typed))

; sum the vector
(defun f64vector-sum-do-typed (vec)
  (declare (type (simple-array double-float (*)) vec))
  (let ((len (length vec))
        (sum 0.0d0))
    (declare (type double-float sum))
    (do ((i 0 (+ i 1)))
        ((= i len) sum)
      (setf sum (+ sum (aref vec i))))))

; same performance as preceding do loop style
(defun f64vector-sum-dotimes-typed (vec)
  (declare (type (simple-array double-float (*)) vec))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (dotimes (i (length vec) sum)
      (incf sum (aref vec i)))))


(defun f64vector-sum-reduce (vec) ; type declaration won't improve performance
    (reduce #'+ vec))

; same performance as do loop, but nicer
(defun f64vector-sum-for (vec) ; loop macro makes type declaration ineffective
    (loop for i across vec
          sum i))

(defun f64vector-sum-loop-typed (vec)  ; slower than do-style loops above
  (declare (type (simple-array double-float (*)) vec))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop for i across vec
          do (setq sum (+ sum i)))
    sum))

(defun f64vec-bench (len fval)
  (let ((vec (make-array len :element-type 'double-float :initial-element fval)))
    (f64vector-sum-do-typed vec)))

(defun repeat (n len fval)
  (do ((i 0 (+ i 1)))
      ((= i n))
    (f64vec-bench len fval)))

(time (repeat 1000 100000 0.5d0)) ; create vector and sum 1000 times

Racket

#lang racket/base

; (require ffi/vector) not being used
(require racket/flonum)
(#%declare #:unsafe) ; compile this module in unsafe mode

; 4 different ways to sum a vector
; pass the function name as input to repeat

; 1. sum the vector in a do loop
(define (vector-sum-do vec)
  (let ((len (flvector-length vec))
        (sum 0.0))
    (do ((i 0 (add1 i)))
        ((= i len) sum)
      (set! sum (fl+ sum (flvector-ref vec i))))))


; 2. sum the vector with a naive for/sum iterator:
(define (vector-sum-for/sum vec) ; much slower
  (for/sum ((sum vec)) sum))

; 3. sum the vector in a for/sum iterator iterating the vec with in-flvector
(define (vector-sum-for/sum-infl vec)
  (for/sum ([sum (in-flvector vec)])
    sum))


; 4. sum the vector using for/fold: concise, idiomatic, fast
(define (vector-sum-for/fold vec)
  (fl+                     ;; hint to the compiler to unbox sum
   (for/fold ([sum 0.0])
             ([v (in-flvector vec)])
     (fl+ sum v))))

; 4.(a) alternative for/fold, with more idiomatic way to provide hint to unbox sum
(define (vector-sum-for/fold-result vec)
  (for/fold ([sum 0.0] #:result (fl+ sum)) ;; hints: sum is float; for/fold is float
            ([v (in-flvector vec)])
    (fl+ sum v)))

(define (vector-bench func len fval)
  (let ((vec (make-flvector len fval)))
    (func vec)))

(define (repeat func n len fval)
  (do ((i 0 (add1 i)))
    ((= i n))
    (vector-bench func len fval)))

; nicer way to express the repeat loop: no additional performance gain
; (define (repeat func n len fval)
;   (for ([_ (in-range 0 n)])
;     (vector-bench func len fval)))


; use as follows:
(time (repeat vector-sum-for/fold 1000 100000 0.5)) ; create vector 1000 times and sum



Are you purposefully wanting to include the time it takes to allocate an array before the summation? You can just do bench_sum(len, fval) = sum(_ -> fval, 1:len), which will be a lot faster since you won’t be spending a bunch of time allocating an array that you then sum over.
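
For concreteness, here is that suggestion as a runnable snippet (bench_sum is the reply’s name for the allocation-free variant):

# Sums fval once per element of 1:len without ever materializing a vector:
bench_sum(len, fval) = sum(_ -> fval, 1:len)

bench_sum(100_000, 0.5)  # 50000.0, with zero array allocations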

Assuming you do want to include the time to construct the vector, though, here’s one thing you can do to speed up the Julia code:

using Bumper, StrideArrays
function bench_vec_sum_bumper(len, fval::T) where {T}
    buf = default_buffer()  # reusable bump-allocator buffer
    @no_escape buf begin    # everything allocated inside is freed when the block exits
        v = alloc(T, buf, len) .= fval  # bump-allocate, then broadcast-fill
        sum(v)
    end
end

Here’s a comparison between this version and the original

julia> @benchmark bench_vec_sum_bumper(100000, 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.377 μs …  47.770 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.648 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.755 μs ± 805.896 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▆█▇▅▂▂▃▁                                                   ▂
  ▇█████████▆▆▅▆▄▅▅▅▅▅▄▃▄▁▃▃▅▅▆▅▅▄▁▁▄▄▅▄▄▅▁▃▁▃▃▁▄▄▃▆▅▄▁▃▅▃▄▁▁▃ █
  14.4 μs       Histogram: log(frequency) by time      18.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bench_vec_sum(100000, 0.5)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.035 μs … 537.247 μs  ┊ GC (min … max):  0.00% … 68.20%
 Time  (median):     52.149 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   60.856 μs ±  56.494 μs  ┊ GC (mean ± σ):  10.24% ±  9.91%

  ██▃         ▁▁                                               ▂
  ███▇▄▁▁▁▁▁▁▁██▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▆ █
  45 μs         Histogram: log(frequency) by time       469 μs <

 Memory estimate: 781.30 KiB, allocs estimate: 2.

so that’s a handy little 3x speedup.


All benchmarks seem to include the allocation, so this is only fair.

A fair point, to be sure, but I wanted to include the allocation time (over and over). While the OS responds to requests for heap memory, how each language manages that has a big impact on performance.

So, all the benchmarks in all of the languages include that. In Julia, could I create an uninitialized array and then fill it or .= assign it any faster? I didn’t think so.
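
For reference, a quick sketch of the two allocation patterns in question (timings would need to be confirmed with BenchmarkTools):

using BenchmarkTools
@btime fill(0.5, 100_000);                        # allocate and initialize in one call
@btime (Vector{Float64}(undef, 100_000) .= 0.5);  # allocate uninitialized, then broadcast-fill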

It’s not just doing that:

And it’s not just the extra allocation, either. I’m not saying it’s unfair to include allocation (and deallocation), just that you have to realize you’re measuring allocation, deallocation, and likely GC (tracing) overhead as well. That’s the default use of Julia, so it’s not unfair. But the GC can be avoided, and so can the default allocator, as already shown.

You get relatively different overheads depending on how large len is, and additionally you may have a problem if the array doesn’t fit in cache (L3, or even L1):

julia> @benchmark bench_vec_sum(100, 3)
BenchmarkTools.Trial: 10000 samples with 891 evaluations.
 Range (min … max):  100.899 ns … 1.347 μs  ┊ GC (min … max): 0.00% … 65.02%
 Time  (median):     162.260 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   187.831 ns ± 118.051 ns ┊ GC (mean ± σ): 6.31% ± 8.78%

  ██▆▅▄▃▁                                                    ▂
  ▆▄▄████████▆▅▆▅▆▆▅▄▆▆▆▆▆▅▇▇▆▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▅▄▆▇▆▃▆▆ █
  101 ns        Histogram: log(frequency) by time        991 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

For sum you must scan the whole array, from index 1 and up; summing in a different direction can give a different answer with floating point. You can use threads; it might help.
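
A sketch of what a threaded sum could look like (threaded_sum is a hypothetical name, not from the thread; note that combining per-chunk partial sums changes the floating-point summation order):

using Base.Threads

function threaded_sum(v::Vector{Float64})
    # Split the index range into one chunk per thread and sum each chunk on its own task.
    chunklen = cld(length(v), nthreads())
    tasks = [Threads.@spawn sum(@view v[c])
             for c in Iterators.partition(eachindex(v), chunklen)]
    sum(fetch.(tasks))  # combine partial sums (a different rounding order than a serial sum)
end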

But this got me thinking: fill might fill the array in the same direction. I haven’t confirmed that, but it seems so. For cache reasons, though, it would help to fill in the other direction (opposite to the sum), at least if the array doesn’t fit in L3 cache.

I’d consider 100,000 small. If 1,000,000 is sure to blow out the cache, then I’ll do that. But I am not trying to benchmark the hardware implementation of memory or floating point. It’s a crude benchmark of code. And, yes, allocating the vector counts.

No, fill(fval, n) should be just as fast as Vector{T}(undef, n) .= fval. What I did in the above code with Bumper.jl was make it faster to allocate/deallocate the vector using a technique called a ‘bump allocator’ or ‘arena allocator’.
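
To illustrate the idea, here is a toy sketch of the concept only, not Bumper.jl’s actual implementation: a bump allocator pre-allocates one large buffer, services each allocation by bumping an offset, and “frees” everything at once by resetting the offset.

# Hypothetical toy arena, for illustration:
mutable struct ToyArena
    buf::Vector{UInt8}  # one big pre-allocated buffer
    offset::Int         # bytes handed out so far
end
ToyArena(nbytes) = ToyArena(Vector{UInt8}(undef, nbytes), 0)

function toy_alloc(a::ToyArena, ::Type{T}, len) where {T}
    nbytes = sizeof(T) * len
    a.offset + nbytes <= length(a.buf) || error("arena exhausted")
    ptr = Ptr{T}(pointer(a.buf) + a.offset)
    a.offset += nbytes                # "bump" the offset; no per-allocation bookkeeping
    unsafe_wrap(Array, ptr, len)      # view the bytes as a Vector{T} (buffer must outlive it)
end

reset!(a::ToyArena) = (a.offset = 0)  # free everything at once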