Why I love Julia

I just gave a talk at my company attempting to explain in one hour why we are all so excited about Julia. I tried to focus on the essence of it, borrowing material from the manual, Stefan Karpinski's "The Unreasonable Effectiveness of Multiple Dispatch" talk, and others.

Thought I’d share the nextjournal notebook I used to present, for others to fork and reuse.

Notifications of inaccuracies and omissions welcome.


I am expecting to give a similar talk soon so this will be very useful. Thanks!



Nice work. For your summed case you could write it like this:

function summed(a)
    result = 0.0  # note 0.0
    @simd for x in a  # note @simd
        result += x
    end
    return result
end

This is as fast as the built-in sum on 1.2 (for me at least). The built-in sum is more generic and handles more corner cases. I first tried changing 0 to 0.0 to match the type of the elements of a, but that alone didn't speed it up (so 1.2 must be smarter than previous versions).



When I give the talk I explain that this is a naive impl of sum, and I show that, for example, a better init for result would be zero(eltype(a)).
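For illustration, here is a sketch of what that generic init looks like (`summed_generic` is just an illustrative name, not from the notebook):

```julia
# Naive sum with a type-matched accumulator: zero(eltype(a)) gives 0.0 for
# Float64 arrays, 0 for Int arrays, etc., so the loop stays type-stable.
function summed_generic(a)
    result = zero(eltype(a))
    for x in a
        result += x
    end
    return result
end

summed_generic([1, 2, 3])        # → 6
summed_generic([1.0, 2.0, 3.0])  # → 6.0
```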

I tried slapping a @simd in front of the for loop but it doesn't speed it up in the notebook env, which is 1.2. What array size did you try this with to see a speed-up? I'm guessing the speed-up from naively putting @simd in front of the for loop depends on size(a), whereas the recursive binary divide-and-conquer in the Base.sum impl gives you a speed-up on a broad range of sizes, probably? Anyone have some insight here?


The divide and conquer algorithm in Base for floating point summation is used to reduce accumulation error as far as I know. Naive linear summation has O(n) additions whereas the one in base does O(log n).


Isn't the number of non-leaf nodes in a full binary tree with n leaves n-1? The total number of additions is still linear.

But maybe doing the additions in the bottom up binary tree order leads to less error?

http://kristofferc.github.io/post/intrinsics/ is a good read. Still not entirely sure how the divide-and-conquer strategy helps.

Yes of course, my mistake - it’s purely to reduce summation error, not necessarily faster.

Thanks for the explanation, pairwise summation it is!
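For reference, the divide-and-conquer idea can be sketched like this (a hypothetical `pairwise_sum`, not Base's actual implementation, which also uses a blocked, `@simd`-friendly base case):

```julia
# Pairwise summation sketch: recursively split the range, sum each half, and
# add the two partial sums. Partial sums of similar magnitude are combined,
# which keeps the accumulated floating-point error at O(log n) instead of
# the O(n) of a plain left-to-right loop.
function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a); cutoff=128)
    if hi - lo < cutoff
        # Small base case: plain loop.
        result = zero(eltype(a))
        for i in lo:hi
            result += a[i]
        end
        return result
    end
    mid = (lo + hi) ÷ 2
    return pairwise_sum(a, lo, mid; cutoff=cutoff) +
           pairwise_sum(a, mid + 1, hi; cutoff=cutoff)
end

pairwise_sum(collect(1.0:1000.0))  # → 500500.0
```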

The remaining mysteries are why prepending @simd does not speed up my naive loop when it does speed up yours, and why the @simd divide-and-conquer for loop in Base.sum does give me a speed-up…

I’ll try playing with it until I figure it out

Hi kolia,

I used the same input as you. My versioninfo() returns:

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

Maybe the reason you don't see a speed-up is related to your CPU or OS. Here's my @code_native to compare with yours:

julia> @code_native summed(a)
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:71 within `macro expansion'
; ││┌ @ simdloop.jl:51 within `simd_inner_length'
; │││┌ @ REPL[2]:2 within `length'
	movq	8(%rdi), %rax
; │└└└
; │┌ @ int.jl:49 within `macro expansion'
	testq	%rax, %rax
; └└
; ┌ @ simdloop.jl:72 within `summed'
	jle	L26
	movq	(%rdi), %rcx
; │ @ simdloop.jl:75 within `summed'
	cmpq	$16, %rax
	jae	L34
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%edx, %edx
	jmp	L124
	vxorps	%xmm0, %xmm0, %xmm0
; └
; ┌ @ REPL[2]:6 within `summed'
; │ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:75 within `macro expansion'
	movq	%rax, %rdx
	andq	$-16, %rdx
	leaq	96(%rcx), %rsi
	vxorpd	%xmm0, %xmm0, %xmm0
; ││ @ simdloop.jl:78 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	movq	%rdx, %rdi
	vxorpd	%xmm1, %xmm1, %xmm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vxorpd	%xmm3, %xmm3, %xmm3
; ││└
; ││ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
	vaddpd	-96(%rsi), %ymm0, %ymm0
	vaddpd	-64(%rsi), %ymm1, %ymm1
	vaddpd	-32(%rsi), %ymm2, %ymm2
	vaddpd	(%rsi), %ymm3, %ymm3
; │└└
; │┌ @ int.jl:53 within `macro expansion'
	subq	$-128, %rsi
	addq	$-16, %rdi
	jne	L64
; │└
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
	vaddpd	%ymm0, %ymm1, %ymm0
	vaddpd	%ymm0, %ymm2, %ymm0
	vaddpd	%ymm0, %ymm3, %ymm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
	cmpq	%rdx, %rax
; └└└
; ┌ @ simdloop.jl:75 within `summed'
	je	L158
	subq	%rdx, %rax
	leaq	(%rcx,%rdx,8), %rcx
	nopw	%cs:(%rax,%rax)
; └
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
	vaddsd	(%rcx), %xmm0, %xmm0
; ││└
; ││ @ simdloop.jl:75 within `macro expansion'
; ││┌ @ int.jl:49 within `<'
	addq	$8, %rcx
	addq	$-1, %rax
; ││└
	jne	L144
; │└
; │ @ REPL[2]:6 within `summed'
	nopw	%cs:(%rax,%rax)
; └


You can do zero(typeof(a)) or more simply zero(a).

Nice presentation, @kolia!

I gave a talk (Nextjournal link) a few months ago to a less technical audience (I assume based on some of the elements of your notebook that your audience has more of a computer science background).

My audience was actuaries, who have more of a math and mild-programming via data-science background.

I also included a comparison with R and Python, based on a combination of opinion and feedback on the Julia Slack:

Some updates are probably needed given the intervening months (e.g. 2.5/3 stars for Parallelism with 1.3).


tbf, I wouldn't call Julia 'written in Julia', although the stdlib is mostly written in Julia.

I didn’t know that R is written in Rust


I’m pretty sure most of R is written in C - then again a lot may have changed since I last looked.


Indeed: it is mostly R, C, and Fortran:

Packages also use C++ and other languages. There is a project with Rust, but it appears to be dormant:

Yea, I think I meant to say “R itself” per its Wikipedia page.

Finally got around to figuring it out, it was very simple.

Turns out that adding @simd without fixing the type mismatch between the result = 0 init and eltype(a) keeps @simd from doing its thing.

This does give the speedup, and it’s nice and clean and eltype-agnostic:

function sumsimd(a)
  result = zero(eltype(a))
  @simd for x in a
    result += x
  end
  return result
end

I added this @simd for x in a version to the nextjournal presentation notebook, and added the link to Pairwise Summation wikipedia page explaining why Base.sum does what it does.


As you guessed, my audience was software engineers who work mostly with Scala and Python, and data scientists.

Julia is an easy sell to data science folks, so I targeted the software folks. Julia is simply a good general-purpose programming language, and I set out to explain why.

The presentation was met with a lot of nods, "what's the catch", and "when can we switch". The main dissent came from some who were horrified by the lack of access control when defining new methods on functions like Base.print for a new datatype, which led to a discussion of what type piracy is and why it hasn't been an issue. Not everyone was convinced by the argument that the Julia community has been doing this for a while without the sky coming crashing down, and that we see more benefits from it than downsides…
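To make the distinction concrete, here is a hypothetical sketch (`Point` is an invented type, not from the talk):

```julia
# Extending a Base function for a type you own: normal, encouraged practice.
struct Point
    x::Float64
    y::Float64
end

# This adds a method to Base.show; it only applies to Point, which we own,
# so it cannot change behavior for any other code.
Base.show(io::IO, p::Point) = print(io, "Point(", p.x, ", ", p.y, ")")

# Type piracy would be redefining a method on types you do NOT own, e.g.
#   Base.:+(a::Int, b::Int) = 0
# That is legal, but it changes behavior for every package in the session,
# which is why the community strongly discourages it.

repr(Point(1.0, 2.0))  # → "Point(1.0, 2.0)"
```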


Not to be a wet blanket, but given the R packages parallel and foreach at the native R level and RcppParallel at the Rcpp level, I have a little trouble giving R only one star for parallel support.

I will say that it is a bit easier in Julia. But truth be told, I found RcppParallel to more efficiently use multiple threads than Julia on the same machine with the same algorithm. I have no insight into the why of that and user error is a very real possibility.

Of course, the fact that you have to think about two different languages for parallelization in R instead of one points directly to the two-language problem that Julia answers.

Hi Kolia,

Good to hear you figured it out. I like that you explained why Julia's sum is so much more complicated; when I looked at it the first time I was totally confused, and it didn't give me the impression that Julia is easy to use.