Why I love Julia

kolia · October 4, 2019, 3:29pm

I just gave a talk at my company attempting to explain, in one hour, why we are all so excited about Julia. I tried to focus on the essence of it, borrowing material from the manual, Stefan’s “Unreasonable effectiveness of multiple dispatch” talk and others.

Thought I’d share the nextjournal notebook I used to present, for others to fork and reuse.

Notifications of inaccuracies and omissions welcome.

evanfields · October 4, 2019, 6:45pm

I am expecting to give a similar talk soon so this will be very useful. Thanks!

Glen · October 5, 2019, 12:27am

Hi,

Nice work. For your summed case you could write it like this:

function summed(a)
    result = 0.0  # note 0.0
    @simd for x in a  # note @simd
        result += x
    end
    return result
end

This is as fast as the built-in sum on 1.2 (for me at least). The built-in sum is more generic and handles more corner cases. I first tried changing 0 to 0.0 to match the type of the elements of a, but that alone didn’t speed it up (so 1.2 must be smarter than previous versions).

Glen

kolia · October 5, 2019, 10:43am

When I give the talk, I explain that this is a naive impl of sum, and I show that, for example, a better init for result would be zero(eltype(a)).

I tried slapping a @simd in front of the for loop, but it doesn’t speed it up in the notebook env, which is 1.2. What array size did you try this with to see a speed up? I’m guessing the speed up from naively putting @simd in front of the for loop depends on size(a), whereas the recursive binary divide and conquer in the Base.sum impl probably gives you a speed up over a broad range of sizes. Anyone have some insight here?

Sukera · October 5, 2019, 11:23am

The divide and conquer algorithm in Base for floating point summation is used to reduce accumulation error as far as I know. ~~Naive linear summation has O(n) additions whereas the one in base does O(log n).~~

kolia · October 5, 2019, 1:00pm

Isn’t the number of non leaf nodes in a full binary tree with n leaves n-1? It’s still linear.

But maybe doing the additions in the bottom up binary tree order leads to less error?

“SIMD and SIMD-intrinsics in Julia” by Kristoffer Carlsson is a good read. Still not entirely sure how the divide and conquer strategy helps.

Sukera · October 5, 2019, 5:12pm

Yes of course, my mistake - it’s purely to reduce summation error, not necessarily faster.
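
To illustrate the accuracy point, here is a minimal sketch comparing naive left-to-right summation with a simple pairwise (divide and conquer) summation. It is not the implementation used in Base; the function names, the block size of 128, and the Float32 test array are all assumptions chosen for illustration.

```julia
# Naive left-to-right summation: rounding error can grow with the length of a.
function naive_sum(a)
    s = zero(eltype(a))
    for x in a
        s += x
    end
    return s
end

# Pairwise summation: sum each half recursively, then add the two halves.
# It performs the same number of additions overall, but in a tree-shaped order.
# The block size of 128 is an arbitrary choice for this sketch.
function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a))
    if hi - lo < 128
        s = zero(eltype(a))
        for i in lo:hi
            s += a[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid + 1, hi)
end

# Float32 makes the accumulated rounding error easy to see.
a = fill(0.1f0, 10^7)
ref = 10^7 * Float64(0.1f0)   # reference value computed in Float64

naive_err = abs(naive_sum(a) - ref)
pairwise_err = abs(pairwise_sum(a) - ref)
println("naive error:    ", naive_err)
println("pairwise error: ", pairwise_err)
```

On this input the pairwise version should land much closer to the reference, even though both versions do a linear number of additions.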

kolia · October 5, 2019, 6:52pm

Thanks for the explanation, pairwise summation it is!

The remaining mysteries are why prepending @simd does not speed up my naive loop when it does speed up yours, and why the @simd divide and conquer for loop in Base.sum does give me a speed up…

I’ll try playing with it until I figure it out

Glen · October 6, 2019, 12:52am

Hi kolia,

I used the same input as you. My versioninfo() returns:

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

Maybe the reason you don’t see a speed-up is related to your CPU or OS. Here’s my @code_native to compare with yours:

julia> @code_native summed(a)
	.text
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:71 within `macro expansion'
; ││┌ @ simdloop.jl:51 within `simd_inner_length'
; │││┌ @ REPL[2]:2 within `length'
	movq	8(%rdi), %rax
; │└└└
; │┌ @ int.jl:49 within `macro expansion'
	testq	%rax, %rax
; └└
; ┌ @ simdloop.jl:72 within `summed'
	jle	L26
	movq	(%rdi), %rcx
; │ @ simdloop.jl:75 within `summed'
	cmpq	$16, %rax
	jae	L34
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%edx, %edx
	jmp	L124
L26:
	vxorps	%xmm0, %xmm0, %xmm0
; └
; ┌ @ REPL[2]:6 within `summed'
	vzeroupper
	retq
; │ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:75 within `macro expansion'
L34:
	movq	%rax, %rdx
	andq	$-16, %rdx
	leaq	96(%rcx), %rsi
	vxorpd	%xmm0, %xmm0, %xmm0
; ││ @ simdloop.jl:78 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	movq	%rdx, %rdi
	vxorpd	%xmm1, %xmm1, %xmm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vxorpd	%xmm3, %xmm3, %xmm3
; ││└
; ││ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
L64:
	vaddpd	-96(%rsi), %ymm0, %ymm0
	vaddpd	-64(%rsi), %ymm1, %ymm1
	vaddpd	-32(%rsi), %ymm2, %ymm2
	vaddpd	(%rsi), %ymm3, %ymm3
; │└└
; │┌ @ int.jl:53 within `macro expansion'
	subq	$-128, %rsi
	addq	$-16, %rdi
	jne	L64
; │└
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
	vaddpd	%ymm0, %ymm1, %ymm0
	vaddpd	%ymm0, %ymm2, %ymm0
	vaddpd	%ymm0, %ymm3, %ymm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
	cmpq	%rdx, %rax
; └└└
; ┌ @ simdloop.jl:75 within `summed'
	je	L158
L124:
	subq	%rdx, %rax
	leaq	(%rcx,%rdx,8), %rcx
	nopw	%cs:(%rax,%rax)
; └
; ┌ @ REPL[2]:3 within `summed'
; │┌ @ simdloop.jl:77 within `macro expansion' @ REPL[2]:4
; ││┌ @ float.jl:395 within `+'
L144:
	vaddsd	(%rcx), %xmm0, %xmm0
; ││└
; ││ @ simdloop.jl:75 within `macro expansion'
; ││┌ @ int.jl:49 within `<'
	addq	$8, %rcx
	addq	$-1, %rax
; ││└
	jne	L144
; │└
; │ @ REPL[2]:6 within `summed'
L158:
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
; └

Glen

John_Gibson · October 6, 2019, 1:32am

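You can do zero(eltype(a)). Note that zero(a) on an array returns an array of zeros rather than a scalar, so it would not work as the accumulator here.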
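
A quick sketch of what the zero calls return for an array input, assuming a made-up Vector{Float64}. It suggests that zero(eltype(a)) is the form that yields a scalar suitable for initializing the accumulator.

```julia
a = [1.5, 2.5, 3.0]              # made-up input for illustration

scalar_zero = zero(eltype(a))    # scalar zero of the element type
array_zero = zero(a)             # array of zeros with the same shape as a

println(typeof(scalar_zero))
println(typeof(array_zero))
```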

Alec_Loudenback · October 6, 2019, 3:38am

Nice presentation, @kolia!

I gave a talk (Nextjournal link) a few months ago to a less technical audience (I assume based on some of the elements of your notebook that your audience has more of a computer science background).

My audience was actuaries, who have more of a math and mild-programming via data-science background.

I also included a comparison of Julia with R and Python, based on a combination of opinion and feedback on the Julia Slack:

Some updates are probably needed given the intervening months (e.g. 2.5/3 stars for Parallelism with 1.3).

jling · October 6, 2019, 3:41am

tbf, I wouldn’t call Julia ‘written’ in Julia, although the stdlib is mostly written in Julia

giordano · October 6, 2019, 9:34am

I didn’t know that R is written in Rust

anon92994695 · October 6, 2019, 10:26am

I’m pretty sure most of R is written in C - then again a lot may have changed since I last looked.

Tamas_Papp · October 6, 2019, 11:05am

Indeed: it is mostly R, C, and Fortran:

https://github.com/wch/r-source

Packages also use C++ and other languages. There is a project with Rust, but it appears to be dormant:

https://github.com/rustr/rustinr

Alec_Loudenback · October 6, 2019, 1:07pm

Yea, I think I meant to say “R itself” per its Wikipedia page.

kolia · October 6, 2019, 3:30pm

Finally got around to figuring it out, it was very simple.

Turns out that adding @simd without fixing the type mismatch between the result = 0 init and eltype(a) keeps @simd from doing its thing.

This does give the speedup, and it’s nice and clean and eltype-agnostic:

function sumsimd(a)
  result = zero(eltype(a))
  @simd for x in a
    result += x
  end
  return result
end

I added this @simd for x in a version to the nextjournal presentation notebook, and added a link to the Wikipedia page on pairwise summation, which explains why Base.sum does what it does.
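
The type mismatch described above can be inspected directly. The sketch below is an illustration only: the two function names are made up, @simd is left out to isolate the effect of the initial value, and Base.return_types is an unexported introspection helper. It asks the compiler what type each version is inferred to return.

```julia
# Accumulator starts as an Int, so its type changes once a Float64 is added.
function summed_int_init(a)
    result = 0
    for x in a
        result += x
    end
    return result
end

# Accumulator starts with the element type of a, so its type never changes.
function summed_typed_init(a)
    result = zero(eltype(a))
    for x in a
        result += x
    end
    return result
end

a = rand(1000)

# Inferred return types for a Vector{Float64} argument.
int_init_types = Base.return_types(summed_int_init, (typeof(a),))
typed_init_types = Base.return_types(summed_typed_init, (typeof(a),))
println(int_init_types)
println(typed_init_types)
```

If the first result shows a Union of Int and Float64 while the second shows plain Float64, that is the instability that zero(eltype(a)) removes.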

kolia · October 6, 2019, 3:52pm

As you guessed, my audience was software engineers who work mostly with Scala and Python, and data scientists.

Julia is an easy sell to data science folks, so I targeted the software folks. Julia is simply a good general-purpose programming language, and I set out to explain why.

The presentation was met with a lot of nods, plus “what’s the catch?” and “when can we switch?”. The main dissent came from some people being horrified by the lack of privacy involved in defining new methods on functions like Base.print whenever you define a new datatype, which led to a discussion of what type piracy is and why it hasn’t been an issue. Not everyone was convinced by the argument that the Julia community has been doing this for a while without the sky coming crashing down, and that we see more benefits from it than downsides…
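
For readers unfamiliar with the distinction, here is a small sketch. The Celsius type is hypothetical and the pirating method is left commented out on purpose. Extending a Base function for a type you own is the ordinary pattern; type piracy usually refers to adding a method to a function you do not own for types you do not own either.

```julia
# A type defined (and therefore owned) by this code.
struct Celsius
    degrees::Float64
end

# Not piracy: Base.show belongs to Base, but Celsius is ours.
Base.show(io::IO, t::Celsius) = print(io, t.degrees, " °C")

# This WOULD be type piracy, since neither Base.:+ nor Int and String are ours.
# It would silently change behavior for every other package in the session:
# Base.:+(x::Int, s::String) = x + parse(Int, s)

shown = sprint(show, Celsius(21.5))
println(shown)
```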

pistacliffcho · October 7, 2019, 12:42am

Not to be a wet blanket, but given the R packages parallel and foreach at the native R level and RcppParallel at the Rcpp level, I have a little trouble giving R only one star for parallel support.

I will say that it is a bit easier in Julia. But truth be told, I found RcppParallel to more efficiently use multiple threads than Julia on the same machine with the same algorithm. I have no insight into the why of that and user error is a very real possibility.

Of course, the fact that you have to think about two different languages for parallelization in R instead of one points directly to the two-language problem that Julia answers.

Glen · October 7, 2019, 2:32am

Hi Kolia,

Good to hear you figured it out. I like that you explained why Julia’s sum is so much more complicated because when I looked at it the first time I was totally confused and it didn’t give me the impression that Julia is easy to use.

Glen
