Performance benefits of multiple dispatch

OK, so I thought it’d be a great idea to demonstrate to students on their very first lesson the performance benefits of Julia’s Top Notion: multiple dispatch. So I wrote the program shown below. Only problem is, the results (even after calling several times to bypass compilation) seem to indicate that the code with dispatch runs about four times slower than the code that substitutes a conditional for the dispatching. What am I doing wrong?! :dizzy_face: :astonished:

Thanks,
Niall.

abstract type Species end

struct Elephant <: Species end
struct Wolf <: Species end
struct Deer <: Species end
struct Bush <: Species end
struct GroundPlant <: Species end

function interactnodispatch( a::Species, b::Species)
	if a isa Elephant && b isa Deer ||
		a isa Elephant && b isa Deer
		"Competition!"
	elseif a isa Elephant && b isa Bush ||
		a isa Bush && b isa Elephant
		"Eating!"
	elseif a isa Elephant && b isa GroundPlant ||
		a isa GroundPlant && b isa Elephant
		"Trample!"
	elseif a isa Deer && b isa GroundPlant ||
		a isa GroundPlant && b isa Deer
		"Eating"
	elseif a isa Wolf && b isa Deer ||
		a isa Deer && b isa Wolf
		"Eating!"
	elseif a isa Bush && b isa GroundPlant ||
		a isa GroundPlant && b isa Bush
		"Competition!"
	else
		"No interaction!"
	end
end

interactdispatch( a::Species, b::Species) = "No interaction!"
interactdispatch( a::Elephant, b::Deer) = "Competition!"
interactdispatch( a::Deer, b::Elephant) = "Competition!"
interactdispatch( a::Elephant, b::Bush) = "Eating!"
interactdispatch( a::Elephant, b::GroundPlant) = "Trample!"
interactdispatch( a::GroundPlant, b::Elephant) = "Trample!"
interactdispatch( a::Deer, b::GroundPlant) = "Eating!"
interactdispatch( a::GroundPlant, b::Deer) = "Eating!"
interactdispatch( a::Wolf, b::Deer) = "Eating!"
interactdispatch( a::Deer, b::Wolf) = "Eating!"
interactdispatch( a::Bush, b::GroundPlant) = "Competition!"
interactdispatch( a::GroundPlant, b::Bush) = "Competition!"

function testnodispatch(species::Vector,n::Int)
	testtime = 0.0
	for _ in 1:n
		testtime += @elapsed interactnodispatch(rand(species),rand(species))
	end
	
	testtime
end

function testdispatch(species::Vector,n::Int)
	testtime = 0.0
	for _ in 1:n
		testtime += @elapsed interactdispatch(rand(species),rand(species))
	end
	
	testtime
end

function test(n::Int=10000000)
	species = [Elephant(),Deer(),Wolf(),Bush(),GroundPlant()]
	print("Test without dispatch: ")
	println( testnodispatch(species,n))
	print("Test with dispatch: ")
	println( testdispatch(species,n))
end

Try doing the benchmarks with BenchmarkTools.jl.

I would, btw, expect both functions to have the same performance. The compiler will compile specialized methods for each set of inputs, removing the unused branches, meaning that in the compiled code there will be no branches or conditions.

Oh, I just realized that you have a vector of different types, so the compiler cannot remove the branches.

In this case, you get dynamic dispatch, which is slow. Conditionals should be faster in this case.

4 Likes

I think the key to this problem is the fact that you’re calling the function with a randomized type. When the function testdispatch is called, the compiler does not actually know what species type interactdispatch will be called with (since it could be any kind). This means the compiler cannot actually use dispatch at compile time and thus not do any performance optimization.
I think the runtime is maybe even worse because in this case, β€œspecies” is an abstract type, so the compiler perhaps does’nt even know that it has to be either one of the species in your list, as is done in the explicit β€œif else” consideration (not 100 percent sure about this).
The overall problem can be seen when calling

julia> @code_warntype testdispatch([Elephant(),Deer(),Wolf(),Bush(),GroundPlant()],10000)

In my editor the lines

β”‚   %15 = Main.rand(species)::Species
β”‚   %16 = Main.rand(species)::Species

are marked red, i.e. the compiler does not know about the concrete type of species at compile time.
From the top of my head I cannot think of a good solution that is faithful to the original problem (i.e. random types at runtime). Maybe there is a smart way to experiment further with function barriers.
Edit: ah, seems like @DNF beat me to it :smiley:

4 Likes

The main performance benefits in Julia occur when all types are fully inferred, and methods are specialized. With concrete types you get:

1.7.0> @btime interactdispatch(Elephant(), Deer())
  1.400 ns (0 allocations: 0 bytes)
"Competition!"

1.7.0> @btime interactnodispatch(Elephant(), Deer())
  1.400 ns (0 allocations: 0 bytes)
"Competition!"

As you see, fast and the same performance. When you randomize the type, so the compiler cannot predict it, you lose that performance, and the dispatch version is even worse:

1.7.0> @btime interactnodispatch(rand($species), rand($species))
  26.888 ns (0 allocations: 0 bytes)
"No interaction!"

1.7.0> @btime interactdispatch(rand($species), rand($species))
  110.671 ns (0 allocations: 0 bytes)
"No interaction!"
3 Likes

OK, thanks both of you. On the basis of your comments I had just got it to the stage where the execution times are equal (by specifying types). However that still leaves me with the issue of how to demonstrate performance benefits of multiple dispatch. I really need a case where multiple dispatch comes out on top over intra-code conditionals. Anyone have any ideas?

In cases where the types are known at compile time, there isn’t necessarily any performance benefits. It is rather a question of improved expressivity.

You could probably get the same performance just by writing conditionals, but over time that approach becomes untenable, while multiple dispatch with external methods allows you to easily extend and specialize your code for new types.

One could perhaps say that multiple dispatch makes your code clean and extensible, while type specialization makes it fast. Multiple dispatch makes it easy to reap the performance benefits provided by type specialization.

9 Likes

You can use tuples, then the dispatch should happen at compile time. Something like this:

species_pairs = (
    Elephant() => Wolf(),
    Deer() => Bush(),
    ...
)
map(species_pairs) do p
    inrearctdispatch(p[1], p[2])
end
1 Like

Now that’s an interesting idea - I’ll try that, but first I need to go shopping for the week, so I’ll have to get back to you. :grinning:

it’s very hard because Julia has the optimization which turns your β€œconditional” code into multiple-dispatch methods

2 Likes

Well, the tuple idea worked well - the types are now concrete and so I get similar runtimes for with and without dispatch over 1e7 iterations. However I’m intrigued by the fact that the time with dispatch is now consistently around 1% faster using conditional rather than dispatch. I’m wondering whether this is @jling 's point - that conditionals are optimised away.

But all of this means my wonderful idea of making first-year students wake up and take a look at multiple dispatch isn’t working at all. For first-time programmers, faster execution is sexy, but code readability (unfortunately) isn’t. I had hoped to hook them on expressibility by coupling it to performance.

Ah well - thanks so much for the ideas.

Best wishes,
Niall.

2 Likes

You can block constant propagation:

julia> @benchmark interactnodispatch($(Ref(Elephant()))[], $(Ref(Deer()))[])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.643 ns … 90.001 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.665 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.850 ns Β±  1.165 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–„ β–‡β–„                                ▁▁▁▁ ▁ ▁ ▁ ▁▁▁       β–‚
  β–ˆβ–†β–ˆβ–ƒβ–ˆβ–ˆβ–β–„β–„β–β–ƒβ–„β–ƒβ–β–ƒβ–„β–„β–ƒβ–ƒβ–„β–„β–ƒβ–†β–„β–†β–†β–‡β–†β–‡β–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡ β–ˆ
  1.64 ns      Histogram: log(frequency) by time     2.88 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark interactdispatch($(Ref(Elephant()))[], $(Ref(Deer()))[])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.632 ns … 14.171 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.691 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.747 ns Β±  0.777 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–‚                     β–ƒβ–ˆ  β–‚                            
  β–‚β–β–β–β–‚β–ˆβ–ƒβ–‚β–‡β–‚β–β–‚β–‚β–ƒβ–‚β–β–β–β–β–β–β–β–β–β–β–β–„β–ˆβ–ˆβ–‚β–…β–ˆβ–ˆβ–ƒβ–‚β–β–β–β–‚β–‚β–β–β–‚β–‚β–β–β–β–β–ƒβ–ƒβ–‚β–ƒβ–ƒβ–β–‚β–ƒβ–β–ƒ β–ƒ
  1.63 ns        Histogram: frequency by time        1.75 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And the 1% difference you see may well be within noise.

Uhm… let them take a stab then at this:

sexy fast code
# Small assembly program that calculates & prints the Fibonacci series

maxTerms    equ 15000	    ; number of terms of the series to calculate
digits	    equ (maxTerms*209+1673)/1000	
cntDigits   equ 6	        ; number of digits for counter
org         100h            ; this is a DOS com file

main:	

	mov	ax,'00'		; initialize to all ASCII zeroes
	mov	di,counter		; including the counter
	mov	cx,digits+cntDigits/2	; two bytes at a time
	cld			; initialize from low to high memory
	rep	stosw		; write the data
	inc	ax		; make sure ASCII zero is in al
	mov	[num1 + digits - 1],al ; last digit is one
	mov	[num2 + digits - 1],al ; 
	mov	[counter + cntDigits - 1],al

	jmp	.bottom		; done with initialization, so begin

.top
	; add num1 to num2
	mov	di,num1+digits-1
	mov	si,num2+digits-1
	mov	cx,digits	; 
	call	AddNumbers	; num2 += num1
	mov	bp,num2		;
	call	PrintLine	;
	dec	dword [term]	; decrement loop counter
	jz	.done		;

	; add num2 to num1
	mov	di,num2+digits-1
	mov	si,num1+digits-1
	mov	cx,digits	;
	call	AddNumbers	; num1 += num2
.bottom
	mov	bp,num1		;
	call	PrintLine	;
	dec	dword [term]	; decrement loop counter
	jnz	.top		;
.done
	call	CRLF		; finish off with CRLF
	mov	ax,4c00h	; terminate
	int	21h		;

PrintLine:
	mov	dx,eol		; print combined CRLF and msg1
	mov	cx,msg1len+eollen   ; 
	call	PrintString	;

	mov	di,counter	; print counter
	mov	cx,cntDigits	;
	call	PrintNumericString

	call	IncrementCount	; also increment the counter

	mov	dx,msg2		; print msg2
	mov	cx,msg2len	;
	call	PrintString	;
	
	mov	di,bp		; recall address of number
	mov	cx,digits	;
	; deliberately fall through to PrintNumericString

PrintNumericString:
	; first scan for the first non-zero byte
	mov	al,'0'		; look for ASCII zero
	cld			; scan from MSD to LSD
	repe	scasb		;
	mov	dx,di		; points to one byte after
	dec	dx		; back up one character
	inc	cx		;
	; deliberately fall through to PrintString

PrintString:
	mov	bx, 1		; write to stdout
	mov     ah, 040h        ; write to file handle
	int	21h		; ignore return value
	ret			;

AddNumbers:
	std			; go from LSB to MSB
	clc			;
	pushf			; save carry flag
.top
	mov	ax,0f0fh	; convert from ASCII BCD to BCD
	and  	al,[si]		; get next digit of number2 in al
	and	ah,[di]		; get next digit of number1 in ah
	popf			; recall carry flag
	adc	al,ah		; add these digits
	aaa			; convert to BCD
	pushf			;
	add	al,'0'		; convert back to ASCII BCD digit
	stosb			; save it and increment both counters
	dec	si		;
	loop	.top		; keep going until we've got them all
	popf			; recall carry flag
	ret			;

IncrementCount:
	mov	cx,cntDigits	;
	mov	di,counter+cntDigits-1
	std			; go from LSB to MSB
	stc			; this is our increment
	pushf			; save carry flag
.top
	mov	ax,000fh	; convert from ASCII BCD to BCD
	and	al,[di]		; get next digit of counter in al
	popf			; recall carry flag
	adc	al,ah		; add these digits
	aaa			; convert to BCD
	pushf			;
	add	al,'0'		; convert back to ASCII BCD digit
	stosb			; save and increment counter
	loop	.top		;
	popf			; recall carry flag
	ret			;
	
CRLF:	mov	dx,eol		;
	mov	cx,eollen	;
	jmp	PrintString	;

eol	db  13,10		; DOS-style end of line
eollen	equ $ - eol

msg1	db  'Fibonacci('	;
msg1len	equ $ - msg1

msg2	db  '): '		;
msg2len	equ $ - msg2

counter:			;
num1 equ counter+cntDigits	;
num2 equ num1+digits		;

# Disclaimer: no clue if this program actually runs...
1 Like

3 Likes

I blocked constant propagation in various different ways, and interactdispatch() runs consistently just slightly slower than interactnodispatch(). The difference isn’t much - a bit under 1% - but it is so consistent that I believe it is real. Interesting, but not helpful for my current issue, which is to sensitise complete novice programmers for multiple dispatch. I think, rather than focusing on performance, I will instead try to create an simple example that illustrates code extensibility.

Thanks very much for all the information - I’ve learned a lot. :smiley:

2 Likes

Performance was never really the selling point of multiple dispatch if you ask me, I don’t know where you got that from but I have seen a few times some weird performance comparisons. As you have learned from this thread, the compiler is a beast, so it’s not straightforward to understand what’s happening, especially for newcomers :wink: So I’d rather not start with this feature for newcomers, they’ll be more confused than fascinated.

Maybe you can outline the --in my opinion-- best features of multiple dispatch: code reuse, readability and extensibility.
If you want to get some inspiration, Stefan’s talk on JuliaCon 2019 titled β€œThe Unreasonable Effectiveness of Multiple Dispatch” is a great example.

7 Likes

Now that was a really useful tip - thanks. He distinguishes nicely between reuse of functions across types and reuse of types across functions. I think I can illustrate that simply with code that will be accessible to beginners. I’ll go for it.

Regarding performance benefits of multiple dispatch, I think I just kind of assumed that the dispatcher would manage it more efficiently than my clumsy conditionals, but clearly I was wrong. Ah well - learned something new!

One final point: I’m still unclear why using tuples makes things quicker. map really does seem to work phenomenally quicker on tuples than on vectors. Why? Have I understood @aplavin correctly that it forces compile-time dispatching? So why is that not the case for vectors?

I think the most important point is the reuse across libraries. I often show new-comers how you can for example easily extend the rand() function with your own type (c.f. Random Numbers Β· The Julia Language).

I’ll try to explain it a little bit simplified: the element types of a tuple are defined upon creation. Although eltype() returns Any for a tuple with mixed types and no other common supertype, it still has full information of the types it contains during its lifetime.

In contrast to an Array{Any}, which you get when you mix types (with Any being the only common supertype). Julia cannot know what you potentially throw into that container later on.

So in case of a tuple, the compiler knows each elements for sure until the object gets garbage collected. In case of an Array{Any}, the types even of existing elements might change any time. If you have an Array{SomeAbstractType}, the situation is similar. Julia will only know that objects in that array will inherit from SomeAbstractType, but the actual leaf-type is only known at runtime.

julia> t = (23, "hello")
(23, "hello")

julia> typeof(t)
Tuple{Int64, String}

julia> v = [23, "foo"]
2-element Vector{Any}:
 23
   "foo"

julia> typeof(v)
Vector{Any} (alias for Array{Any, 1})

julia> eltype(t)  # gives the only type which is a supertype of all elements
Any

julia> eltype(v)
Any

julia> ht = (1, 2, 3)  # "homogeneous tuple"
(1, 2, 3)

julia> typeof(ht)
Tuple{Int64, Int64, Int64}

julia> eltype(ht)
Int64

julia> aht = (1.2, 1)  # choosing some "Real" numbers
(1.2, 1)

julia> typeof(aht)
Tuple{Float64, Int64}

julia> eltype(aht)  # gives "Real" as a supertype for all elements 
Real

julia> av = [1, 1.0]  # notice that for arrays, Julia might promote your literals automagically ;)
2-element Vector{Float64}:
 1.0
 1.0

I hope this helps.

3 Likes

There is a great demo of multiple dispatch for Julia neophytes in the notebook here.

3 Likes