Performance of functions returned from container


#1

I’m in the process of writing an NES emulator. Many of the steps of this are surprisingly performance critical, at least if you do a full simulation of the CPU, which is of course the easy way (on a semi-modern, cheap laptop CPU you only have about 1000 x86 cycles per 6502 cycle, and a lot of stuff has to happen in those 1000 cycles!). Anyway, my plan was to do this just as I would write any Julia program, assuming there wouldn’t be big performance issues resulting from multiple dispatch, generics and the like. For the most part this has gone impressively well… until the very last step.

One of the critical things that needs to happen is that instructions (in the case of an NES these are UInt8s) need to get looked up. The only reasonable way I can think of doing this is to have a Vector or Dict of function references. However, the performance I’m getting from such functions is disappointing (likely prohibitively so).

See the following:

using BenchmarkTools

const dict = Vector{Function}(4)
# const dict = Dict{Int,Function}();  sizehint!(dict, 4)

f1(x::Int) = 1 + x
dict[1] = f1

f2(x::Int) = 1 - x
dict[2] = f2

f3(x::Int) = 1.0/x
dict[3] = f3

f4(x::Int) = x << 1
dict[4] = f4

const cdict = Int[1, 2, 3, 4]
# const cdict = Dict{Int,Int}(i=>i for i ∈ 1:4)

ref = @benchmarkable begin
    a1, a2 = cdict[1], cdict[2]
    f1(a1), f2(a2)
end

test = @benchmarkable begin
    dict[1](1), dict[2](2)
end

And the results:

julia> run(ref)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.000 ns (0.00% GC)
  median time:      25.000 ns (0.00% GC)
  mean time:        25.082 ns (0.00% GC)
  maximum time:     347.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(test)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     218.000 ns (0.00% GC)
  median time:      231.000 ns (0.00% GC)
  mean time:        558.054 ns (0.00% GC)
  maximum time:     3.230 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

On the bright side the lookups themselves are extremely fast (a Dict is not much worse than a Vector). I realize that ref isn’t quite equivalent to test, but I’d think they should be damn close, so why do I get 10 times worse performance? I realize there is likely to be some overhead from the compiler having to jump around between these functions, but I was not expecting a factor of 10. This may render my little emulator project rather impossible.

So, some questions to those in-the-know:

  • Why?
  • Is there any way around this?
  • Is this a limitation of Julia or is there some fundamental reason why this is happening, so that I’d get similarly bad performance when doing something like this in any language?
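In case it helps diagnose: inference doesn’t seem to be able to see anything through the lookup. A minimal sketch (toy functions of my own, not the emulator code):

```julia
# hypothetical minimal reproduction of the lookup pattern
lookup(d, x) = d[1](x)

fs = Function[y -> y + 1, y -> 1.0 / y]

# The compiler only knows that d[1] isa Function, so the result of the
# call is inferred as Any (and presumably has to be boxed every call):
Base.return_types(lookup, Tuple{Vector{Function},Int})
```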

Thanks ahead of time!


#3
import FunctionWrappers: FunctionWrapper

using BenchmarkTools
const F64IntFunc = FunctionWrapper{Float64,Tuple{Int}}

const dict = Vector{F64IntFunc}(4);

const cdict = Int[1, 2, 3, 4]

f1(x::Int) = 1 + x
dict[1] = F64IntFunc(f1);

f2(x::Int) = 1 - x
dict[2] = F64IntFunc(f2);

f3(x::Int) = 1.0/x
dict[3] = F64IntFunc(f3);

f4(x::Int) = x << 1
dict[4] = F64IntFunc(f4);


ref = @benchmarkable begin
    a1, a2 = cdict[1], cdict[2]
    f1(a1), f2(a2)
end

test = @benchmarkable begin
    dict[1](1), dict[2](2)
end

run(ref)
run(test)
julia> run(ref)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.000 ns (0.00% GC)
  median time:      30.000 ns (0.00% GC)
  mean time:        30.841 ns (0.00% GC)
  maximum time:     11.913 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(test)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.000 ns (0.00% GC)
  median time:      37.000 ns (0.00% GC)
  mean time:        40.765 ns (0.00% GC)
  maximum time:     11.749 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Ran on 0.5. There seem to be some ambiguity problems on 0.6, but those should be easy to fix in the package.


#4

Awesome! (and a big relief, that problem really kind of burst my bubble, so to speak)

It seems to me like the issue has to do with the function reference not knowing the functions’ argument or return types. Perhaps it takes a lot of time to look up that data.

And a big +1 to putting FunctionWrappers in Base.


#5

You can put whatever functions you want in your original vector. In the FunctionWrapper vector you are only allowed to put functions that accept an Int and return a Float64. Therefore the argument and return types are known and the values don’t need to be boxed. It also means the functions can’t get inlined, but that wouldn’t happen here anyway.
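A minimal sketch of why the original vector forces dynamic dispatch (toy functions of my own, modern syntax): each Julia function has its own singleton type, so a vector holding different functions falls back to the abstract element type Function.

```julia
f(x::Int) = x + 1
g(x::Int) = x - 1

# Different functions have different concrete types, so a vector
# holding both can only be typed with the abstract supertype Function:
fs = [f, g]
eltype(fs)                  # Function
isconcretetype(eltype(fs))  # false — every call is a dynamic dispatch

# With a single function the element type stays concrete, and calls
# through the vector can be resolved statically:
hs = [f, f]
isconcretetype(eltype(hs))  # true
```

FunctionWrapper{Float64,Tuple{Int}} restores a concrete element type for a heterogeneous collection by fixing the call signature rather than the function’s identity.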


#6

Cool! I hope you’ll open source it. It would be fun for AI research.


#7

Yes, it already is open source but I won’t publicize the repo quite yet :wink:. Indeed, deep reinforcement learning is on my list as one of the fun things to possibly do with it, though I’ve been reading that those nets take absolutely forever to train.

Once I resolve the performance issues the CPU simulation will be pretty much done except for timing. I may take the time to write a simple assembler (in the form of a macro) for testing purposes.

After that I’ll have to emulate the graphics chip and actually render the thing for which I’ll use the various OpenGL bindings here. I expect that to be a pretty big job.


#8

I don’t know about OpenGL, but FWIW I’ve been using Images.jl to manipulate and display NES images and videos, and it works very well. It’s optimized for performance.


#9

This can just be dict[1] = f1; there’s a convert method for this.
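That is, assigning into the typed vector invokes the conversion for you, so the explicit wrapper construction can be dropped. A minimal modern-syntax sketch (assuming FunctionWrappers.jl is installed):

```julia
import FunctionWrappers: FunctionWrapper

const F64IntFunc = FunctionWrapper{Float64,Tuple{Int}}

f1(x::Int) = 1 + x

dict = Vector{F64IntFunc}(undef, 1)

# setindex! calls convert(F64IntFunc, f1) automatically; the wrapper
# also converts the Int result to the declared Float64 return type:
dict[1] = f1
dict[1](1)  # 2.0
```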


#10

Thanks again so much everyone! It now takes me about the same amount of time to execute between 2 and 6 instructions as it used to take me to retrieve a single function from a vector (without even executing it)! I suspect I’m already within a factor of 2 of the theoretical maximum.