Performance of functions returned from container


#1

I’m in the process of writing an NES emulator. Many of the steps of this are surprisingly performance critical, at least if you do a full simulation of the CPU, which is of course the easy way (on a semi-modern, cheap laptop CPU you only have about 1000 x86 cycles per 6502 cycle, and a lot of stuff has to happen in those 1000 cycles!). Anyway, my plan was to do this just as I would write any Julia program, assuming there wouldn’t be big performance issues resulting from multiple dispatch, generics and the like. For the most part this has gone impressively well… until the very last step.

One of the critical things that needs to happen is that instructions (in the case of an NES these are UInt8s) need to get looked up. The only reasonable way I can think of doing this is to have a Vector or Dict of function references. However, the performance I’m getting from such functions is disappointing (likely prohibitively so).

See the following:

using BenchmarkTools

const dict = Vector{Function}(4)
# const dict = Dict{Int,Function}();  sizehint!(dict, 4)

f1(x::Int) = 1 + x
dict[1] = f1

f2(x::Int) = 1 - x
dict[2] = f2

f3(x::Int) = 1.0/x
dict[3] = f3

f4(x::Int) = x << 1
dict[4] = f4

const cdict = Int[1, 2, 3, 4]
# const cdict = Dict{Int,Int}(i=>i for i ∈ 1:4)

ref = @benchmarkable begin
    a1, a2 = cdict[1], cdict[2]
    f1(a1), f2(a2)
end

test = @benchmarkable begin
    dict[1](1), dict[2](2)
end

And the results:

julia> run(ref)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.000 ns (0.00% GC)
  median time:      25.000 ns (0.00% GC)
  mean time:        25.082 ns (0.00% GC)
  maximum time:     347.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(test)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     218.000 ns (0.00% GC)
  median time:      231.000 ns (0.00% GC)
  mean time:        558.054 ns (0.00% GC)
  maximum time:     3.230 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

On the bright side the lookups themselves are extremely fast (a Dict is not much worse than a Vector). I realize that ref isn’t quite equivalent to test, but I’d think they should be damn close, so why do I get 10 times worse performance? I realize there is likely to be some overhead from the compiler having to jump around between these functions, but I was not expecting a factor of 10. This may render my little emulator project rather impossible.

So, some questions to those in-the-know:

  • Why?
  • Is there any way around this?
  • Is this a limitation of Julia or is there some fundamental reason why this is happening, so that I’d get similarly bad performance when doing something like this in any language?
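In case it helps diagnose: inference doesn’t seem to be able to see anything through the lookup. A minimal sketch (toy functions of my own, not the emulator code):

```julia
# hypothetical minimal reproduction of the lookup pattern
lookup(d, x) = d[1](x)

fs = Function[y -> y + 1, y -> 1.0 / y]

# The compiler only knows that d[1] isa Function, so the result of the
# call is inferred as Any (and presumably has to be boxed every call):
Base.return_types(lookup, Tuple{Vector{Function},Int})
```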

Thanks ahead of time!


#3
import FunctionWrappers: FunctionWrapper

using BenchmarkTools
const F64IntFunc = FunctionWrapper{Float64,Tuple{Int}}

const dict = Vector{F64IntFunc}(4);

const cdict = Int[1, 2, 3, 4]

f1(x::Int) = 1 + x
dict[1] = F64IntFunc(f1);

f2(x::Int) = 1 - x
dict[2] = F64IntFunc(f2);

f3(x::Int) = 1.0/x
dict[3] = F64IntFunc(f3);

f4(x::Int) = x << 1
dict[4] = F64IntFunc(f4);


ref = @benchmarkable begin
    a1, a2 = cdict[1], cdict[2]
    f1(a1), f2(a2)
end

test = @benchmarkable begin
    dict[1](1), dict[2](2)
end

run(ref)
run(test)
julia> run(ref)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.000 ns (0.00% GC)
  median time:      30.000 ns (0.00% GC)
  mean time:        30.841 ns (0.00% GC)
  maximum time:     11.913 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> run(test)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.000 ns (0.00% GC)
  median time:      37.000 ns (0.00% GC)
  mean time:        40.765 ns (0.00% GC)
  maximum time:     11.749 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Ran on 0.5. There seem to be some ambiguity problems on 0.6, but those should be easy to fix in the package.


#4

Awesome! (and a big relief, that problem really kind of burst my bubble, so to speak)

It seems to me like the issue has to do with the function reference not knowing the functions’ argument or return types. Perhaps it takes a lot of time to look up that data.

And a big +1 to putting FunctionWrappers in Base.


#5

You can put whatever functions you want in your original vector. In the FunctionWrapper vector you are only allowed to put functions that accept an Int and return a Float64. Therefore the argument and return types are known and the values don’t need to be boxed. It also means the functions can’t get inlined, but that wouldn’t happen here anyway.
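A minimal sketch of why the original vector forces dynamic dispatch (toy functions of my own, modern syntax): each Julia function has its own singleton type, so a vector holding different functions falls back to the abstract element type Function.

```julia
f(x::Int) = x + 1
g(x::Int) = x - 1

# Different functions have different concrete types, so a vector
# holding both can only be typed with the abstract supertype Function:
fs = [f, g]
eltype(fs)                  # Function
isconcretetype(eltype(fs))  # false — every call is a dynamic dispatch

# With a single function the element type stays concrete, and calls
# through the vector can be resolved statically:
hs = [f, f]
isconcretetype(eltype(hs))  # true
```

FunctionWrapper{Float64,Tuple{Int}} restores a concrete element type for a heterogeneous collection by fixing the call signature rather than the function’s identity.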


#6

Cool! I hope you’ll open source it. It would be fun for AI research.


#7

Yes, it already is open source but I won’t publicize the repo quite yet :wink:. Indeed, deep reinforcement learning is on my list as one of the fun things to possibly do with it, though I’ve been reading that those nets take absolutely forever to train.

Once I resolve the performance issues the CPU simulation will be pretty much done except for timing. I may take the time to write a simple assembler (in the form of a macro) for testing purposes.

After that I’ll have to emulate the graphics chip and actually render the thing for which I’ll use the various OpenGL bindings here. I expect that to be a pretty big job.


#8

I don’t know about OpenGL, but FWIW I’ve been using Images.jl to manipulate and display NES images and videos, and it works very well. It’s optimized for performance.


#9

This can just be dict[1] = f1; there’s a convert method for this.
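That is, assigning into the typed vector invokes the conversion for you, so the explicit wrapper construction can be dropped. A minimal modern-syntax sketch (assuming FunctionWrappers.jl is installed):

```julia
import FunctionWrappers: FunctionWrapper

const F64IntFunc = FunctionWrapper{Float64,Tuple{Int}}

f1(x::Int) = 1 + x

dict = Vector{F64IntFunc}(undef, 1)

# setindex! calls convert(F64IntFunc, f1) automatically; the wrapper
# also converts the Int result to the declared Float64 return type:
dict[1] = f1
dict[1](1)  # 2.0
```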


#10

Thanks again so much everyone! It now takes me about the same amount of time to execute between 2 and 6 instructions as it used to take me to retrieve a single function from a vector (without even executing it)! I suspect I’m already within a factor of 2 of the theoretical maximum.