I’m in the process of writing an NES emulator. Many of the steps of this are surprisingly performance critical, at least if you do a full simulation of the CPU, which is of course the easy way (on a semi-modern, cheap laptop CPU you only have about 1000 cycles of x86 to do 1 cycle of 6502, a lot of stuff has to happen in those 1000 cycles!). Anyway, my plan was to do this just as I would write any Julia program, assuming there wouldn’t be big performance issues resulting from multiple-dispatch, generics and the like. For the most part this has gone impressively well… until the very last step.
One of the critical things that needs to happen is that instructions (in the case of an NES these are UInt8
s) need to get looked up. The only reasonable way I can think of doing this is to have a Vector
or Dict
of function references. However, the performance I’m getting from such functions is disappointing (likely prohibitively so).
See the following:
using BenchmarkTools
const dict = Vector{Function}(4)
# const dict = Dict{Int,Function}(); sizehint!(dict, 4)
f1(x::Int) = 1 + x
dict[1] = f1
f2(x::Int) = 1 - x
dict[2] = f2
f3(x::Int) = 1.0/x
dict[3] = f3
f4(x::Int) = x << 1
dict[4] = f4
const cdict = Int[1, 2, 3, 4]
# const cdict = Dict{Int,Int}(i=>i for i ∈ 1:4)
ref = @benchmarkable begin
a1, a2 = cdict[1], cdict[2]
f1(a1), f2(a2)
end
test = @benchmarkable begin
dict[1](1), dict[2](2)
end
And the results:
julia> run(ref)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.000 ns (0.00% GC)
median time: 25.000 ns (0.00% GC)
mean time: 25.082 ns (0.00% GC)
maximum time: 347.000 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> run(test)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 218.000 ns (0.00% GC)
median time: 231.000 ns (0.00% GC)
mean time: 558.054 ns (0.00% GC)
maximum time: 3.230 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
On the bright side the Dict
lookups are extremely fast (it’s not much worse than using Vector
s). I realize that ref
isn’t quite equivalent to test
, but I’d think they should be damn close, so why do I get 10 times worse performance? I realize there is likely to be some overhead in the compiler having to skip around between these functions, but I was not expecting it to be a factor of 10. This may render my little emulator project rather impossible.
So, some questions to those in-the-know:
- Why?
- Is there any way around this?
- Is this a limitation of Julia or is there some fundamental reason why this is happening, so that I’d get similarly bad performance when doing something like this in any language?
Thanks ahead of time!