Relative performance discrepancy across CPUs even with --cpu-target set

I’ve been trying to optimize the benchmarks game’s julia programs again, and I’m running into a problem I don’t really know how to work around.

The gist is that between two implementations, one is significantly faster on the benchmarks game's CPU from 2007, and one is significantly faster on my more modern laptop processor. The discrepancy persists even when I set `--cpu-target=core2` so that AVX instructions aren't used.

Any thoughts?

Enough time has elapsed that many small aspects of the design changed: the organization and optimization of cache levels, speculative execution, memory technology, threading, and the interaction between active CPU sub-blocks. Even small changes inside our computers compound.

You are likely better off going with the implementation that runs fastest on your current laptop. You could also post both and ask if others would take a look.


My two cents:

  • The old CPU has way more cache
  • If your code uses multiple cores, that can actually hurt performance on the newer one, since it is a dual-core with two threads per core. If two threads need the same hardware resources, they sometimes block each other

Also interesting is this comparison https://cpu.userbenchmark.com/Compare/Intel-Core2-Quad-Q6600-vs-Intel-Core-i5-5300U/1980vsm16790 where the younger CPU absolutely dominates, except for integer performance, where it is the other way around (see under "nice to have").


So I guess the real question I was trying to get at was: how do I optimize a Julia program for a processor I don't have access to, one with very different performance characteristics from the ones I do have? I'm not sure there's actually any good answer to that. Both implementations are in the linked thread in the OP, but they are far too much code to post inline on the forums.

The benchmark in question is pure single-core, double-precision floating-point arithmetic. The only parallelism feasible in the benchmark is SIMD.

If anyone is curious, here are the portions of each script where the majority of the time is spent. Each of these functions is run on a vector of five `Body`s in a loop something like:

@inbounds for i in 1:length(bodies), j in i+1:length(bodies)
    advance_velocity!(bodies[i], bodies[j], 0.01)
end

Faster on the old CPU

This was slightly modified to extract the contents of an inner loop into a separate function, but doing so makes no performance difference on the new CPU.

mutable struct Body
    x::Float64
    y::Float64
    z::Float64
    vx::Float64
    vy::Float64
    vz::Float64
    mass::Float64
end

function advance_velocity!(bi::Body, bj::Body, dt::Float64)
    dx = bi.x - bj.x
    dy = bi.y - bj.y
    dz = bi.z - bj.z
    dsq = dx^2 + dy^2 + dz^2
    distance = sqrt(dsq)
    mag = dt / (dsq * distance)

    bi.vx -= dx * bj.mass * mag
    bi.vy -= dy * bj.mass * mag
    bi.vz -= dz * bj.mass * mag

    bj.vx += dx * bi.mass * mag
    bj.vy += dy * bi.mass * mag
    bj.vz += dz * bi.mass * mag
end
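For context, the timing harness looks roughly like this (a sketch, assuming the `Body` struct and `advance_velocity!` above; `advance_all!`, the random initial conditions, and the use of BenchmarkTools are just illustrative, not the exact benchmarks game driver):

```julia
using BenchmarkTools  # assumes the package is installed

# Wrap the pairwise loop from the OP in a function so it can be timed cleanly.
function advance_all!(bodies::Vector{Body}, dt::Float64)
    @inbounds for i in 1:length(bodies), j in i+1:length(bodies)
        advance_velocity!(bodies[i], bodies[j], dt)
    end
end

# Five bodies with arbitrary positions, zero velocity, unit mass.
bodies = [Body(rand(), rand(), rand(), 0.0, 0.0, 0.0, 1.0) for _ in 1:5]
@btime advance_all!($bodies, 0.01)
```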

Faster on the new CPU

WARNING: this code commits type-piracy on tuples and should not be run in a REPL session you would like to preserve.

# 4 floats in tuple instead of 3 generates better SIMD instructions
const V3d = NTuple{4,Float64}
V3d(x=0.0, y=0.0, z=0.0) = (Float64(x), Float64(y), Float64(z), 0.0)

Base.sum(v::V3d) = @inbounds +(v[1], v[2], v[3])

struct Body
    pos::V3d
    v::V3d
    m::Float64
end

Base.@propagate_inbounds function update_velocity(b1, b2, Δt)
    Δpos = b1.pos .- b2.pos
    d² = sum(Δpos .* Δpos)
    mag = Δt / (d² * √d²)

    (Body(b1.pos, muladd.(-b2.m * mag, Δpos, b1.v), b1.m),
     Body(b2.pos, muladd.(b1.m * mag, Δpos, b2.v), b2.m))
end
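Since `update_velocity` returns two fresh immutable `Body`s rather than mutating, the driver writes them back into the vector. A sketch (assuming the definitions above; `advance_all` is an illustrative name):

```julia
# Illustrative driver for the immutable/tuple version above.
function advance_all(bodies::Vector{Body}, Δt)
    @inbounds for i in 1:length(bodies), j in i+1:length(bodies)
        # update_velocity returns a (Body, Body) tuple; store both back.
        bodies[i], bodies[j] = update_velocity(bodies[i], bodies[j], Δt)
    end
    bodies
end
```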

And for the record, on the new CPU, switching between a mutable `Body` and an immutable `Body` makes no appreciable difference in timing.

I think this is pretty much impossible, at least at the level you're aiming for, meaning it looks like you are trying to optimize a single operation to run as fast as possible. If you were optimizing a large application, you would profile the whole thing, find the bottlenecks, improve them, rinse and repeat… and in the end, if it runs faster on "your" CPU, it should run at least somewhat better on other CPUs too.

But a tight, repeated calculation like this is going to depend (as others have said) on CPU cache and internal CPU architecture. The best you can probably do is look at the generated machine code and ensure it uses the instructions you expect, with minimal branching.
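In Julia that kind of inspection is built in; for any call with concrete argument types you can dump what the compiler emits (a sketch with a trivial stand-in function):

```julia
# Stand-in function; substitute your own hot function and concrete arguments.
f(x, y) = muladd(x, x, y)

# Native assembly, with debug info suppressed for readability:
@code_native debuginfo=:none f(1.0, 2.0)

# LLVM IR is often easier to scan for vector instructions and branches:
@code_llvm debuginfo=:none f(1.0, 2.0)
```

Checking the output under `--cpu-target=core2` versus the default is a way to confirm which instruction sets are actually being used on each machine.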
