Julia slower than Matlab & Python? No

A recent paper, ML Software & Hardware for Econ, finds Julia is often slower than Matlab & NumPy.
Code for Option Pricing is here.
Code for Dynamic Programming is here.

Usually, even writing the code similarly in Julia leads to an enormous speed-up.
How can we tweak their code for a more accurate comparison?

Update:
Dynamic Programming: Stefan, Tim, and Mason tweaked the code, and Julia is now faster than Matlab/Python/R. Julia is also faster than C++ on all but the two smallest grids.

Option Pricing: if you replace the awkward vectorized parts of the code with simpler, more intuitive loops, the Julia code easily outperforms the others (except TF/PyTorch, which I don’t have set up right now).

In the paper they also run PyTorch & TensorFlow on GPU.
@Mason asks whether anyone can speed this up with Julia on a GPU as well? (I think this would be awesome because Julia is known to be great on GPU…)

4 Likes

Can this be correct?

It’s possible to write arbitrarily slow code in any language if you write the code in a sufficiently perverse way. In fact, it’s much easier to write slow code than fast code.

17 Likes

Yeah, but it looks like they tried to write the code similarly across languages.

Usually, even writing the code similarly in Julia leads to an enormous speed-up.
There must be something wrong. It’d be great to let them know before they publish it.

1 Like

The code, with some comments on my part:

using DelimitedFiles, Statistics

# Load grid for log(y) and transition matrix
logy_grid = readdlm("logy_grid.txt")[:] 
Py = readdlm("P.txt")
# these two variables should be const

function main(nB=351, repeats=500)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)

    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)

    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min.(0.969 * ymean, ygrid)

    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95

    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))

    zero_ind = Int(ceil(nB / 2))

    function u(c, γ) # add the broadcast in the call?
        return c.^(1 - γ) / (1 - γ)
    end


    t0 = time()
    function iterate(V, Vc, Vd, Q)
        EV = Py * V    # array created; it should be preallocated
        EVd = Py * Vd  # array created
        EVc = Py * Vc  # array created

        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
        # calling a[:] allocates a new array
        Vd_target = reshape(Vd_target, (ny, 1))

        Qnext = reshape(Q, (ny, 1, nB))

        c = @. y .- Qnext .* Bnext .+ B  # array created
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
        EV = reshape(EV, (ny, 1, nB))
        m = @. u(c, γ) .+ β * EV  # array created
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))

        Vd_compat = Vd * ones(1, nB)  # array created
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states  # array created
        Q_target = (1. .- default_prob) ./ (1 + r)

        V_target = max.(Vc, Vd_compat)  # array created

        return V_target, Vc_target, Vd_target, Q_target

    end

    iterate(V, Vc, Vd, Q)  # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats

end

print(1000 * main(1351, 10))

In short, a great speedup can be achieved just by changing some calls to in-place versions and preallocating the arrays before the iterate definition.
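For example (a minimal sketch, not benchmarked, with iterate! as a hypothetical name; ny, nB and Py as in the code above), the three matrix products at the top of iterate could write into buffers that are allocated once, before the repeat loop:

using LinearAlgebra  # for mul!

# preallocate once, outside the hot loop
EV  = zeros(ny, nB)
EVd = zeros(ny, 1)
EVc = zeros(ny, nB)

function iterate!(V, Vc, Vd, Q, EV, EVd, EVc)
    mul!(EV,  Py, V)    # in-place product: overwrites EV instead of allocating Py * V
    mul!(EVd, Py, Vd)
    mul!(EVc, Py, Vc)
    # ... the remaining updates can broadcast into preallocated targets with .= in the same way
end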

8 Likes

It looks like this code is mostly matrix and vector operations, which explains how they were able to write a PyTorch version. It’s not clear that Julia should be any better at this, especially since Matlab uses multithreading by default and Julia doesn’t, except for BLAS operations.
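(As a side note, not from the paper: the Julia-level thread count and the BLAS thread count are set separately, so matrix products like Py * V are already multithreaded even when the rest of the Julia code runs on one thread.)

using LinearAlgebra

Threads.nthreads()        # Julia-level threads; 1 unless set, e.g. via JULIA_NUM_THREADS=4
BLAS.set_num_threads(4)   # threads used by BLAS calls such as Py * V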

I see. But does that help explain why Julia is outperformed by regular Python/numpy?

NumPy is basically C: if the overhead of the Python interpreter is minimal, NumPy should be about as fast as C, since each vectorized operation is only roughly one function call slower.

1 Like

A cursory look through the code also suggests that they are using 1 in a bunch of places where 1.0 would be better. I’m not sure how much performance difference it makes here, though.
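For what it’s worth, here is a sketch of what I mean on the u function from the code above (assuming γ is a Float64): the integer literal just gets promoted at run time, so the difference is probably small.

u(c, γ) = c .^ (1 - γ) / (1 - γ)      # 1 - γ promotes the integer literal on each call
u(c, γ) = c .^ (1.0 - γ) / (1.0 - γ)  # equivalent, with the promotion written out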

I don’t care to figure out exactly what they’re trying to do, so I apologize for the half-hearted answer, but a lot of the stuff there smells bad. This one in particular just seems poorly planned:

default_states = float(Vd_compat .> Vc)
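(A sketch of what I mean, using the same variable names: the explicit expansion Vd_compat = Vd * ones(1, nB) isn’t needed in the first place, since broadcasting already handles the (ny, 1) vs (ny, nB) shapes.)

default_states = Float64.(Vd .> Vc)   # broadcasts Vd against Vc directly
V_target = max.(Vc, Vd)               # the same broadcast works for the max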

Also, similar syntax is a pretty bad standard for equating the level of effort put into each implementation. They should seriously consider using the number of lines of code as a better metric. Similar syntax only ensures easy reading and implementation for people who already use that style of syntax.

P.S.
Their R code is also less than spectacular, but why bother optimizing R code?

2 Likes

Y’all are missing the big one, and the first thing to check: the code uses non-constant globals.

https://github.com/vduarte/benchmarkingML/pull/1

More speedup is certainly possible, but this is the big performance killer.
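A minimal sketch of the kind of change this means (the alternative is to pass these into main as arguments):

using DelimitedFiles

const logy_grid = readdlm("logy_grid.txt")[:]
const Py = readdlm("P.txt")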

20 Likes

I actually tried to make them global and saw no significant improvement.

Do you mean const?

Yes, exactly as in your PR:

10975.95670223236   # with const
11052.845978736877  # without const
2 Likes

Strange.

They find similar results for option pricing:
https://github.com/vduarte/benchmarkingML/tree/master/LSMC
I’m not sure if this also applies there.

1 Like

The profiler shows that most of the time is spent on broadcasting:

c = @. y .- Qnext .* Bnext .+ B
and
m = @. u(c, γ) .+ β * EV

Maybe Julia’s broadcasting has more room for improvement?
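For what it’s worth, here is a sketch (not benchmarked) of replacing those two broadcasts, plus the maximum(m, dims=3) that follows them, with a plain loop, so the (ny, nB, nB) temporaries c and m are never built. Variable names are as in the code above; EV here is the (ny, nB) matrix Py * V before the reshape, and Vc_target is preallocated as zeros(ny, nB).

for ib in 1:nB, iy in 1:ny
    vmax = -Inf
    for ibn in 1:nB
        c = ygrid[iy] - Q[iy, ibn] * Bgrid[ibn] + Bgrid[ib]
        c = c <= 0 ? 1e-14 : c                      # same clamp as c[c .<= 0] = 1e-14
        vmax = max(vmax, c^(1 - γ) / (1 - γ) + β * EV[iy, ibn])
    end
    Vc_target[iy, ib] = vmax
end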

1 Like

Why the double broadcasting? Both . and @., I mean.

I tried with and without it. Without @. it took about 10% more time.

You should just use @., then, not both.
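To be clear about what @. does (shown on the two lines quoted above): it adds a dot to every call and operator inside the expression, so the explicit dots are redundant.

c = @. y - Qnext * Bnext + B    # lowers to y .- Qnext .* Bnext .+ B
m = @. u(c, γ) + β * EV         # @. also turns u(c, γ) into the broadcast u.(c, γ)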

There seems to be some performance difference in broadcasting operations between Julia and NumPy.

import numpy as np

C = np.random.rand(256, 1351, 1351)

%%timeit
C**0.3

15.3 s ± 805 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compare to:

const C = rand(256, 1351, 1351)

@time C.^0.3  # second time calling

34.271383 seconds (8 allocations: 3.481 GiB, 0.01% gc time)
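If someone wants to dig into this, one thing worth trying is threading the elementwise loop, since C.^0.3 comes down to one scalar pow call per element. A rough sketch (pow_threaded! is just a made-up helper, and Julia needs to be started with several threads, e.g. JULIA_NUM_THREADS=8, for it to matter):

function pow_threaded!(out, A, p)
    Threads.@threads for i in eachindex(A, out)
        @inbounds out[i] = A[i]^p
    end
    return out
end

out = similar(C)
@time pow_threaded!(out, C, 0.3)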