Julia slower than Matlab & Python? No

Albert_Zevelev · January 8, 2020, 11:57pm

A recent paper ML Software & Hardware for Econ finds Julia is often slower than Matlab & Numpy.
Code for Option Pricing is here.
Code for Dynamic Programming is here.

Usually, writing Julia code similarly leads to enormous speed-up.
How can we tweak their code for a more accurate comparison???

Update:
Dynamic Programming: Stefan, Tim, Mason tweaked the code & Julia is faster than Matlab/Python/R. Julia is faster than C++ in all but the two smallest grids.

Option Pricing: if you replace the awkward vectorized parts of the code, w/ simpler more intuitive loops, the Julia code easily outperforms the others (except TF/PyTorch which I don’t have right now).

In the paper they also run PyTorch & TensorFlow on GPU.
@Mason asks if anyone can speed this up w/ Julia on a GPU as well??? (I think this would be awesome bc Julia is known to be great on GPU…)

Mason · January 9, 2020, 12:03am

Can this be correct???

It’s possible to write arbitrarily slow code in any language if you write the code in a sufficiently preserve way. In fact, it’s much easier to write slow code than fast code.

Albert_Zevelev · January 9, 2020, 12:07am

Yeah but it looks like they tried to write the code similarly across languages.

Usually, writing Julia code similarly leads to enormous speed-up.
There must be something wrong. It’d be great to let them know before they publish it soon.

longemen3000 · January 9, 2020, 12:21am

The code, with some comments of my part:

using DelimitedFiles, Statistics

# Load grid for log(y) and transition matrix
logy_grid = readdlm("logy_grid.txt")[:] 
Py = readdlm("P.txt")
#those 2 variables should be const

function main(nB=351, repeats=500)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)

    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)

    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min(0.969 * ymean, ygrid)

    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95

    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))

    zero_ind = Int(ceil(nB / 2))

    function u(c, γ) #add the broadcast in the call? 
        return c.^(1 - γ) / (1 - γ)
    end


    t0 = time()
    function iterate(V, Vc, Vd, Q)
        EV = Py * V #array created, it should be preallocated
        EVd = Py * Vd#array created
        EVc = Py * Vc#array created

        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
#calling a[:] allocates a new array
        Vd_target = reshape(Vd_target, (ny, 1))

        Qnext = reshape(Q, (ny, 1, nB))

        c = @. y .- Qnext .* Bnext .+ B #array created
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
        EV = reshape(EV, (ny, 1, nB))
        m =  @. u(c, γ) .+ β * EV #array created
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))

        Vd_compat = Vd * ones(1, nB) #array created
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states #array created
        Q_target = (1. .- default_prob) ./ (1 + r)

        V_target = max.(Vc, Vd_compat) #array created

        return V_target, Vc_target, Vd_target, Q_target

    end

    iterate(V, Vc, Vd, Q)  # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats

end

print(1000 * main(1351, 10))

In short, a great speedup can be achieved with just changing some calls with implace versions and preallocating the arrays before the iterate definition

aaowens · January 9, 2020, 12:28am

It looks like this code is mostly matrix and vector operations, which explains how they have a PyTorch version. It’s not clear that Julia should be any better at this, especially since Matlab uses multithreading by default and Julia doesn’t except for BLAS operations.

Albert_Zevelev · January 9, 2020, 12:31am

I see. But does that help explain why Julia is outperformed by regular Python/numpy?

jling · January 9, 2020, 12:32am

numpy is basically C, if the overhead of the python interperater is minimal, numpy should be as fast as C since it’s only ~function call slower

Oscar_Smith · January 9, 2020, 12:42am

A cursory look through the code also suggests that they are using 1 in a bunch of places where 1.0 would be better. I’m not sure how much performance difference it makes here though.

Zach_Christensen · January 9, 2020, 2:55am

I don’t care to figure out exactly what they’re trying to do so I apologize for the half-hearted answer, but a lot of stuff there smells bad. This one in particular just seems poorly planned.

default_states = float(Vd_compat .> Vc)

Also, similar syntax is a pretty bad standard for equating similar levels of effort for each set of code. They should seriously think about using the number of lines of code as a better standard. Similar syntax only ensures easy reading and implementation for those that use that type of syntax.

P.S.
Their R code is also less than spectacular but why bother optimizing R code?

StefanKarpinski · January 9, 2020, 3:45am

Y’all are missing the big one and first thing to check: the code uses non-constant globals.

https://github.com/vduarte/benchmarkingML/pull/1

More speedup is certainly possible, but this is the big performance killer.

jling · January 9, 2020, 4:35am

I actually tried to make them global and saw no significant improvement.

StefanKarpinski · January 9, 2020, 5:11am

Do you mean const?

jling · January 9, 2020, 5:14am

yes, exactly as your pr,

10975.95670223236% # with const                                                                                                                                    

11052.845978736877%

StefanKarpinski · January 9, 2020, 5:16am

Strange.

Albert_Zevelev · January 9, 2020, 5:17am

They find similar results for option pricing:
https://github.com/vduarte/benchmarkingML/tree/master/LSMC
I’m not sure if this also applies

Sijun · January 9, 2020, 6:23am

Profiler shows that most of time is spent on broadcasting

c = @. y .- Qnext .* Bnext .+ B
and
m = @. u(c, γ) .+ β * EV

Maybe Julia’s broadcasting has more room for improvement?

DNF · January 9, 2020, 6:25am

Why the double broadcasting? Both . and @., I mean.

Sijun · January 9, 2020, 6:27am

I tried with and without it. Without @. took about 10% more time.

DNF · January 9, 2020, 7:20am

You should just use @., then, not both.

Sijun · January 9, 2020, 8:19am

There seems to be some performance difference in broadcasting operation between Julia and numpy.

C = np.random.rand(256, 1351, 1351)

%%timeit
C**0.3

15.3 s ± 805 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compare to:

const C=rand(256, 1351, 1351)

@time C.^0.3  # second time calling

34.271383 seconds (8 allocations: 3.481 GiB, 0.01% gc time)

Topic		Replies	Views
Benchmark MATLAB & Julia for Matrix Operations Performance	148	19797	October 15, 2019
Porting code from MatLab - performance tips New to Julia	18	425	June 26, 2024
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1895	June 25, 2023
A Comparison of Programming Languages in Economics Community	62	6214	August 7, 2018
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8398	March 5, 2017

Julia slower than Matlab & Python? No

Related topics