Here is the graph of the original results with a log Y scale, for better viewing.
I would imagine that a single measurement with @time isn't the best way to benchmark things, even if it is the second run.
```julia
julia> using BenchmarkTools

julia> @btime C.^0.3
  9.425 s (2 allocations: 3.48 GiB)
```
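The same point holds on the Python side: a one-shot wall-clock measurement picks up whatever noise the machine had at that instant, which is why %timeit and @btime repeat the statement. A minimal sketch of the difference, using a pure-Python stand-in workload (not the paper's code) and only the standard-library `timeit` module:

```python
# Sketch: single measurement vs. repeated measurement of the same statement.
# The workload here is a hypothetical stand-in, not the benchmark from the thread.
import timeit

setup = "import random; xs = [random.random() for _ in range(100_000)]"
stmt = "[x ** 0.3 for x in xs]"

# One-shot timing: includes any transient noise (scheduling, caches, ...).
one_shot = timeit.timeit(stmt, setup=setup, number=1)

# Repeated timing: take the best over several repeats, as %timeit effectively does.
best = min(timeit.repeat(stmt, setup=setup, number=5, repeat=5)) / 5

print(f"one-shot: {one_shot:.4f} s, best-of-repeats: {best:.4f} s")
```

The one-shot number is typically the more pessimistic and the more variable of the two, which is the argument for @btime over a lone @time.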
I think they need to be compared on the same machine.
I tried with a smaller array and the Julia version still takes about twice as long (I used @btime for the measurement):
```python
C = np.random.rand(100,100,100)

%%timeit
C**0.3
29.2 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
```julia
C = rand(100,100,100)

@btime C.^0.3
  67.824 ms (4 allocations: 7.63 MiB)
```
Could you try
I have no idea why you are getting this. Here is what I get on my machine:
```python
In : import numpy as np

In : C = np.random.rand(100,100,100)

In : %%timeit
...: C**0.3
...:
...:
17.5 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
```julia
julia> using BenchmarkTools

julia> C = rand(100,100,100);

julia> @btime C.^0.3;
  17.387 ms (4 allocations: 7.63 MiB)
```
Here it is:
```julia
julia> @btime $C.^0.3;
  74.537 ms (2 allocations: 7.63 MiB)
```
Really strange… I am getting the same result at the office and at home.
What exactly does that mean?
```julia
Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:]) # calling a[:] allocates a new array
```
How can that be fixed?
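On the Julia side, the straightforward fix is to drop the `[:]` (or use `@view`), since `EVd[:]` copies the whole vector. NumPy behaves differently here, which is part of why the equivalent Python line is cheap: basic slicing like `a[:]` returns a view that shares the original buffer, and only an explicit `.copy()` (or fancy indexing) allocates. A small sketch of that view-vs-copy distinction:

```python
# Sketch: NumPy basic slicing returns a view, not a copy.
import numpy as np

a = np.arange(10.0)

view = a[:]            # basic slice: a view sharing a's buffer, no allocation
assert view.base is a  # confirms view has no buffer of its own

view[0] = 99.0         # writing through the view mutates a itself
assert a[0] == 99.0

copied = a[:].copy()   # an explicit copy does allocate a new buffer
copied[1] = -1.0       # mutating the copy leaves a untouched
assert a[1] == 1.0

print(a[:3])
```

So a literal Julia translation of NumPy code that uses `a[:]` freely can end up allocating on every call where the Python original allocated nothing.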
Actually, I see a similar performance difference as @Sijun, about a factor of 2 between the NumPy and Julia versions. Strange. Is this perhaps some MKL/LIBM or multithreading issue (just guessing)?
In the paper linked in the original post, they compare a few different GPU implementations but don't mention that Julia can run on GPUs. Is there anyone around with a GPU who wants to show off CuArrays.jl? It seems like a good fit for this problem.
I don’t have an MKL build of Julia.
Many Python distributions use MKL by default. Having VML hooked up to NumPy would explain these differences, I think.
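If that hypothesis is right, it should at least show up in the local NumPy build information. A sketch of one way to check, assuming only that `np.show_config()` prints the BLAS/LAPACK build configuration to stdout (which it does in recent NumPy releases):

```python
# Sketch: capture NumPy's build configuration and look for MKL in it.
import io
import contextlib
import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()   # prints linked BLAS/LAPACK libraries, include dirs, etc.
info = buf.getvalue()

print("MKL mentioned in NumPy build config:", "mkl" in info.lower())
```

A build that reports MKL here is the kind that could be routing vectorized math like `C**0.3` through MKL/VML kernels, which would make cross-machine comparisons against a default Julia build apples-to-oranges.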
The NumPy and Julia versions run at close to the same speed on both my MacBook and an Ubuntu workstation (default Julia 1.3 vs. miniconda3 with MKL).
Should have said it before: I’m on Windows 10 on this machine.
I made some modifications to Tim Holy's PR and found that, by using Strided.jl, I was able to beat Python/NumPy and Matlab for all sizes other than 151. Here are my timings on the same model of MacBook as the one the authors used:
```
151:   326.1284828186035
351:   308.24360847473145
551:   690.5834913253784
751:  1231.9454908370972
951:  1912.0723962783813
1151:  5935.046696662903
1351: 18267.05288887024
1551: 29274.00109767914
```
I believe this suggests that the difference was that Python and Matlab were multithreading the broadcast operations across the two available threads, whereas Julia does not multithread broadcasts unless you use something like Strided.jl.
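One way to probe the threading hypothesis from the Python side is to pin the thread pools before NumPy is imported and re-run the thread's benchmark. `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the usual knobs for OpenMP and MKL builds (whether a given NumPy build threads elementwise power at all depends on how it was built, so this is a diagnostic, not a guarantee):

```python
# Sketch: pin math-library thread pools to 1 *before* importing numpy,
# since these environment variables are read at library initialization.
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
import timeit

C = np.random.rand(100, 100, 100)

# Same workload as the benchmark earlier in the thread, timed with stdlib timeit.
t = min(timeit.repeat(lambda: C ** 0.3, number=5, repeat=3)) / 5
print(f"C**0.3: {t * 1e3:.1f} ms per evaluation with thread pools pinned to 1")
```

If the NumPy timing degrades toward the single-threaded Julia numbers with the pools pinned, that points at multithreaded kernels rather than better scalar code.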
The second time it's run, the timings improve. I'm not sure exactly why; it seems like all the JIT overhead should have been hit before the benchmarking loop in the first function call. Nonetheless, here is the second-run timing:
```
julia> include("julia.jl")
151:   89.70789909362793
351:  288.47689628601074
551:  686.4475965499878
751: 1315.5532121658325
951: 2086.0692977905273
1151: 5694.838190078735
1351: 18829.65850830078
1551: 31184.902906417847
```
Publish soon? This is already published in JEDC: https://www.sciencedirect.com/science/article/pii/S0165188919301939
The author’s website says “Conditionally Accepted”
Apparently not conditional on checking with the Julia community first
Suddenly I know how to get optimized programs for free…
Step 1: Code up programs for my paper in Matlab
Step 2: Write a naive version in Julia
Step 3: Post on Julia Discourse
Step 4: …
Step 5: Publish!