Wouldn't you get it after step 3?
For what it's worth:
julia> using LoopVectorization, BenchmarkTools
julia> C = rand(100,100,100);
julia> D = similar(C);
julia> @btime @avx $D .= $C.^0.3;
5.376 ms (0 allocations: 0 bytes)
julia> @btime $D .= $C.^0.3;
13.440 ms (0 allocations: 0 bytes)
julia> vC = vec(C); vD = vec(D);
julia> @btime @avx $vD .= $vC.^0.3;
4.555 ms (0 allocations: 0 bytes)
This requires LoopVectorization version 0.3.3, which should merge shortly (there was a bug in earlier versions that I've fixed).
EDIT: If you're willing to give up a little more accuracy:
julia> @btime @avx ($vD .= log.($vC); $vD .= exp.(0.3 .* $vD));
2.175 ms (0 allocations: 0 bytes)
julia> vD ≈ vC .^ 0.3
true
Less accurate, but still isapprox-correct.
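A quick way to see why the rewrite is valid (my own sanity check on random inputs in (0, 1); the identity x^p == exp(p * log(x)) is exact in real arithmetic and only approximate in floating point, which is why ≈ rather than == is the right comparison):
x = rand(100);
x .^ 0.3 ≈ exp.(0.3 .* log.(x))   # true, up to floating-point rounding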
I just ran ] add LoopVectorization#master
and tried:
using BenchmarkTools, LoopVectorization
a = rand(100,100,100);
b = rand(100,100,1); # a and b are not of same size.
@btime @avx a .+ b; # Crashing...
It crashes when the two arrays are not the same size. Do I need to file a report?
Additionally, I tried (on Windows 10, Intel i7):
a = rand(100,100,100);
b = rand(100,100,100);
@btime a .+ b; # 2.241 ms (4 allocations: 7.63 MiB)
@btime @avx a .+ b; # 2.904 ms (4 allocations: 7.63 MiB)
@btime @strided a .+ b; # 1.116 ms (72 allocations: 7.64 MiB)
No, that behaviour is expected. LoopVectorization is meant for ultra-high-performance, low-safety operations. You need to verify correctness prior to sticking something in @avx,
and even if it runs, you need to manually make sure it's giving you correct results.
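As a concrete illustration of that workflow (a minimal sketch of my own, not from the original post): check the @avx result against plain broadcasting on small inputs before trusting it on the real problem.
using LoopVectorization, Test
a = rand(8, 8); b = rand(8, 8);
reference = a .+ b         # plain broadcast as the ground truth
candidate = @avx a .+ b    # the low-safety fast path
@test reference ≈ candidate   # only rely on @avx once this passes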
As for your timing against Strided, that's also expected. Vector addition is already pretty well vectorized, so you won't see any great advantage from LoopVectorization, but Strided does multithreading, so that's consistent with the ~2x performance increase you observed.
Oh, I see. Thanks!
Strided does multithreading, so that's consistent with the ~2x performance increase you observed.
GitHub - Jutho/Strided.jl: A Julia package for strided array views and efficient manipulations thereof: the example in the README shows Strided is effective even in a single thread, which is contrary to my experiment. I wonder if you have any idea on this.
EDIT: sorry, I tried again and found that @strided is effective even in a single thread. I wonder how that is even possible and why it is not the default algorithm in Julia.
EDIT2: another observation: it actually depends on the sizes of the arrays being broadcast. When A and B are the same size, @strided is effective in a single thread too. But when the sizes are different, it is only effective with multiple threads. Oh, I have bothered you enough… I need to stop.
using Strided, BenchmarkTools
A = randn(4000,4000);
B = rand(4000, 1)
@btime ($A .+ $B') ./ 2;
@btime @strided ($A .+ $B') ./ 2;
Note that had you used:
a = rand(100,100,100);
b = rand(100,100);
It would have worked.
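A quick check of that working form, in the same style as the tests below (my own sketch, not from the original post):
using LoopVectorization, Test
a = rand(100,100,100);
b = rand(100,100);        # trailing singleton dimension dropped entirely
c1 = a .+ b;
c2 = @avx a .+ b;
@test c1 ≈ c2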
I added a couple tests showing workarounds.
For now, you have to indicate missing dimensions with type information. On LoopVectorization 0.3.4:
using LoopVectorization, Test
T = Float64
a = rand(T,100,100,100);
b = rand(T,100,100,1);
bl = LowDimArray{(true,true,false)}(b);
br = reshape(b, (100,100));
c1 = a .+ b;
c2 = @avx a .+ bl;
@test c1 ≈ c2
fill!(c2, 99999.9);
@avx c2 .= a .+ br;
@test c1 ≈ c2
br = reshape(b, (100,1,100));
bl = LowDimArray{(true,false,true)}(br);
@. c1 = a + br;
fill!(c2, 99999.9);
@avx @. c2 = a + bl;
@test c1 ≈ c2
br = reshape(b, (1,100,100));
bl = LowDimArray{(false,true,true)}(br)
@. c1 = a + br;
fill!(c2, 99999.9);
@avx @. c2 = a + bl;
@test c1 ≈ c2
If I wanted to fix it automatically, one approach would be to use a different stridedpointer function when broadcasting, one that sets the strides corresponding to dims of size 1 to zero, something like:
@inline function broadcaststridedpointer(A::DenseArray{T,N}) where {T,N}
    stridesA = strides(A)
    sizeA = size(A)
    PackedStridedPointer(
        pointer(A),
        ntuple(n -> sizeA[n+1] == 1 ? 0 : stridesA[n+1], Val(N-1))
    )
end
However, if the first axis has size 1 (so its stride would have to be zero), this will not work, because the pointer type guarantees unit stride in the first dimension. Lifting that guarantee would require dispatching to gather and scatter! instead of vload and vstore!. The former are much slower, but I use them when the types show it is necessary (e.g., a SubArray with an Int as the first index instead of a range or colon).
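For a concrete sense of the kinds of types that trigger the gather/scatter path (my own illustration, independent of LoopVectorization's internals):
A = rand(100, 100);
v_unit   = view(A, :, 3);   # Colon as the first index: unit stride, vload/vstore! apply
v_gather = view(A, 3, :);   # Int as the first index: consecutive elements are 100 apart
strides(v_unit), strides(v_gather)   # ((1,), (100,))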
I also improved performance on this test case. It was unrolling by 5, but this always seems to cause bad performance. For now I capped it at 4, but perhaps I should disallow 3 as well.
I'm still fiddling with tuning the computational optimizations.
It does not consider memory optimizations yet, so there currently isn't much point in benchmarking or evaluating it on large, memory-bound problems (but it will still help for vectorizing things like ^, exp, or log).
Perhaps I should add a naive hack to make it at least favor putting loops in the correct order when the orders are computationally equivalent, to help until I implement a real model of memory access costs.
The + example is easy enough for the compiler to optimize, and at those sizes it should be memory bound. Same story with "(A + B) / 2".
Unsurprisingly, I see about the same performance:
julia> using LoopVectorization, BenchmarkTools, Strided
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
julia> A = randn(4000,4000); B = rand(4000);
julia> @btime @avx ($A .+ $B') ./ 2;
26.284 ms (2 allocations: 122.07 MiB)
julia> @btime ($A .+ $B') ./ 2;
26.389 ms (2 allocations: 122.07 MiB)
julia> @btime @strided ($A .+ $B') ./ 2;
26.338 ms (6 allocations: 122.07 MiB)
julia> Threads.nthreads()
1
Although taking the first example from Strided's README:
julia> Threads.nthreads()
1
julia> B = similar(A);
julia> @btime @strided $B .= ($A .+ $A') ./ 2;
61.788 ms (18 allocations: 848 bytes)
julia> @btime @avx $B .= ($A .+ $A') ./ 2;
48.441 ms (0 allocations: 0 bytes)
julia> @btime $B .= ($A .+ $A') ./ 2;
67.328 ms (0 allocations: 0 bytes)
@avx uses gather instructions to load from the transposed matrix, so this might require AVX512 to be fast.
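The gathers come from Julia's column-major layout rather than anything specific to LoopVectorization; a small illustration:
A = reshape(collect(1:16), 4, 4);
A[1, :]    # [1, 5, 9, 13]: a row of A is strided in memory (step of 4 elements)
A'[:, 1]   # the same non-contiguous elements, now read as a "column" of A'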
If we have more threads though, Strided obviously wins by a substantial margin:
julia> Threads.nthreads()
18
julia> @btime @strided $B .= ($A .+ $A') ./ 2;
7.965 ms (120 allocations: 16.47 KiB)
Simply inlining the iterate function and changing
c = @. y .- Qnext .* Bnext .+ B
c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
to
c = @. y - Qnext * Bnext + B |> x -> x > 0.0 ? x : 1e-14
gives a 2X speed increase for me.
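The same clamping can also be written with max, which is the form left commented out in the full script below. A toy-sized check that the two formulations agree (shapes here are arbitrary stand-ins, not the script's):
y = rand(4, 1, 1); Qnext = rand(4, 1, 3); Bnext = rand(1, 1, 3); B = rand(1, 3, 1);
c_pipe = @. y - Qnext * Bnext + B |> x -> x > 0.0 ? x : 1e-14;
c_max  = @. max(1e-14, y - Qnext * Bnext + B);
c_pipe ≈ c_max   # true (up to the clamping threshold)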
Here is the code (I recreated the "data" to make it a minimal working example):
using Statistics
# Reproduce data:
mu = collect(range(-0.03, 1.03, length=256))
mu[1] = -0.0789; mu[end] = 1.0789
Py2 = zeros(256, 256)
x = range(0, 1, length=256)
h = vcat(0.679, ones(254)*0.0287, 0.679)
sigma = vcat(0.0908, ones(254)*0.0816, 0.0908)
for i = 1:256
    Py2[:, i] = h[i] * exp.(-((x .- mu[i]) / sigma[i]).^2)
end
const Py = copy(Py2)
const logy_grid = range(-0.2293084801321751, 0.2293084801321751, length=256)
function main_original(nB, repeats)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)
    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)
    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min.(0.969 * ymean, ygrid)  # elementwise min
    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95
    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))
    zero_ind = Int(ceil(nB / 2))
    function u(c, γ)
        return c.^(1 - γ) / (1 - γ)
    end
    t0 = time()
    function iterate(V, Vc, Vd, Q)
        EV = Py * V
        EVd = Py * Vd
        EVc = Py * Vc
        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
        Vd_target = reshape(Vd_target, (ny, 1))
        Qnext = reshape(Q, (ny, 1, nB))
        c = @. y .- Qnext .* Bnext .+ B
        c[c .<= 0] = 1e-14 .+ 0 * c[c .<= 0]
        EV = reshape(EV, (ny, 1, nB))
        m = @. u(c, γ) .+ β * EV
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))
        Vd_compat = Vd * ones(1, nB)
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states
        Q_target = (1. .- default_prob) ./ (1 + r)
        V_target = max.(Vc, Vd_compat)
        return V_target, Vc_target, Vd_target, Q_target
    end
    iterate(V, Vc, Vd, Q) # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats
end
function main_with_inlining(nB, repeats)#, logy_grid, Py)
    β = .953
    γ = 2.
    r = 0.017
    θ = 0.282
    ny = size(logy_grid, 1)
    Bgrid = LinRange(-.45, .45, nB)
    ygrid = exp.(logy_grid)
    ymean = mean(ygrid) .+ 0 * ygrid
    def_y = min.(0.969 * ymean, ygrid)  # elementwise min
    Vd = zeros(ny, 1)
    Vc = zeros(ny, nB)
    V = zeros(ny, nB)
    Q = ones(ny, nB) * .95
    y = reshape(ygrid, (ny, 1, 1))
    B = reshape(Bgrid, (1, nB, 1))
    Bnext = reshape(Bgrid, (1, 1, nB))
    zero_ind = Int(ceil(nB / 2))
    function u(c, γ)
        return c.^(1 - γ) / (1 - γ)
    end
    t0 = time()
    @inline function iterate(V, Vc, Vd, Q)
        EV = Py * V
        EVd = Py * Vd
        EVc = Py * Vc
        Vd_target = u(def_y, γ) + β * (θ * EVc[:, zero_ind] + (1 - θ) * EVd[:])
        Vd_target = reshape(Vd_target, (ny, 1))
        Qnext = reshape(Q, (ny, 1, nB))
        # c = @. max(1e-14, y - Qnext * Bnext + B)
        c = @. y - Qnext * Bnext + B |> x -> x > 0.0 ? x : 1e-14
        EV = reshape(EV, (ny, 1, nB))
        m = @. u(c, γ) + β * EV
        Vc_target = reshape(maximum(m, dims=3), (ny, nB))
        Vd_compat = Vd * ones(1, nB)
        default_states = float(Vd_compat .> Vc)
        default_prob = Py * default_states
        Q_target = (1. .- default_prob) ./ (1 + r)
        V_target = max.(Vc, Vd_compat)
        return V_target, Vc_target, Vd_target, Q_target
    end
    iterate(V, Vc, Vd, Q) # warmup
    t0 = time()
    for iteration in 1:repeats
        V, Vc, Vd, Q = iterate(V, Vc, Vd, Q)
    end
    t1 = time()
    out = (t1 - t0) / repeats
end
n = 2
m = 151
println(1000 * main_original(m, n))
println(1000 * main_with_inlining(m, n))
n = 2
mm = [151,351,551,751,951,1151,1351,1551]
res = zeros(length(mm), 2)
for (i, m) in enumerate(mm)
    res[i, 1] = main_original(m, n)
    res[i, 2] = main_with_inlining(m, n)
    display(Int.(round.(res * 1000)))
end
Result (right column is with inlining):
189 105
814 390
1920 873
3611 1639
5760 2455
8327 3546
11313 4836
14961 6321
Disclaimer: this is getting very much off topic and should maybe be split to a separate thread.
I am actually surprised that Strided manages to speed up such a simple operation by involving more threads. I would have thought that this operation would be mostly memory bound.
Anyway, there are two settings in which Strided can speed up array operations:
- many computations per iteration, by using multithreading
- memory-unfriendly access patterns (permutedims and friends), even without threads (see the sketch below)
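The second setting is exactly what the (A .+ A') ./ 2 benchmark earlier in the thread exercises; repeated here as a self-contained, single-thread sketch:
using Strided, BenchmarkTools
A = randn(4000, 4000); B = similar(A);
@btime $B .= ($A .+ $A') ./ 2;            # plain broadcast
@btime @strided $B .= ($A .+ $A') ./ 2;   # Strided's blocked kernel, still one thread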
However, the microkernel in Strided is not very well optimized, and would certainly benefit from the work of @elrod on exploiting vectorization, to yield even further speedups. Unfortunately, I don't think I can just plug a simple @avx decoration into the current kernel, because at that point the operation has already been dissected into manually incrementing (linear) indices with appropriate strides. I am unfortunately not very familiar with the low-level vector instructions and how to call them from within Julia.
Maybe some collaboration could be fruitful, @elrod?
After several unsuccessful optimization attempts, I noticed something.
In the "warmup" run, compute_price(Spot, σ, K, r) is called on (Int, Float64, Int, Float64).
In dP_dS = (compute_price(Spot + ε, σ, K, r) - P) / ε, the arguments of compute_price are (Float64, Float64, Int, Float64), etc.
So, the time reported by the script still includes compilation time.
I tried changing Spot, σ, K, r so they are all floats, but there is still no speedup.
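A self-contained toy showing the effect being described (this is not the script's compute_price, just an illustration of how an Int versus a Float64 argument triggers a second compilation that a naive wall-clock timing will count):
f(x, y) = x * y + y
@time f(36, 0.2)     # first call: compiles f(::Int, ::Float64)
@time f(36.0, 0.2)   # new argument types: compiles f(::Float64, ::Float64)
@time f(36.0, 0.2)   # already compiled, so this one is fast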
Any luck @Vasily_Pisarev?
Some speedup from const-ing all the globals and replacing line 20 by
@. @views B[:,n] = 2 * x * B[:,n - 1] - B[:,n - 2]
But that still ends up slower than numpy.
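For reference, why that line helps (a stand-alone sketch with made-up sizes, and a plausible pre-change form, since the original line 20 isn't shown here): plain column slices on the right-hand side allocate copies, while @views reads through views and @. fuses the whole update into the existing column of B.
x = rand(1000); B = rand(1000, 30); n = 3;
B[:, n] = 2 .* x .* B[:, n - 1] .- B[:, n - 2]          # each RHS slice allocates a copy
@. @views B[:, n] = 2 * x * B[:, n - 1] - B[:, n - 2]   # views + fused broadcast, no copies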
From the CPU usage I'd conclude that "idiomatic" numpy is just better at passing the heavy lifting to BLAS than "idiomatic" Julia.
I think a lot of it is that multithreading is only a few months old in Julia, so not much of Base takes advantage yet. If we could get broadcasting to use threads, that would be huge.
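For what threaded broadcasting could look like, here is a hand-rolled sketch that threads a broadcast over columns (my own illustration, assuming Julia ≥ 1.3 with JULIA_NUM_THREADS set; nothing like this is in Base):
using Base.Threads

function threaded_add!(C, A, B)
    # Assumes C, A, B are matrices of equal size; each task handles whole columns.
    @threads for j in axes(A, 2)
        @inbounds @views C[:, j] .= A[:, j] .+ B[:, j]
    end
    return C
end

A = rand(4000, 4000); B = rand(4000, 4000); C = similar(A);
threaded_add!(C, A, B) ≈ A .+ B   # true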
Abstracting from the specific code, I would like to highlight the endogeneity problem here: generalising statements such as the premise of this topic are exactly what gives rise to such topics.
Yes, there are (probably many) instances where Julia is faster than Matlab (or other languages), but the premise that it should "usually" be faster is misleading (especially in the context of this topic, where it appears as if it should "always" be faster). I say that because I was also misled: when I started learning Julia I was expecting all those gains in speed, even for relatively simple tasks, exactly because of the Villaverde paper and the comparisons on the website.
If you did DSGE models with VFI as in the mentioned paper AND it came out slower than comparable languages, it would indeed be interesting and puzzling. However, once you start doing other things, it becomes much more complicated to predict which is faster.
I work in time series, especially Bayesian time series econometrics, and spend a lot of time waiting for my code. I tried switching to Julia only to find that it is slower than Matlab in my cases too. I then looked around and asked experienced Julia programmers (in person, but I think I also had some discussions here) for help, and in my case it turned out it cannot get faster: it all depends on the BLAS in the background.
Remember also, that code might be slower for two reasons:
- the language is slower
- the code is not written well for the language.
No. 2 is BIG for me: I have over 13 years of experience with Matlab and at best half a year with Julia. I am sure I write suboptimal Julia code, because I am plagued with doing stuff necessary in Matlab but unneeded here, and because I don't know the best way to do things in Julia. To compare 1) objectively, you need to write the code in the best way possible for each language and account for the time of writing the code in that manner.
All I am saying is that it appears you think Julia should be faster per se, exactly as I did, and I think we should watch out, as this sets expectations for new users.
Porting from Matlab to Julia is the most difficult one, since Matlab has put in crazy effort to make programming patterns that are toxic to all other languages super fast…
I usually say that in almost all cases Julia can get to machine performance like C, and if it doesn't, it is a bug.
That of course says nothing about the effort needed to achieve this… Of course, if you have a language that is super slow but which happens to offer a fast C library exactly tailored to your problem, your effort to get state-of-the-art performance in that slow language will be zero, and nonzero in any language that doesn't have the same library handy.
The only valid point about Julia in that context is that, given enough time and effort, it should (almost) always be possible to reach the best performance.
I actually think that the premise that Julia should "usually" be faster is entirely accurate (after accounting for compilation time). It just so happens that the case of everything being reduced to BLAS calls is one of a handful of cases where "usually" doesn't apply. The other cases are usually when you're doing something sufficiently easy that Numpy, Numba, Matlab JIT, etc. can optimize their respective languages' code to close to C performance as well. These cases do happen, but I wouldn't say they're "usual." Then again, there's a certain selection bias. Many people using Julia are using Julia precisely because their problem is too hard for Python or Matlab to optimize effectively, and many people not using Julia stayed with Python or Matlab because that wasn't the case for them.
(Also, recent developments have convinced me that it's quite possible that someday there will be a Julia-based BLAS competitive with MKL, so even the BLAS case may get better!)
In that case, can you help figure out why the option pricing example is slower in Julia than Matlab/Numpy/etc.? @inline doesn't seem to help.
Re: https://github.com/vduarte/benchmarkingML/blob/master/LSMC/Julia.jl
const Spot = 36
const σ = 0.2
const n = 100000
const m = 10
const K = 40
const r = 0.06
const T = 1
const order = 25
const Δt = T / m
@StefanKarpinski If I make the above global variables const in the original code, it runs more slowly.
@Albert_Zevelev If I make two versions of your code with and without const variables (as above), the version with the const globals is slower most of the time. However, your code is significantly faster than the original. (I agree with your commented-out lines.)
This is the reason I don't really like the way people market Julia. The way people talk about Julia's performance gives many people the impression that they can walk over, write their (sometimes horrific) Matlab / Python code in Julia (while expecting minimal syntax / language differences) and see order-of-magnitude performance improvements.
Sometimes naïvely written Julia is crazy fast. Sometimes it's not. The difference is that in Julia you actually can buckle down, look at a performance hotspot, and tune it to C speeds without ever leaving Julia.
High-performance Julia code can often be completely unrecognizable as "Julia" code and might be more reminiscent of Fortran or C++ than Python (though I'd argue that it's easier to learn to write high-performance Julia code if you already know Julia than it is to learn to write Fortran or C++).
However, the situation is still better than stuff like Python + Cython or Matlab + calling out to Fortran routines. Even though some "high-performance" Julia code is unrecognizable to humans, the compiler still sees it as just Julia code. Having all code be seen by the same compiler enables certain optimizations (interprocedural optimizations) that aren't possible in other paradigms, and more importantly, it enables much more generic code and allows people to make amazing tools like Cassette.jl, or language-wide automatic differentiation, and have them just work on practically any package.