I see @simd can be used to vectorize for loops. How about for broadcast()?

- Will broadcast(f,A) translate to SIMD instructions of f operating on the elements of A?
- Will multiple instances of f be sent to multiple threads?
You don’t need @simd. That annotation is only required in cases where SIMD breaks some assumptions about how the computation is performed, in particular for floating point operations. So broadcast will use SIMD when the compiler is able to make that optimization. I don’t think @simd can be specified for broadcast currently, though.

To check whether SIMD is used, just look for vector annotations in the output of @code_llvm, or for SIMD instructions in the output of @code_native. For example:
f(X, Y) = broadcast(*, X, Y)
X = rand(Int, 1000);
Y = rand(Int, 1000);
@code_llvm f(X, Y)
@code_native f(X, Y)
Would you mind showing an example of what this looks like? I cannot find any vector annotations in my @code_llvm output (using -O3). Nor do I know what SIMD instructions in @code_native look like.
For example, should this function yield SIMD instructions?
function square!(c::Array, x::Array)
@assert size(c) == size(x)
@simd for i in eachindex(x)
@inbounds c[i] = x[i] * x[i]
end
end
What do these instructions look like?
Are you using a downloaded binary of Julia, or did you compile it yourself? This makes a difference as to the type of machine operations that are used.
Thanks for the information. Very helpful. To go a bit further down the rabbit hole, is this the information regarding Julia’s SIMD decision making? Or does Julia do anything beyond LLVM auto-vectorization?
@DNF I get the output below with the Julia 0.6 I compiled from source (for a Skylake CPU). But with the generic binaries no vectorization happens, which means I guess that only recent instructions can give a performance improvement in that case (and they are not enabled on generic builds). That’s still surprising since I would have expected at least that SSE/SSE2 instructions could be used, as they are available on all x86_64 CPUs…
julia> function square!(c::Array, x::Array)
@assert size(c) == size(x)
@simd for i in eachindex(x)
@inbounds c[i] = x[i] * x[i]
end
end
square! (generic function with 1 method)
julia> X = rand(Int, 1000);
julia> Y = rand(Int, 1000);
julia> @code_llvm square!(X, Y)
define void @"julia_square!_68142"(i8**, i8**) #0 !dbg !5 {
...
min.iters.checked: ; preds = %if4.lr.ph.us11
%n.vec = and i64 %36, -16
%cmp.zero = icmp eq i64 %n.vec, 0
br i1 %cmp.zero, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %min.iters.checked
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%42 = getelementptr i64, i64* %40, i64 %index
%43 = bitcast i64* %42 to <4 x i64>*
%wide.load = load <4 x i64>, <4 x i64>* %43, align 8
%44 = getelementptr i64, i64* %42, i64 4
%45 = bitcast i64* %44 to <4 x i64>*
%wide.load20 = load <4 x i64>, <4 x i64>* %45, align 8
%46 = getelementptr i64, i64* %42, i64 8
%47 = bitcast i64* %46 to <4 x i64>*
%wide.load21 = load <4 x i64>, <4 x i64>* %47, align 8
%48 = getelementptr i64, i64* %42, i64 12
%49 = bitcast i64* %48 to <4 x i64>*
%wide.load22 = load <4 x i64>, <4 x i64>* %49, align 8
%50 = mul <4 x i64> %wide.load, %wide.load
%51 = mul <4 x i64> %wide.load20, %wide.load20
%52 = mul <4 x i64> %wide.load21, %wide.load21
%53 = mul <4 x i64> %wide.load22, %wide.load22
%54 = getelementptr i64, i64* %41, i64 %index
%55 = bitcast i64* %54 to <4 x i64>*
store <4 x i64> %50, <4 x i64>* %55, align 8
%56 = getelementptr i64, i64* %54, i64 4
%57 = bitcast i64* %56 to <4 x i64>*
store <4 x i64> %51, <4 x i64>* %57, align 8
%58 = getelementptr i64, i64* %54, i64 8
%59 = bitcast i64* %58 to <4 x i64>*
store <4 x i64> %52, <4 x i64>* %59, align 8
%60 = getelementptr i64, i64* %54, i64 12
%61 = bitcast i64* %60 to <4 x i64>*
store <4 x i64> %53, <4 x i64>* %61, align 8
%index.next = add i64 %index, 16
%62 = icmp eq i64 %index.next, %n.vec
br i1 %62, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
%cmp.n = icmp eq i64 %36, %n.vec
br i1 %cmp.n, label %L100.loopexit, label %scalar.ph
...
}
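The <4 x i64> types in the wide.load / mul / store lines above are exactly the vector annotations to look for: each operation processes four 64-bit integers at once (and the loop is additionally unrolled by a factor of four, which is why index.next advances by 16). In the @code_native output, the same loop shows up as instructions operating on %xmm or %ymm registers.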
@mcovalt Yes, Julia relies on LLVM to enable SIMD instructions.
I’ve found the explanation: SSE instructions are enabled even with the generic binaries when using an Array{Int32}. So it’s just that you need AVX or AVX2 to vectorize 64-bit int multiplication.
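A quick way to reproduce that check with the square! function from above (a sketch, with illustrative names X32 and Y32; the exact vector width in the IR depends on your CPU and Julia build):

X32 = rand(Int32, 1000);
Y32 = similar(X32);   # output array with the same element type
@code_llvm square!(Y32, X32)   # with Int32, even SSE2 can vectorize: look for <N x i32> types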
I’m using a downloaded binary, yes.
Do you mean Array{Floatxx}? And is there anywhere I can read about what instructions are SIMD’able? What about filling a Complex array, for example?
I presume that AVX is some software that is independent from Julia?
Also on StackOverflow:
No, I meant Array{Int32}. With Array{Float64}, I get SIMD instructions with my custom build from source, but not with the generic binaries (probably for the same reason that Array{Int} didn’t use SIMD either).
You should be able to find lots of information on the Web about SIMD; this is not specific to Julia at all. Filling a Complex array could work.
No, AVX is not software but a set of CPU instructions: Advanced Vector Extensions - Wikipedia
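As an aside, you can check which CPU Julia detected from the REPL (a sketch; the exact output format varies between Julia versions):

versioninfo()   # the "CPU:" line shows the processor model; look up its AVX support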
@mcovalt Please do not post the same question twice at the same time on different forums. Thanks.
So I need to build Julia from source, then…
I think rebuilding the system image with cpu_target="native", force=true
may be enough. You’ll need a working linker, though. See http://docs.julialang.org/en/stable/devdocs/sysimg/
I have a command for rebuilding the system image that worked for me on Windows and Linux listed in this blog post:
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)
For Windows, add:
Pkg.add("WinRPM"); WinRPM.install("gcc")
I put this as one of the 7 Julia Gotchas for this reason. Try all you want to get all of this super cool Julia SIMD going on, but you won’t actually see it unless you build from source or rebuild the system image (for 64-bit instructions, which are probably the most common ones!). After rebuilding the system image though, I noticed what @nalimilan has pointed out:
@code_llvm and @code_native will show SIMD vectorization is being used. Even without @simd, you’ll notice the compiler will many times automatically use SIMD.

@ChrisRackauckas Thanks. Sounds like something which would be useful as a tip in the manual. We could even write a simpler version of the page I linked to, geared towards users rather than developers.
@ChrisRackauckas’s advice does indeed work, and I get vectorization in some cases. For example, I seem to get vectorization of multiplication, but not of trigonometric functions like cos, though the docs seem to indicate that they could be vectorized too.
How can I find out which operations should or could be SIMD’ed?
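One way to probe a specific operation is to wrap it in the same kind of loop as square! and inspect the IR (a sketch; cos_loop! is just an illustrative name):

function cos_loop!(c::Array, x::Array)
    @simd for i in eachindex(x)
        @inbounds c[i] = cos(x[i])
    end
end
@code_llvm cos_loop!(zeros(1000), rand(1000))
# A scalar call to cos inside the loop body (rather than operations on
# <N x double> vectors) means the loop was not vectorized.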
I have installed Julia 0.5 using sudo apt-get install in Ubuntu (Julia is in /usr/bin/julia). I can’t locate the file build_sysimg.jl. Because of that, joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl") gives me an error.
Use the DATAROOTDIR version of the path from the devdocs. Hardcoding “share” and assuming the relative paths will always be the same is incorrect for some methods of installing Julia. You could also use the generic Linux binaries instead of the PPA, given the latter is unmaintained.
@tkelman could you share a more robust version of the path? Would it be
joinpath(dirname(JULIA_HOME), DATAROOTDIR, "julia", "build_sysimg.jl")
? I get an error that DATAROOTDIR is undefined.
include(joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "build_sysimg.jl"))
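Then, as in the earlier post, follow the include with build_sysimg(force=true) to actually rebuild the image for your CPU.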