Does Julia use SIMD instructions for broadcast operations?

question
performance
broadcast

#1

I see @simd can be used to vectorize for loops. How about for broadcast()?

  1. Will broadcast(f,A) translate to SIMD instructions of f operating on the elements of A?
  2. Will multiple instances of f be sent to multiple threads?

#2
  1. Actually, in many cases loops will use SIMD vectorization even without @simd. That annotation is only required in cases where SIMD would break some assumptions about how the computation is performed, in particular by reordering floating-point operations.

So broadcast will use SIMD when the compiler is able to make that optimization. I don’t think @simd can be specified for broadcast currently, though. To check whether SIMD is used, just look for vector annotations in the output of @code_llvm, or for SIMD instructions in the output of @code_native. For example:

f(X, Y) = broadcast(*, X, Y)
X = rand(Int, 1000);
Y = rand(Int, 1000);
@code_llvm f(X, Y)
@code_native f(X, Y)
  1. At the moment, only one thread will be used, but automatic multithreading will likely be supported at some point (maybe via a macro annotation).
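To illustrate the floating-point caveat from the first point above, a plain sum reduction is one case where @simd makes a difference: vectorizing it reorders the additions, which LLVM will not do on its own for Float64. A minimal sketch (mysum is just an illustrative name):

```julia
# needed on Julia >= 0.7 to inspect the result with @code_llvm; the macros
# @simd and @inbounds are available from Base without any import.

# A reduction over Float64: without @simd, LLVM must preserve the exact
# left-to-right order of additions and therefore cannot vectorize the loop.
# @simd explicitly permits the reassociation.
function mysum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```

Comparing @code_llvm mysum(rand(1000)) with and without the @simd annotation should show the vectorized and scalar versions, respectively.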

#3

Would you mind showing an example of what this looks like? I cannot find any vector annotations in my @code_llvm output (using -O3). Nor do I know what SIMD instructions in @code_native look like.

For example, should this function yield SIMD instructions?

function square!(c::Array, x::Array)
    @assert size(c) == size(x)
    @simd for i in eachindex(x)
        @inbounds c[i] = x[i] * x[i]
    end
end

What do these instructions look like?


#4

Are you using a downloaded binary of Julia, or did you compile it yourself? This makes a difference as to the type of machine operations that are used.


#5

Thanks for the information. Very helpful. To go a bit further down the rabbit hole, is this the information regarding Julia’s SIMD decision making? Or does Julia do anything beyond LLVM auto-vectorization?


#6

@DNF I get the output below with the Julia 0.6 I compiled from source (for a Skylake CPU). But with the generic binaries no vectorization happens, which I guess means that only recent instruction sets give a performance improvement in that case (and they are not enabled on generic builds). That’s still surprising, since I would have expected at least SSE/SSE2 instructions to be used, as they are available on all x86_64 CPUs…

julia> function square!(c::Array, x::Array)
           @assert size(c) == size(x)
           @simd for i in eachindex(x)
               @inbounds c[i] = x[i] * x[i]
           end
       end
square! (generic function with 1 method)

julia> X = rand(Int, 1000);

julia> Y = rand(Int, 1000);

julia> @code_llvm square!(X, Y)

define void @"julia_square!_68142"(i8**, i8**) #0 !dbg !5 {
...
min.iters.checked:                                ; preds = %if4.lr.ph.us11
  %n.vec = and i64 %36, -16
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %min.iters.checked
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %42 = getelementptr i64, i64* %40, i64 %index
  %43 = bitcast i64* %42 to <4 x i64>*
  %wide.load = load <4 x i64>, <4 x i64>* %43, align 8
  %44 = getelementptr i64, i64* %42, i64 4
  %45 = bitcast i64* %44 to <4 x i64>*
  %wide.load20 = load <4 x i64>, <4 x i64>* %45, align 8
  %46 = getelementptr i64, i64* %42, i64 8
  %47 = bitcast i64* %46 to <4 x i64>*
  %wide.load21 = load <4 x i64>, <4 x i64>* %47, align 8
  %48 = getelementptr i64, i64* %42, i64 12
  %49 = bitcast i64* %48 to <4 x i64>*
  %wide.load22 = load <4 x i64>, <4 x i64>* %49, align 8
  %50 = mul <4 x i64> %wide.load, %wide.load
  %51 = mul <4 x i64> %wide.load20, %wide.load20
  %52 = mul <4 x i64> %wide.load21, %wide.load21
  %53 = mul <4 x i64> %wide.load22, %wide.load22
  %54 = getelementptr i64, i64* %41, i64 %index
  %55 = bitcast i64* %54 to <4 x i64>*
  store <4 x i64> %50, <4 x i64>* %55, align 8
  %56 = getelementptr i64, i64* %54, i64 4
  %57 = bitcast i64* %56 to <4 x i64>*
  store <4 x i64> %51, <4 x i64>* %57, align 8
  %58 = getelementptr i64, i64* %54, i64 8
  %59 = bitcast i64* %58 to <4 x i64>*
  store <4 x i64> %52, <4 x i64>* %59, align 8
  %60 = getelementptr i64, i64* %54, i64 12
  %61 = bitcast i64* %60 to <4 x i64>*
  store <4 x i64> %53, <4 x i64>* %61, align 8
  %index.next = add i64 %index, 16
  %62 = icmp eq i64 %index.next, %n.vec
  br i1 %62, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %36, %n.vec
  br i1 %cmp.n, label %L100.loopexit, label %scalar.ph
...
}

@mcovalt Yes, Julia relies on LLVM to enable SIMD instructions.


#7

I’ve found the explanation: SSE instructions are enabled even with the generic binaries when using an Array{Int32}. So it’s just that you need AVX or AVX2 to vectorize 64-bit integer multiplication.
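A quick way to check this for yourself (a sketch; the function and variable names are illustrative) is to compare the IR generated for the two element types:

```julia
using InteractiveUtils  # provides @code_llvm on Julia >= 0.7; implicit in the 0.6 REPL

# Compare the LLVM IR for Int32 vs Int64 element types. On a build without
# AVX/AVX2, only the Int32 version should show vector types like <4 x i32>.
function square!(c, x)
    @inbounds for i in eachindex(x)
        c[i] = x[i] * x[i]
    end
    return c
end

X32 = rand(Int32, 1000); C32 = similar(X32)
X64 = rand(Int64, 1000); C64 = similar(X64)

@code_llvm square!(C32, X32)   # look for <N x i32> vector loads/stores
@code_llvm square!(C64, X64)   # likely scalar without AVX2 (no <N x i64>)
```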


#8

I’m using a downloaded binary, yes.

Do you mean Array{Floatxx}? And is there anywhere I can read about what instructions are SIMD’able? What about filling a Complex array, for example?

I presume that AVX is some software that is independent from Julia?


#9

Also on StackOverflow:


#10

No, I meant Array{Int32}. With Array{Float64}, I get SIMD instructions with my custom build from source, but not with the generic binaries (probably for the same reason as why Array{Int} didn’t use SIMD either).

You should be able to find lots of information on the Web about SIMD, this is not specific to Julia at all. Filling a Complex array could work.
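To test the Complex case concretely, one could try something along these lines (a sketch; fill_complex! is a made-up name) and inspect it with @code_llvm to see whether the interleaved real/imaginary stores vectorize:

```julia
# Fill a Complex{Float64} array from a Float64 array. Each element stores
# two adjacent Float64 values (real, imag), so vectorization is possible
# but depends on how LLVM handles the interleaved memory layout.
function fill_complex!(c::Vector{Complex{Float64}}, x::Vector{Float64})
    @simd for i in eachindex(x)
        @inbounds c[i] = Complex(x[i], -x[i])
    end
    return c
end
```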

No, that’s a set of CPU instructions: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

@mcovalt Please do not post the same question twice at the same time on different forums. Thanks.


#11

So I need to build Julia from source, then…


#12

I think rebuilding the system image with cpu_target="native", force=true may be enough. You’ll need a working linker, though. See http://docs.julialang.org/en/stable/devdocs/sysimg/


#13

I have a command for rebuilding the system image that worked for me on Windows and Linux listed in this blog post:

include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)

For Windows, add:

Pkg.add("WinRPM"); WinRPM.install("gcc")

I put this as one of the 7 Julia Gotchas for this reason. Try all you want to get all of this super cool Julia SIMD going, but you won’t actually see it unless you build from source or rebuild the system image (at least for operations on 64-bit values, which are probably the most common!). After rebuilding the system image, though, I noticed what @nalimilan has pointed out:

  1. Your @code_llvm and @code_native will show SIMD vectorization is being used.
  2. Even without @simd, you’ll notice the compiler will often use SIMD automatically.

#14

@ChrisRackauckas Thanks. Sounds like something which would be useful as a tip in the manual. We could even write a simpler version of the page I linked to, geared towards users rather than developers.


#15

@ChrisRackauckas’s advice does indeed work, and I get vectorization in some cases. For example, I seem to get vectorization of multiplication, but not of trigonometric functions, like cos, though the docs seem to indicate that they could be vectorized too.

How can I find out which operations should or could be SIMD’ed?
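For reference, one pragmatic way to probe individual cases is to search the generated IR for vector types (a sketch; vectorizes is a made-up helper, and it relies on the IO method of code_llvm):

```julia
using InteractiveUtils  # provides code_llvm on Julia >= 0.7; in Base on 0.6

# Heuristic check: capture the LLVM IR as a string and look for SIMD vector
# types such as <2 x double>, <4 x double>, <4 x i32>, etc.
function vectorizes(f, argtypes::Type{<:Tuple})
    ir = sprint(code_llvm, f, argtypes)
    return any(occursin("<$n x ", ir) for n in (2, 4, 8, 16))
end
```

Functions like cos generally only vectorize if LLVM can map them to a vector math library, so a false result there is expected even when plain arithmetic loops vectorize.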


#16

I have installed Julia 0.5 using sudo apt-get install in Ubuntu (Julia is in /usr/bin/julia). I can’t locate the file build_sysimg.jl. Because of that, joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl") gives me an error.


#17

Use the DATAROOTDIR version of the path from the devdocs. Hardcoding “share” and assuming the relative paths will always be the same is incorrect for some methods of installing Julia. You could also use the generic Linux binaries instead of the PPA, since the latter is unmaintained.


#18

@tkelman could you share a more robust version of the path? Would it be

joinpath(dirname(JULIA_HOME), DATAROOTDIR, "julia", "build_sysimg.jl")

I get DATAROOTDIR is undefined.


#19
include(joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "build_sysimg.jl"))