Help understanding vectorization (or lack thereof)

Hello,

I have these simple functions, one with @simd and one without:

@noinline function msum(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        @inbounds s += x[i]
    end
    s
end

@noinline function vsum(x)
    s = zero(eltype(x))
    @simd for i in 1:length(x)
        @inbounds s += x[i]
    end
    s
end

Now I test for vectorization. On vectors, there is NO VECTORIZATION (ratio of times ≈ 1):

[(x=rand(10^i);t1=@elapsed s1=msum(x);t2=@elapsed s2=vsum(x);(t1,t2,t1/t2,abs(s1-s2)/s1)) for i=4:8]
 (9.186e-6, 9.159e-6, 1.002947920078611, 0.0)       
 (9.1058e-5, 9.0979e-5, 1.0008683322524978, 0.0)    
 (0.000911158, 0.000910397, 1.000835899063815, 0.0) 
 (0.01053129, 0.01044051, 1.0086949775442005, 0.0)  
 (0.104159249, 0.104223552, 0.9993830281278457, 0.0)

Now, if x = rand(10^i, 1) (same number of elements, but a Matrix rather than a Vector):

[(x=rand(10^i,1);t1=@elapsed s1=msum(x);t2=@elapsed s2=vsum(x);(t1,t2,t1/t2,abs(s1-s2)/s1)) for i=4:8]
 (9.186e-6, 3.848e-6, 2.387214137214137, 1.0953193524755403e-15)       
 (9.1024e-5, 2.7342e-5, 3.3290907760953843, 3.783505595684507e-15)     
 (0.00091261, 0.000277669, 3.286683065088289, 3.2265902992163964e-14)  
 (0.010525071, 0.005083212, 2.0705551922681957, 1.2906940917038317e-13)
 (0.104183433, 0.051653709, 2.0169593823359326, 4.5443220856290344e-14)

The loop vectorizes and we obtain a speedup.

Can someone help me understand what is going on?

julia> versioninfo()
Julia Version 0.7.0-alpha.0
Commit 22590d529d (2018-05-31 00:07 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)

You seem to be benchmarking in global scope. Use BenchmarkTools.jl for these kinds of microbenchmarks.
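For example, a minimal sketch with BenchmarkTools.jl (mysum here is a stand-in name for the functions above; interpolating the argument with $ avoids timing the global-variable lookup):

```julia
using BenchmarkTools

# Stand-in for the msum/vsum functions defined in the post
mysum(x) = sum(x)

x = rand(10^6)
@btime mysum($x)  # $x interpolates the global, so only the call itself is timed
```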

It seems what is happening is that your msum IS automatically vectorizing for vector inputs; that is why you don’t see a speed-up compared to vsum: they are both vectorized. The timings you got also suggest that.

EDIT: I misread the results; the timings do not suggest what I thought.


That is incorrect. Notice that the two examples are summing vectors of the same length (one masquerading as an n×1 array), and you get a speedup factor of two on the array but not on the vector.

That does not change the results; I tried it.

I cannot reproduce (neither on master nor 0.6).

julia> N=10_000;v=rand(N); vm=rand(N,1); v32=rand(Float32,N); v32m=rand(Float32,N,1);
julia> begin
       println("F64 vec")
       @btime msum($v);
       @btime vsum($v);
       println("F64 mat")
       @btime msum($vm);
       @btime vsum($vm);
       println("F32vec")
       @btime msum($v32);
       @btime vsum($v32);
        println("F32mat")
        @btime msum($v32m);
        @btime vsum($v32m);
      end;
F64 vec
  13.113 μs (0 allocations: 0 bytes)
  1.150 μs (0 allocations: 0 bytes)
F64 mat
  13.113 μs (0 allocations: 0 bytes)
  1.152 μs (0 allocations: 0 bytes)
F32vec
  13.111 μs (0 allocations: 0 bytes)
  592.989 ns (0 allocations: 0 bytes)
F32mat
  13.113 μs (0 allocations: 0 bytes)
  596.453 ns (0 allocations: 0 bytes)

On my system (0.7-alpha):

F64 vec
  9.114 μs (0 allocations: 0 bytes)
  9.120 μs (0 allocations: 0 bytes)
F64 mat
  9.113 μs (0 allocations: 0 bytes)
  1.801 μs (0 allocations: 0 bytes)
F32vec
  9.118 μs (0 allocations: 0 bytes)
  9.115 μs (0 allocations: 0 bytes)
F32mat
  9.112 μs (0 allocations: 0 bytes)
  908.525 ns (0 allocations: 0 bytes)

Could this be system dependent?

Are you sure?

@noinline function msum(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        @inbounds s += x[i]
    end
    return s
end

@noinline function vsum(x)
    s = zero(eltype(x))
    @simd for i in 1:length(x)
        @inbounds s += x[i]
    end
    return s
end

r = rand(10^6);
r_ = reshape(r, :, 1);

@btime msum($r)
@btime vsum($r)

@btime msum($r_)
@btime vsum($r_)

# output
861.046 μs (0 allocations: 0 bytes)
260.249 μs (0 allocations: 0 bytes)
861.430 μs (0 allocations: 0 bytes)
259.803 μs (0 allocations: 0 bytes)

Edit: How strange.

Strange indeed! I get different results… Does anyone have a hint on how this can be system dependent?

The best way to check for vectorization is to look at the output of @code_llvm.
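If you want to check many cases at once, here is a hedged sketch that captures the IR and greps it for SIMD vector types (looks_vectorized is a hypothetical helper name; on 0.7+ code_llvm lives in InteractiveUtils):

```julia
using InteractiveUtils  # provides code_llvm on Julia 0.7+

# Hypothetical helper: write the LLVM IR to a buffer and look for
# vector types such as <2 x double> or <4 x double>.
function looks_vectorized(f, argtypes)
    buf = IOBuffer()
    code_llvm(buf, f, argtypes)
    occursin(r"<\d+ x double>", String(take!(buf)))
end

looks_vectorized(sum, Tuple{Vector{Float64}})
```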

This is also system dependent because you need to be compiling for an architecture that has the necessary instruction set.

So for reference, here’s my versioninfo():

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

And here’s what I get from @code_llvm

julia> @code_llvm msum(r)

define double @julia_msum_62759(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  %1 = getelementptr inbounds i8*, i8** %0, i64 1
  %2 = bitcast i8** %1 to i64*
  %3 = load i64, i64* %2, align 8
  %4 = icmp slt i64 %3, 1
  br i1 %4, label %L18, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %5 = bitcast i8** %0 to double**
  %6 = load double*, double** %5, align 8
  br label %if

if:                                               ; preds = %if.lr.ph, %if
  %s.03 = phi double [ 0.000000e+00, %if.lr.ph ], [ %11, %if ]
  %"#temp#.02" = phi i64 [ 1, %if.lr.ph ], [ %7, %if ]
  %7 = add i64 %"#temp#.02", 1
  %8 = add i64 %"#temp#.02", -1
  %9 = getelementptr double, double* %6, i64 %8
  %10 = load double, double* %9, align 8
  %11 = fadd double %s.03, %10
  %12 = icmp eq i64 %"#temp#.02", %3
  br i1 %12, label %L18.loopexit, label %if

L18.loopexit:                                     ; preds = %if
  br label %L18

L18:                                              ; preds = %L18.loopexit, %top
  %s.0.lcssa = phi double [ 0.000000e+00, %top ], [ %11, %L18.loopexit ]
  ret double %s.0.lcssa
}

This shows that msum is not being vectorized. Compare this with the output of

julia> @code_llvm vsum(r)
...
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <2 x double> [ %18, %vector.ph ], [ %23, %vector.body ]
  %vec.phi122 = phi <2 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
  %19 = getelementptr double, double* %14, i64 %index
  %20 = bitcast double* %19 to <2 x double>*
  %wide.load = load <2 x double>, <2 x double>* %20, align 8
  %21 = getelementptr double, double* %19, i64 2
  %22 = bitcast double* %21 to <2 x double>*
  %wide.load124 = load <2 x double>, <2 x double>* %22, align 8
  %23 = fadd fast <2 x double> %vec.phi, %wide.load
  %24 = fadd fast <2 x double> %vec.phi122, %wide.load124
  %index.next = add i64 %index, 4
  %25 = icmp eq i64 %index.next, %n.vec
  br i1 %25, label %middle.block, label %vector.body
...

Notice the presence of lines like %wide.load = load <2 x double>, <2 x double>* %20, align 8. This is the clearest indication that vsum vectorized successfully. The timings on my machine are also basically the same as @DNF’s.

So what do you get from the output of @code_llvm on your machine?

I see the vectorization instructions for vsum(Array), but none for vsum(Vector).

For reference:

julia> @code_llvm vsum(rand(10))

; Function vsum
; Location: REPL[32]:2
define double @julia_vsum_36461(%jl_value_t addrspace(10)* nonnull dereferenceable(40)) {
top:
; Location: REPL[32]:3
; Function macro expansion; {
; Location: simdloop.jl:65
; Function length; {
; Location: array.jl:174
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
  %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, i64 0, i32 1
  %4 = load i64, i64 addrspace(11)* %3, align 8
;}
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:185
; Function unitrange_last; {
; Location: range.jl:190
; Function >=; {
; Location: operators.jl:330
; Function <=; {
; Location: int.jl:419
  %5 = icmp sgt i64 %4, 0
;}}
  %6 = select i1 %5, i64 %4, i64 0
;}}}
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function _length; {
; Location: abstractarray.jl:157
; Function axes; {
; Location: abstractarray.jl:83
; Function size; {
; Location: range.jl:350
; Function length; {
; Location: range.jl:411
; Function checked_sub; {
; Location: checked.jl:226
; Function sub_with_overflow; {
; Location: checked.jl:198
  %7 = add nsw i64 %6, -1
;}}
; Function checked_add; {
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %8 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %7, i64 1)
  %9 = extractvalue { i64, i1 } %8, 1
;}
; Location: checked.jl:170
  br i1 %9, label %L28, label %L33.lr.ph

L33.lr.ph:                                        ; preds = %top
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %10 = extractvalue { i64, i1 } %8, 0
  %11 = icmp sgt i64 %10, 0
;}
; Location: checked.jl:170
  br i1 %11, label %L45.lr.ph.us.us, label %L100

L45.lr.ph.us.us:                                  ; preds = %L33.lr.ph
  %12 = bitcast %jl_value_t addrspace(11)* %1 to double* addrspace(11)*
;}}}}}}
; Location: simdloop.jl:71
  br label %L61.us.us

L61.us.us:                                        ; preds = %L45.lr.ph.us.us, %L61.us.us
  %value_phi664.us.us = phi i64 [ 0, %L45.lr.ph.us.us ], [ %17, %L61.us.us ]
  %value_phi563.us.us = phi double [ 0.000000e+00, %L45.lr.ph.us.us ], [ %16, %L61.us.us ]
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: REPL[32]:4
; Function getindex; {
; Location: array.jl:707
  %13 = load double*, double* addrspace(11)* %12, align 8
  %14 = getelementptr double, double* %13, i64 %value_phi664.us.us
  %15 = load double, double* %14, align 8
;}
; Function +; {
; Location: float.jl:394
  %16 = fadd fast double %value_phi563.us.us, %15
;}}
; Location: simdloop.jl:74
; Function +; {
; Location: int.jl:53
  %17 = add nuw nsw i64 %value_phi664.us.us, 1
;}
; Location: simdloop.jl:71
; Function <; {
; Location: int.jl:49
  %18 = icmp ult i64 %17, %10
;}
  br i1 %18, label %L61.us.us, label %L100

L28:                                              ; preds = %top
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function _length; {
; Location: abstractarray.jl:157
; Function axes; {
; Location: abstractarray.jl:83
; Function size; {
; Location: range.jl:350
; Function length; {
; Location: range.jl:411
; Function checked_add; {
; Location: checked.jl:170
  call void @julia_throw_overflowerr_binaryop_24633(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 47979298915040 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %7, i64 1)
  call void @llvm.trap()
  unreachable

L100:                                             ; preds = %L61.us.us, %L33.lr.ph
  %value_phi10.lcssa = phi double [ 0.000000e+00, %L33.lr.ph ], [ %16, %L61.us.us ]
;}}}}}}}
; Location: REPL[32]:6
  ret double %value_phi10.lcssa
julia> @code_llvm vsum(rand(10,1))

; Function vsum
; Location: REPL[32]:2
define double @julia_vsum_36468(%jl_value_t addrspace(10)* nonnull dereferenceable(40)) {
top:
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
  %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, i64 0, i32 1
  %4 = load i64, i64 addrspace(11)* %3, align 8
; Location: REPL[32]:3
; Function macro expansion; {
; Location: simdloop.jl:65
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:185
; Function unitrange_last; {
; Location: range.jl:190
; Function >=; {
; Location: operators.jl:330
; Function <=; {
; Location: int.jl:419
  %5 = icmp sgt i64 %4, 0
;}}
  %6 = select i1 %5, i64 %4, i64 0
;}}}
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function _length; {
; Location: abstractarray.jl:157
; Function axes; {
; Location: abstractarray.jl:83
; Function size; {
; Location: range.jl:350
; Function length; {
; Location: range.jl:411
; Function checked_sub; {
; Location: checked.jl:226
; Function sub_with_overflow; {
; Location: checked.jl:198
  %7 = add nsw i64 %6, -1
;}}
; Function checked_add; {
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %8 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %7, i64 1)
  %9 = extractvalue { i64, i1 } %8, 1
;}
; Location: checked.jl:170
  br i1 %9, label %L28, label %L33.lr.ph

L33.lr.ph:                                        ; preds = %top
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %10 = extractvalue { i64, i1 } %8, 0
  %11 = icmp sgt i64 %10, 0
;}
; Location: checked.jl:170
  br i1 %11, label %L45.lr.ph.us.us, label %L100

L45.lr.ph.us.us:                                  ; preds = %L33.lr.ph
  %12 = bitcast %jl_value_t addrspace(11)* %1 to double* addrspace(11)*
  %13 = load double*, double* addrspace(11)* %12, align 8
;}}}}}}
; Location: simdloop.jl:71
  %min.iters.check = icmp ult i64 %6, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L45.lr.ph.us.us
  %n.vec = and i64 %6, 9223372036854775792
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
; Location: simdloop.jl:74
; Function +; {
; Location: int.jl:53
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %22, %vector.body ]
  %vec.phi91 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %23, %vector.body ]
  %vec.phi92 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
  %vec.phi93 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %25, %vector.body ]
;}
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: REPL[32]:4
; Function getindex; {
; Location: array.jl:707
  %14 = getelementptr double, double* %13, i64 %index
  %15 = bitcast double* %14 to <4 x double>*
  %wide.load = load <4 x double>, <4 x double>* %15, align 8
  %16 = getelementptr double, double* %14, i64 4
  %17 = bitcast double* %16 to <4 x double>*
  %wide.load94 = load <4 x double>, <4 x double>* %17, align 8
  %18 = getelementptr double, double* %14, i64 8
  %19 = bitcast double* %18 to <4 x double>*
  %wide.load95 = load <4 x double>, <4 x double>* %19, align 8
  %20 = getelementptr double, double* %14, i64 12
  %21 = bitcast double* %20 to <4 x double>*
  %wide.load96 = load <4 x double>, <4 x double>* %21, align 8
;}
; Function +; {
; Location: float.jl:394
  %22 = fadd fast <4 x double> %vec.phi, %wide.load
  %23 = fadd fast <4 x double> %vec.phi91, %wide.load94
  %24 = fadd fast <4 x double> %vec.phi92, %wide.load95
  %25 = fadd fast <4 x double> %vec.phi93, %wide.load96
;}}
; Location: simdloop.jl:74
; Function +; {
; Location: int.jl:53
  %index.next = add i64 %index, 16
  %26 = icmp eq i64 %index.next, %n.vec
  br i1 %26, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
;}
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: REPL[32]:4
; Function +; {
; Location: float.jl:394
  %bin.rdx = fadd fast <4 x double> %23, %22
  %bin.rdx97 = fadd fast <4 x double> %24, %bin.rdx
  %bin.rdx98 = fadd fast <4 x double> %25, %bin.rdx97
  %rdx.shuf = shufflevector <4 x double> %bin.rdx98, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx99 = fadd fast <4 x double> %bin.rdx98, %rdx.shuf
  %rdx.shuf100 = shufflevector <4 x double> %bin.rdx99, <4 x double> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx101 = fadd fast <4 x double> %bin.rdx99, %rdx.shuf100
  %27 = extractelement <4 x double> %bin.rdx101, i32 0
  %cmp.n = icmp eq i64 %6, %n.vec
;}}
; Location: simdloop.jl:71
  br i1 %cmp.n, label %L100, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L45.lr.ph.us.us
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L45.lr.ph.us.us ]
  %bc.merge.rdx = phi double [ %27, %middle.block ], [ 0.000000e+00, %L45.lr.ph.us.us ]
  br label %L61.us.us

L61.us.us:                                        ; preds = %scalar.ph, %L61.us.us
  %value_phi665.us.us = phi i64 [ %bc.resume.val, %scalar.ph ], [ %31, %L61.us.us ]
  %value_phi564.us.us = phi double [ %bc.merge.rdx, %scalar.ph ], [ %30, %L61.us.us ]
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: REPL[32]:4
; Function getindex; {
; Location: array.jl:707
  %28 = getelementptr double, double* %13, i64 %value_phi665.us.us
  %29 = load double, double* %28, align 8
;}
; Function +; {
; Location: float.jl:394
  %30 = fadd fast double %value_phi564.us.us, %29
;}}
; Location: simdloop.jl:74
; Function +; {
; Location: int.jl:53
  %31 = add nuw nsw i64 %value_phi665.us.us, 1
;}
; Location: simdloop.jl:71
; Function <; {
; Location: int.jl:49
  %32 = icmp ult i64 %31, %10
;}
  br i1 %32, label %L61.us.us, label %L100

L28:                                              ; preds = %top
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function _length; {
; Location: abstractarray.jl:157
; Function axes; {
; Location: abstractarray.jl:83
; Function size; {
; Location: range.jl:350
; Function length; {
; Location: range.jl:411
; Function checked_add; {
; Location: checked.jl:170
  call void @julia_throw_overflowerr_binaryop_24633(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 47979298915040 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %7, i64 1)
  call void @llvm.trap()
  unreachable

L100:                                             ; preds = %L61.us.us, %middle.block, %L33.lr.ph
  %value_phi10.lcssa = phi double [ 0.000000e+00, %L33.lr.ph ], [ %30, %L61.us.us ], [ %27, %middle.block ]
;}}}}}}}
; Location: REPL[32]:6
  ret double %value_phi10.lcssa

Hmm, I don’t know what’s going on there, but it might be worth trying again with -O3. There are some cases that only vectorize with that option.

It’s system dependent, and it requires that your system image has been built for your specific architecture. That happens automatically if you built from source, but not with the generic binaries (otherwise they would segfault on older processors when newer instructions were generated). This is where rebuilding the system image comes into play: it allows the newer instructions to be used.
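On the generic binaries, one knob worth checking is the CPU target passed to the JIT (verify the flag on your Julia version with julia --help; this is only a CLI sketch, not a fix for the precompiled system image itself):

```shell
# Ask Julia's JIT to target the host CPU's full instruction set:
julia -C native

# Equivalent long form:
julia --cpu-target=native
```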

I built a system image (and built from source), but it didn’t help with my use case (I wanted to get AVX-512 on KNL). Moreover, I normally work on a cluster with heterogeneous CPUs and cannot always have a native build.

The exciting thing is that LLVM on v0.7 (distribution binaries) manages to emit optimal code for two different CPUs added remotely :slight_smile:

I was trying to understand why it doesn’t in this case, to “nudge” it that way :wink:

-O3 does not help either.

Try --math-mode=fast, then. It automatically SIMD-ed in both cases for me.