Did you replace `@turbo` with `@inbounds @simd`, or just `@simd`? `@inbounds @simd` would be SIMD, but `@simd` by itself would not be in this case.
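To illustrate the distinction, here is a minimal sketch using the same indirect-indexing pattern as `testSum` below (the function names here are made up for illustration):

```julia
# With indirect indexing, the bounds checks on A cannot be proven
# redundant, so @simd alone leaves branches in the loop body.
function f_simd_only(A, indices)
    s = 0.0
    @simd for i in eachindex(indices)
        s += A[indices[i]]   # bounds checks remain -> branches, no SIMD
    end
    return s
end

# @inbounds removes the bounds checks, letting @simd vectorize the loop.
function f_inbounds_simd(A, indices)
    s = 0.0
    @inbounds @simd for i in eachindex(indices)
        s += A[indices[i]]   # no bounds checks -> vectorizable (via gathers)
    end
    return s
end
```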
You can use `code_llvm` or Cthulhu.jl to inspect the generated code and see whether it was vectorized.
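For example (a sketch; assumes `indices` and `Vectors` are defined as in the benchmarks below):

```julia
using InteractiveUtils  # provides @code_llvm; loaded automatically in the REPL

# Print the optimized LLVM IR. Vectorized code shows <8 x double>-style
# vector types; scalar code shows branches to bounds-error blocks instead.
@code_llvm debuginfo=:none testSum(indices, Vectors)
```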
Here is what I get; the loop for `testSum` is filled with branches and is not SIMD:

```llvm
L17: ; preds = %idxend13, %L17.preheader
%value_phi3 = phi i64 [ %65, %idxend13 ], [ 1, %L17.preheader ]
%value_phi5 = phi double [ %64, %idxend13 ], [ 0.000000e+00, %L17.preheader ]
%exitcond.not = icmp eq i64 %value_phi3, %39
br i1 %exitcond.not, label %oob, label %idxend
L40: ; preds = %idxend13, %top
%value_phi17 = phi double [ 0.000000e+00, %top ], [ %64, %idxend13 ]
ret double %value_phi17
oob: ; preds = %L17
%40 = alloca i64, align 8
store i64 %39, i64* %40, align 8
call void @jl_bounds_error_ints({}* %0, i64* nonnull %40, i64 1)
unreachable
idxend: ; preds = %L17
%41 = add nsw i64 %value_phi3, -1
%42 = getelementptr inbounds i64, i64* %18, i64 %41
%43 = load i64, i64* %42, align 8
%44 = add i64 %43, -1
%45 = icmp ult i64 %44, %21
br i1 %45, label %idxend7, label %oob6
oob6: ; preds = %idxend
%46 = alloca i64, align 8
store i64 %43, i64* %46, align 8
call void @jl_bounds_error_ints({}* %3, i64* nonnull %46, i64 1)
unreachable
idxend7: ; preds = %idxend
%47 = icmp ult i64 %44, %24
br i1 %47, label %idxend9, label %oob8
oob8: ; preds = %idxend7
%48 = alloca i64, align 8
store i64 %43, i64* %48, align 8
call void @jl_bounds_error_ints({}* %5, i64* nonnull %48, i64 1)
unreachable
idxend9: ; preds = %idxend7
%49 = getelementptr inbounds double, double* %26, i64 %44
%50 = load double, double* %49, align 8
%51 = getelementptr inbounds double, double* %28, i64 %44
%52 = load double, double* %51, align 8
%53 = fmul double %50, %52
%54 = icmp ult i64 %44, %31
br i1 %54, label %idxend11, label %oob10
oob10: ; preds = %idxend9
%55 = alloca i64, align 8
store i64 %43, i64* %55, align 8
call void @jl_bounds_error_ints({}* %7, i64* nonnull %55, i64 1)
unreachable
idxend11: ; preds = %idxend9
%56 = icmp ult i64 %44, %34
br i1 %56, label %idxend13, label %oob12
oob12: ; preds = %idxend11
%57 = alloca i64, align 8
store i64 %43, i64* %57, align 8
call void @jl_bounds_error_ints({}* %9, i64* nonnull %57, i64 1)
unreachable
idxend13: ; preds = %idxend11
%58 = getelementptr inbounds double, double* %36, i64 %44
%59 = load double, double* %58, align 8
%60 = getelementptr inbounds double, double* %38, i64 %44
%61 = load double, double* %60, align 8
%62 = fmul double %59, %61
%63 = fadd double %53, %62
%64 = fadd double %value_phi5, %63
%.not.not20 = icmp eq i64 %value_phi3, %13
%65 = add nuw nsw i64 %value_phi3, 1
br i1 %.not.not20, label %L40, label %L17
}
```
The branches are all for the bounds checks. This is `testSum_avx`:

```llvm
L37: ; preds = %L37, %top
%value_phi464 = phi i64 [ %res.i30, %L37 ], [ 0, %top ]
%value_phi63 = phi <8 x double> [ %res.i31, %L37 ], [ zeroinitializer, %top ]
%res.i48 = shl nsw i64 %value_phi464, 3
%ptr.1.i45 = getelementptr inbounds i8, i8* %18, i64 %res.i48
%ptr.2.i46 = bitcast i8* %ptr.1.i45 to <8 x i64>*
%res.i47 = load <8 x i64>, <8 x i64>* %ptr.2.i46, align 8
%ptr.1.i43 = getelementptr inbounds double, double* %ptr.1.i59, <8 x i64> %res.i47
%res.i44 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i43, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
%ptr.1.i40 = getelementptr inbounds double, double* %ptr.1.i56, <8 x i64> %res.i47
%res.i41 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i40, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
%ptr.1.i37 = getelementptr inbounds double, double* %ptr.1.i53, <8 x i64> %res.i47
%res.i38 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i37, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
%ptr.1.i34 = getelementptr inbounds double, double* %ptr.1.i50, <8 x i64> %res.i47
%res.i35 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i34, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
%res.i32 = call reassoc nsz arcp contract afn <8 x double> @llvm.fma.v8f64(<8 x double> %res.i38, <8 x double> %res.i35, <8 x double> %value_phi63)
%res.i31 = call reassoc nsz arcp contract afn <8 x double> @llvm.fma.v8f64(<8 x double> %res.i44, <8 x double> %res.i41, <8 x double> %res.i32)
%res.i30 = add nuw nsw i64 %value_phi464, 8
%.not = icmp eq i64 %res.i30, %24
br i1 %.not, label %L47, label %L37
```

Highlights:

```llvm
%res.i47 = load <8 x i64>, <8 x i64>* %ptr.2.i46, align 8
```

This loads an `<8 x i64>` (a SIMD vector) from `indices`. Loading 8 elements in a row like this is fast.
Then there are four instances that look like:

```llvm
%ptr.1.i43 = getelementptr inbounds double, double* %ptr.1.i59, <8 x i64> %res.i47
%res.i44 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i43, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
```

That is, it takes a base pointer to the array and uses `getelementptr` to index into it with the vector `%res.i47`, which returns a vector of pointers. It then uses a `gather` to load a vector of `<8 x double>`.
`gather` is very slow. It varies by CPU, but gathering X elements tends to take about as long as loading X elements one by one, meaning that if the slowest part of your code is the `gather`, it will be hard to get any speedup at all.
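One rough way to see that cost is to compare summing an array contiguously against summing it through a shuffled index vector (a sketch; the names and sizes are made up, and the exact ratio varies by CPU):

```julia
using BenchmarkTools, Random

A = rand(1024)
perm = randperm(1024)            # fully shuffled indices force gathers

sum_contiguous(A) = sum(A)       # contiguous vector loads

function sum_gathered(A, perm)
    s = 0.0
    @inbounds @simd for i in eachindex(perm)
        s += A[perm[i]]          # indirect access -> gather instructions
    end
    return s
end

@btime sum_contiguous($A)
@btime sum_gathered($A, $perm)
```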
On my laptop, I get:

```julia
julia> versioninfo()
Julia Version 1.6.2-pre.0
Commit dd122918ce* (2021-04-23 21:21 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

julia> function testSum_inbounds(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           @inbounds for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum_inbounds (generic function with 1 method)

julia> function testSum_avx_u2(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           @turbo unroll=2 for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum_avx_u2 (generic function with 1 method)

julia> function testSum_avx(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           @turbo for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum_avx (generic function with 1 method)

julia> function testSum_simd(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           @inbounds @simd for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum_simd (generic function with 1 method)

julia> function testSum(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum (generic function with 1 method)

julia> @btime testSum_inbounds($indices, $Vectors)
  421.201 ns (0 allocations: 0 bytes)
247.660070648417

julia> @btime testSum_avx($indices, $Vectors)
  304.891 ns (0 allocations: 0 bytes)
247.66007064841654

julia> @btime testSum_avx_u2($indices, $Vectors)
  299.310 ns (0 allocations: 0 bytes)
247.66007064841654

julia> @btime testSum_simd($indices, $Vectors)
  295.878 ns (0 allocations: 0 bytes)
247.66007064841656

julia> @btime testSum($indices, $Vectors)
  487.128 ns (0 allocations: 0 bytes)
247.660070648417

julia> function testSum_avx_u4(indices, Vectors)
           A, B, C, D = Vectors
           sum = 0.0
           @turbo unroll=4 for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] + C[k] * D[k]
           end
           return sum
       end
testSum_avx_u4 (generic function with 1 method)

julia> @btime testSum_avx_u4($indices, $Vectors)
  287.645 ns (0 allocations: 0 bytes)
247.66007064841654
```

So it looks like LoopVectorization might not be unrolling enough here. By default it chooses not to unroll at all, which yields a slower time than forcing it to unroll by 4. `@inbounds @simd` does unroll by 4, and that choice looks a bit better here.
@gustaphe LoopVectorization only supports `AbstractRange{<:Integer}` iterators, meaning the former works with `@turbo` but the latter does not.
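Assuming "the former" and "the latter" refer to looping over a range versus looping over the vector's elements directly (my reading of the question), the restriction looks like this:

```julia
using LoopVectorization

# Supported: eachindex(indices) is an AbstractRange{<:Integer}.
function sum_range(A, indices)
    s = 0.0
    @turbo for i in eachindex(indices)
        s += A[indices[i]]
    end
    return s
end

# Not supported: `indices` is a Vector, not a range, so @turbo rejects it.
function sum_elements(A, indices)
    s = 0.0
    @turbo for k in indices
        s += A[k]
    end
    return s
end
```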