With Int64, I get:
julia> using BenchmarkTools
julia> mat = randn(1000, 1000);
julia> @btime sum(>(1), $mat)
295.130 μs (0 allocations: 0 bytes)
158577
julia> @btime sum(x -> x > 1, $mat)
276.217 μs (0 allocations: 0 bytes)
158577
julia> @btime count(>(1), $mat)
265.832 μs (0 allocations: 0 bytes)
158577
julia> @btime count(x -> x > 1, $mat)
267.103 μs (0 allocations: 0 bytes)
158577
My CPU has AVX512, which can efficiently convert Int64 to Float64 using SIMD instructions. CPUs with AVX2 but not AVX512 can only do this for Int32.
But I don’t think that’s what matters here; the compiler should have no problem hoisting a single conversion out of the loop. That’s what happened on my system anyway:
vector.ph: ; preds = %L68.preheader
%n.vec = and i64 %58, -32
%59 = insertelement <8 x i64> <i64 poison, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0>, i64 %52, i64 0
%broadcast.splatinsert = insertelement <8 x double> poison, double %36, i64 0
%broadcast.splat = shufflevector <8 x double> %broadcast.splatinsert, <8 x double> poison, <8 x i32> zeroinitializer
%broadcast.splatinsert34 = insertelement <8 x i1> poison, i1 %43, i64 0
%broadcast.splat35 = shufflevector <8 x i1> %broadcast.splatinsert34, <8 x i1> poison, <8 x i32> zeroinitializer
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <8 x i64> [ %59, %vector.ph ], [ %89, %vector.body ]
%vec.phi22 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %90, %vector.body ]
%vec.phi23 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %91, %vector.body ]
%vec.phi24 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %92, %vector.body ]
%60 = add i64 %54, %index
%61 = getelementptr inbounds double, double* %29, i64 %60
%62 = bitcast double* %61 to <8 x double>*
%wide.load = load <8 x double>, <8 x double>* %62, align 8
%63 = getelementptr inbounds double, double* %61, i64 8
%64 = bitcast double* %63 to <8 x double>*
%wide.load25 = load <8 x double>, <8 x double>* %64, align 8
%65 = getelementptr inbounds double, double* %61, i64 16
%66 = bitcast double* %65 to <8 x double>*
%wide.load26 = load <8 x double>, <8 x double>* %66, align 8
%67 = getelementptr inbounds double, double* %61, i64 24
%68 = bitcast double* %67 to <8 x double>*
%wide.load27 = load <8 x double>, <8 x double>* %68, align 8
%69 = fcmp ogt <8 x double> %wide.load, %broadcast.splat
%70 = fcmp ogt <8 x double> %wide.load25, %broadcast.splat
%71 = fcmp ogt <8 x double> %wide.load26, %broadcast.splat
%72 = fcmp ogt <8 x double> %wide.load27, %broadcast.splat
%73 = fcmp oeq <8 x double> %wide.load, %broadcast.splat
%74 = fcmp oeq <8 x double> %wide.load25, %broadcast.splat
%75 = fcmp oeq <8 x double> %wide.load26, %broadcast.splat
%76 = fcmp oeq <8 x double> %wide.load27, %broadcast.splat
%77 = and <8 x i1> %73, %broadcast.splat35
%78 = and <8 x i1> %74, %broadcast.splat35
%79 = and <8 x i1> %75, %broadcast.splat35
%80 = and <8 x i1> %76, %broadcast.splat35
%81 = or <8 x i1> %69, %77
%82 = or <8 x i1> %70, %78
%83 = or <8 x i1> %71, %79
%84 = or <8 x i1> %72, %80
%85 = zext <8 x i1> %81 to <8 x i64>
%86 = zext <8 x i1> %82 to <8 x i64>
%87 = zext <8 x i1> %83 to <8 x i64>
%88 = zext <8 x i1> %84 to <8 x i64>
%89 = add <8 x i64> %vec.phi, %85
%90 = add <8 x i64> %vec.phi22, %86
%91 = add <8 x i64> %vec.phi23, %87
%92 = add <8 x i64> %vec.phi24, %88
%index.next = add nuw i64 %index, 32
%93 = icmp eq i64 %index.next, %n.vec
br i1 %93, label %middle.block, label %vector.body
The x -> x > 1 version was still faster for me.
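If you want to compare the two loop bodies directly, IR like the dump above can be generated with @code_llvm on a small hand-written kernel (just a sketch — mycount below is a hypothetical stand-in for whatever count/sum actually lower to, not the method that produced the dump above):

using InteractiveUtils  # for @code_llvm outside the REPL

# Hypothetical kernel standing in for count(pred, A) / sum(pred, A).
function mycount(pred, A)
    n = 0
    @inbounds @simd for i in eachindex(A)
        n += pred(A[i])
    end
    return n
end

mat = randn(1000, 1000);

# Runtime Int64 threshold (a Base.Fix2 field) vs. a literal 1 in the closure:
@code_llvm debuginfo=:none mycount(>(1), mat)
@code_llvm debuginfo=:none mycount(x -> x > 1, mat)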
Seeing that the >(1) version has a lot of extra operations, maybe having a runtime value is forcing it to do a lot of extra checks. The extra fcmp oeq and and instructions look like the handling needed to keep the comparison exact when a runtime Int64 threshold doesn’t convert exactly to Float64; with a literal 1, those checks can presumably constant-fold away.
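To make the “floating point weirdness” concrete: Julia’s Int64-vs-Float64 comparisons are exact, so when the threshold doesn’t round-trip through Float64, naive convert-then-compare can give a different answer. A small illustration (values chosen so the Int64→Float64 conversion rounds up):

t = 2^53 + 3                 # Int64 whose Float64 conversion rounds up to 2^53 + 4
x = 9.007199254740996e15     # the Float64 value 2^53 + 4, i.e. exactly Float64(t)
x > t                        # true:  the mixed Int/Float comparison is exact
x > Float64(t)               # false: converting the threshold first changes the answer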
julia> @btime sum(Base.Fix2(@fastmath(>),1), $mat)
272.517 μs (0 allocations: 0 bytes)
158577
julia> @btime sum(>(1), $mat)
294.637 μs (0 allocations: 0 bytes)
158577
julia> @btime sum(x -> x > 1, $mat)
278.829 μs (0 allocations: 0 bytes)
158577
julia> @btime sum(x -> @fastmath(x > 1), $mat)
274.547 μs (0 allocations: 0 bytes)
158577
@fastmath helps.
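Another way to sidestep those checks without @fastmath (untested here, just a sketch) should be to make the threshold a Float64 up front, so the loop only ever does a Float64-vs-Float64 comparison:

# 1 converts exactly to 1.0, so the counts are identical,
# but no Int64-to-Float64 exactness handling is needed in the loop.
count(>(1.0), mat)
sum(x -> x > 1.0, mat)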