I would try replacing some of the divisions with multiplications like I did. In particular, replace:
A.cedg ./= (sum(A.cedg) / (1.0 - A.C[h]))
with
A.cedg .*= ((1.0-A.C[h]) / sum(A.cedg))
Multiplication is much faster than division.
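As a quick sanity check (a minimal sketch, not your actual code), you can microbenchmark the two forms of the scaling directly:

using BenchmarkTools
x = rand(1000); s = sum(x)
r = 1.0 / s        # precompute the reciprocal once
y = similar(x)
@btime $y .= $x ./ $s;  # one division per element
@btime $y .= $x .* $r;  # one multiplication per element; typically noticeably faster

The two forms can differ in the last bit or two, since x * (1/s) rounds differently than x / s, but that is usually acceptable.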
As for vectorization, I don’t believe the version you tried was vectorized.
You can try moving that loop to its own function and microbenchmarking it.
Here, to focus on vectorization, I delete the branches so we can more simply compare vectorized and unvectorized versions:
using BenchmarkTools, LoopVectorization, SLEEF
θ = randn(1000); c = randn(1000);
function sumsc_vectorized(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @vvectorize for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = sincos(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
function sumsc_serial(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @inbounds for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = sincos(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
function sumsc_sleef(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @inbounds @simd for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = SLEEF.sincos_fast(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
@btime sumsc_serial($θ, $c)
@btime sumsc_sleef($θ, $c)
@btime sumsc_vectorized($θ, $c)
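(The $-interpolation in those calls makes BenchmarkTools treat θ and c as local values instead of untyped globals, so the timings reflect the loops themselves rather than dynamic-dispatch overhead.)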
This yielded
julia> @btime sumsc_serial($θ, $c)
15.532 μs (0 allocations: 0 bytes)
(-15.810428689848422, 21.21581166741435)
julia> @btime sumsc_sleef($θ, $c)
23.051 μs (0 allocations: 0 bytes)
(-15.810428689848425, 21.215811667414357)
julia> @btime sumsc_vectorized($θ, $c)
4.764 μs (0 allocations: 0 bytes)
(-15.810428689848425, 21.21581166741438)
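For scale (this is just my arithmetic on the numbers above, not a new measurement): 15.532 μs over 1000 elements is about 15.5 ns per iteration for the serial version, versus about 4.8 ns vectorized, a roughly 3.3x speedup. That is close to the ideal 4x you'd expect from 4-wide vectors, with the remainder presumably lost to the scalar tail and the final reduction.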
Just from these times, you can tell that SLEEF did not vectorize. But to be sure, you can look at the LLVM IR:
julia> @code_llvm debuginfo=:none sumsc_sleef(θ, c)

define void @julia_sumsc_sleef_19233([2 x double]* noalias nocapture sret, %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 3
  %4 = alloca <2 x double>, align 16
  %tmpcast = bitcast <2 x double>* %4 to [2 x double]*
  %5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %6 = bitcast %jl_value_t addrspace(11)* %5 to %jl_value_t addrspace(10)* addrspace(11)*
  %7 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %6, i64 3
  %8 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %7 to i64 addrspace(11)*
  %9 = load i64, i64 addrspace(11)* %8, align 8
  %10 = icmp sgt i64 %9, 0
  %11 = select i1 %10, i64 %9, i64 0
  %12 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %13 = bitcast %jl_value_t addrspace(11)* %12 to %jl_value_t addrspace(10)* addrspace(11)*
  %14 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %13, i64 3
  %15 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %14 to i64 addrspace(11)*
  %16 = load i64, i64 addrspace(11)* %15, align 8
  %17 = icmp sgt i64 %16, 0
  %18 = select i1 %17, i64 %16, i64 0
  %19 = icmp eq i64 %11, %18
  br i1 %19, label %L17, label %L13

L13:                                              ; preds = %top
  %20 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 0
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139815113578320 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %20
  %21 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 1
  store %jl_value_t addrspace(10)* %1, %jl_value_t addrspace(10)** %21
  %22 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 2
  store %jl_value_t addrspace(10)* %2, %jl_value_t addrspace(10)** %22
  %23 = call nonnull %jl_value_t addrspace(10)* @jl_invoke(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139815178141088 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %3, i32 3, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139812628328400 to %jl_value_t*) to %jl_value_t addrspace(10)*))
  call void @llvm.trap()
  unreachable

L17:                                              ; preds = %top
  br i1 %10, label %L25.lr.ph, label %L52

L25.lr.ph:                                        ; preds = %L17
  %24 = bitcast %jl_value_t addrspace(11)* %5 to double addrspace(13)* addrspace(11)*
  %25 = bitcast %jl_value_t addrspace(11)* %12 to double addrspace(13)* addrspace(11)*
  br label %L25

L25:                                              ; preds = %L25.lr.ph, %L25
  %value_phi217 = phi i64 [ 0, %L25.lr.ph ], [ %38, %L25 ]
  %26 = phi <2 x double> [ zeroinitializer, %L25.lr.ph ], [ %37, %L25 ]
  %27 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %24, align 8
  %28 = getelementptr inbounds double, double addrspace(13)* %27, i64 %value_phi217
  %29 = load double, double addrspace(13)* %28, align 8
  call void @julia_sincos_fast_19234([2 x double]* noalias nocapture nonnull sret %tmpcast, double %29)
  %30 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %25, align 8
  %31 = getelementptr inbounds double, double addrspace(13)* %30, i64 %value_phi217
  %32 = load double, double addrspace(13)* %31, align 8
  %33 = load <2 x double>, <2 x double>* %4, align 16
  %34 = insertelement <2 x double> undef, double %32, i32 0
  %35 = shufflevector <2 x double> %34, <2 x double> undef, <2 x i32> zeroinitializer
  %36 = fmul contract <2 x double> %35, %33
  %37 = fadd fast <2 x double> %26, %36
  %38 = add nuw nsw i64 %value_phi217, 1
  %39 = icmp ult i64 %38, %11
  br i1 %39, label %L25, label %L52

L52:                                              ; preds = %L25, %L17
  %40 = phi <2 x double> [ zeroinitializer, %L17 ], [ %37, %L25 ]
  %41 = bitcast [2 x double]* %0 to <2 x double>*
  store <2 x double> %40, <2 x double>* %41, align 8
  ret void
}
Versus the vectorized version:
julia> @code_llvm debuginfo=:none sumsc_vectorized(θ, c)

define void @julia_sumsc_vectorized_18810([2 x double]* noalias nocapture sret, %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_array_t addrspace(11)*
  %9 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %8, i64 0, i32 1
  %10 = load i64, i64 addrspace(11)* %9, align 8
  %11 = icmp slt i64 %10, %6
  %12 = select i1 %11, i64 %10, i64 %6
  %13 = lshr i64 %12, 2
  %14 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  %15 = bitcast %jl_value_t* %14 to i64*
  %16 = load i64, i64* %15, align 8
  %17 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %18 = bitcast %jl_value_t* %17 to i64*
  %19 = load i64, i64* %18, align 8
  %20 = icmp eq i64 %13, 0
  br i1 %20, label %L223, label %L22.preheader

L22.preheader:                                    ; preds = %top
  %21 = inttoptr i64 %19 to i8*
  %22 = inttoptr i64 %16 to i8*
  br label %L22

L22:                                              ; preds = %L22.preheader, %L22
  %value_phi2 = phi <4 x double> [ %res.i96, %L22 ], [ zeroinitializer, %L22.preheader ]
  %value_phi3 = phi <4 x double> [ %res.i94, %L22 ], [ zeroinitializer, %L22.preheader ]
  %value_phi4 = phi i64 [ %39, %L22 ], [ 0, %L22.preheader ]
  %value_phi5 = phi i64 [ %41, %L22 ], [ 1, %L22.preheader ]
  %23 = shl i64 %value_phi4, 3
  %24 = getelementptr i8, i8* %21, i64 %23
  %ptr.i = bitcast i8* %24 to <4 x double>*
  %res.i = load <4 x double>, <4 x double>* %ptr.i, align 8
  %res.i153 = fmul fast <4 x double> %res.i, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i152 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %res.i153)
  %res.i151 = fmul fast <4 x double> %res.i, <double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883>
  %res.i150 = fmul fast <4 x double> %res.i152, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i149 = fsub fast <4 x double> %res.i151, %res.i150
  %res.i148 = call <4 x double> @llvm.rint.v4f64(<4 x double> %res.i149)
  %res.i147 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <4 x double> %res.i)
  %res.i146 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <4 x double> %res.i147)
  %res.i145 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <4 x double> %res.i146)
  %res.i144 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <4 x double> %res.i145)
  %res.i143 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <4 x double> %res.i144)
  %res.i142 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <4 x double> %res.i143)
  %res.i140 = fadd fast <4 x double> %res.i148, %res.i150
  %res.i139 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i140, <4 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <4 x double> %res.i142)
  %res.i138 = fmul fast <4 x double> %res.i139, %res.i139
  %res.i137 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> <double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B>, <4 x double> <double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8>)
  %res.i136 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i137, <4 x double> <double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C>)
  %res.i135 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i136, <4 x double> <double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A>)
  %res.i134 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i135, <4 x double> <double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135>)
  %res.i133 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i134, <4 x double> <double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542>)
  %res.i132 = fmul fast <4 x double> %res.i138, %res.i139
  %res.i131 = fmul fast <4 x double> %res.i132, %res.i133
  %res.i130 = fadd fast <4 x double> %res.i131, %res.i139
  %res.i129 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> <double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE>, <4 x double> <double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05>)
  %res.i128 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i129, <4 x double> <double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C>)
  %res.i127 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i128, <4 x double> <double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025>)
  %res.i126 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i127, <4 x double> <double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96>)
  %res.i125 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i126, <4 x double> <double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545>)
  %res.i124 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i125, <4 x double> <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>)
  %res.i123 = fmul fast <4 x double> %res.i124, %res.i138
  %res.i122 = fadd fast <4 x double> %res.i123, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %25 = extractelement <4 x double> %res.i148, i32 0
  %26 = fptosi double %25 to i64
  %27 = extractelement <4 x double> %res.i148, i32 1
  %28 = fptosi double %27 to i64
  %29 = extractelement <4 x double> %res.i148, i32 2
  %30 = fptosi double %29 to i64
  %31 = extractelement <4 x double> %res.i148, i32 3
  %32 = fptosi double %31 to i64
  %33 = insertelement <4 x i64> undef, i64 %26, i32 0
  %34 = insertelement <4 x i64> %33, i64 %28, i32 1
  %35 = insertelement <4 x i64> %34, i64 %30, i32 2
  %36 = insertelement <4 x i64> %35, i64 %32, i32 3
  %37 = trunc <4 x i64> %36 to <4 x i1>
  %res.i116 = select <4 x i1> %37, <4 x double> %res.i130, <4 x double> %res.i122
  %res.i114 = select <4 x i1> %37, <4 x double> %res.i122, <4 x double> %res.i130
  %res.i112 = and <4 x i64> %36, <i64 2, i64 2, i64 2, i64 2>
  %res.i110 = icmp eq <4 x i64> %res.i112, zeroinitializer
  %res.i109 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i114
  %res.i108 = select <4 x i1> %res.i110, <4 x double> %res.i114, <4 x double> %res.i109
  %res.i106 = add nsw <4 x i64> %36, <i64 1, i64 1, i64 1, i64 1>
  %res.i105 = and <4 x i64> %res.i106, <i64 2, i64 2, i64 2, i64 2>
  %res.i103 = icmp eq <4 x i64> %res.i105, zeroinitializer
  %res.i102 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i116
  %res.i101 = select <4 x i1> %res.i103, <4 x double> %res.i116, <4 x double> %res.i102
  %38 = getelementptr i8, i8* %22, i64 %23
  %ptr.i98 = bitcast i8* %38 to <4 x double>*
  %res.i99 = load <4 x double>, <4 x double>* %ptr.i98, align 8
  %res.i97 = fmul fast <4 x double> %res.i108, %res.i99
  %res.i96 = fadd fast <4 x double> %res.i97, %value_phi2
  %res.i95 = fmul fast <4 x double> %res.i101, %res.i99
  %res.i94 = fadd fast <4 x double> %res.i95, %value_phi3
  %39 = add nuw i64 %value_phi4, 4
  %40 = icmp eq i64 %value_phi5, %13
  %41 = add nuw nsw i64 %value_phi5, 1
  br i1 %40, label %L223.loopexit, label %L22

L223.loopexit:                                    ; preds = %L22
  %42 = shl i64 %12, 3
  %phitmp = and i64 %42, -32
  br label %L223

L223:                                             ; preds = %L223.loopexit, %top
  %value_phi8 = phi <4 x double> [ zeroinitializer, %top ], [ %res.i96, %L223.loopexit ]
  %value_phi9 = phi <4 x double> [ zeroinitializer, %top ], [ %res.i94, %L223.loopexit ]
  %value_phi10 = phi i64 [ 0, %top ], [ %phitmp, %L223.loopexit ]
  %43 = and i64 %12, 3
  %44 = icmp eq i64 %43, 0
  br i1 %44, label %L427, label %L228

L228:                                             ; preds = %L223
  %45 = trunc i64 %43 to i8
  %notmask = shl nsw i8 -1, %45
  %46 = xor i8 %notmask, 15
  %47 = inttoptr i64 %19 to i8*
  %48 = getelementptr i8, i8* %47, i64 %value_phi10
  %ptr.i90 = bitcast i8* %48 to <4 x double>*
  %masktrunc.i91 = trunc i8 %46 to i4
  %mask.i92 = bitcast i4 %masktrunc.i91 to <4 x i1>
  %res.i93 = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* %ptr.i90, i32 8, <4 x i1> %mask.i92, <4 x double> zeroinitializer)
  %res.i89 = fmul fast <4 x double> %res.i93, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i88 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %res.i89)
  %res.i87 = fmul fast <4 x double> %res.i93, <double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883>
  %res.i86 = fmul fast <4 x double> %res.i88, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i85 = fsub fast <4 x double> %res.i87, %res.i86
  %res.i84 = call <4 x double> @llvm.rint.v4f64(<4 x double> %res.i85)
  %res.i83 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <4 x double> %res.i93)
  %res.i82 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <4 x double> %res.i83)
  %res.i81 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <4 x double> %res.i82)
  %res.i80 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <4 x double> %res.i81)
  %res.i79 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <4 x double> %res.i80)
  %res.i78 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <4 x double> %res.i79)
  %res.i76 = fadd fast <4 x double> %res.i84, %res.i86
  %res.i75 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i76, <4 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <4 x double> %res.i78)
  %res.i74 = fmul fast <4 x double> %res.i75, %res.i75
  %res.i73 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> <double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B>, <4 x double> <double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8>)
  %res.i72 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i73, <4 x double> <double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C>)
  %res.i71 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i72, <4 x double> <double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A>)
  %res.i70 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i71, <4 x double> <double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135>)
  %res.i69 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i70, <4 x double> <double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542>)
  %res.i68 = fmul fast <4 x double> %res.i74, %res.i75
  %res.i67 = fmul fast <4 x double> %res.i68, %res.i69
  %res.i66 = fadd fast <4 x double> %res.i67, %res.i75
  %res.i65 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> <double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE>, <4 x double> <double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05>)
  %res.i64 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i65, <4 x double> <double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C>)
  %res.i63 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i64, <4 x double> <double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025>)
  %res.i62 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i63, <4 x double> <double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96>)
  %res.i61 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i62, <4 x double> <double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545>)
  %res.i60 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i61, <4 x double> <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>)
  %res.i59 = fmul fast <4 x double> %res.i60, %res.i74
  %res.i58 = fadd fast <4 x double> %res.i59, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %49 = extractelement <4 x double> %res.i84, i32 0
  %50 = fptosi double %49 to i64
  %51 = extractelement <4 x double> %res.i84, i32 1
  %52 = fptosi double %51 to i64
  %53 = extractelement <4 x double> %res.i84, i32 2
  %54 = fptosi double %53 to i64
  %55 = extractelement <4 x double> %res.i84, i32 3
  %56 = fptosi double %55 to i64
  %57 = insertelement <4 x i64> undef, i64 %50, i32 0
  %58 = insertelement <4 x i64> %57, i64 %52, i32 1
  %59 = insertelement <4 x i64> %58, i64 %54, i32 2
  %60 = insertelement <4 x i64> %59, i64 %56, i32 3
  %61 = trunc <4 x i64> %60 to <4 x i1>
  %res.i52 = select <4 x i1> %61, <4 x double> %res.i66, <4 x double> %res.i58
  %res.i50 = select <4 x i1> %61, <4 x double> %res.i58, <4 x double> %res.i66
  %res.i48 = and <4 x i64> %60, <i64 2, i64 2, i64 2, i64 2>
  %res.i46 = icmp eq <4 x i64> %res.i48, zeroinitializer
  %res.i45 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i50
  %res.i44 = select <4 x i1> %res.i46, <4 x double> %res.i50, <4 x double> %res.i45
  %res.i42 = add nsw <4 x i64> %60, <i64 1, i64 1, i64 1, i64 1>
  %res.i41 = and <4 x i64> %res.i42, <i64 2, i64 2, i64 2, i64 2>
  %res.i40 = icmp eq <4 x i64> %res.i41, zeroinitializer
  %res.i39 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i52
  %res.i38 = select <4 x i1> %res.i40, <4 x double> %res.i52, <4 x double> %res.i39
  %62 = inttoptr i64 %16 to i8*
  %63 = getelementptr i8, i8* %62, i64 %value_phi10
  %ptr.i35 = bitcast i8* %63 to <4 x double>*
  %res.i36 = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* %ptr.i35, i32 8, <4 x i1> %mask.i92, <4 x double> zeroinitializer)
  %res.i34 = fmul fast <4 x double> %res.i44, %res.i36
  %res.i33 = fadd fast <4 x double> %res.i34, %value_phi8
  %res.i32 = select <4 x i1> %mask.i92, <4 x double> %res.i33, <4 x double> %value_phi8
  %res.i29 = fmul fast <4 x double> %res.i38, %res.i36
  %res.i28 = fadd fast <4 x double> %res.i29, %value_phi9
  %res.i27 = select <4 x i1> %mask.i92, <4 x double> %res.i28, <4 x double> %value_phi9
  br label %L427

L427:                                             ; preds = %L223, %L228
  %value_phi11 = phi <4 x double> [ %res.i32, %L228 ], [ %value_phi8, %L223 ]
  %value_phi12 = phi <4 x double> [ %res.i27, %L228 ], [ %value_phi9, %L223 ]
  %vec_2_1.i20 = shufflevector <4 x double> %value_phi12, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i21 = shufflevector <4 x double> %value_phi12, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i22 = fadd <2 x double> %vec_2_1.i20, %vec_2_2.i21
  %vec_1_1.i23 = shufflevector <2 x double> %vec_2.i22, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i24 = shufflevector <2 x double> %vec_2.i22, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i25 = fadd <1 x double> %vec_1_1.i23, %vec_1_2.i24
  %res.i26 = extractelement <1 x double> %vec_1.i25, i32 0
  %vec_2_1.i = shufflevector <4 x double> %value_phi11, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i = shufflevector <4 x double> %value_phi11, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i19 = extractelement <1 x double> %vec_1.i, i32 0
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %res.i19, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx14 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %res.i26, double* %.sroa.2.0..sroa_idx14, align 8
  ret void
}
The key difference to note here is that the vectorized version is full of <4 x double> vectors, as it is calculating 4 loop iterations at a time. The SLEEF version is calculating 1 loop iteration at a time (the [2 x double] that you see holds the sine and cosine).
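If you don't want to read the IR every time, a rough shortcut (my own helper, not part of LoopVectorization or SLEEF) is to dump the IR to a buffer and search it for wide vector types:

using InteractiveUtils  # provides code_llvm

# Hypothetical helper: true if the compiled IR contains 4-wide or wider
# double vectors. We deliberately skip <2 x double>, since the SLEEF
# version also contains one (the packed sin/cos return value).
function has_wide_vectors(f, args...)
    io = IOBuffer()
    code_llvm(io, f, Tuple{map(typeof, args)...}; debuginfo=:none)
    occursin(r"<(4|8|16) x double>", String(take!(io)))
end

has_wide_vectors(sumsc_vectorized, θ, c)  # true (on this AVX2 machine)
has_wide_vectors(sumsc_sleef, θ, c)       # false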
Note that this vectorized version is incompatible with the if cedg[l] != 0 check, meaning you'd have to remove that check to use it. I already benchmarked that, and found that dropping the check was much slower than the serial version that keeps it.
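For reference, here is a sketch of the kind of branchy loop I mean (the name cedg follows your post; this is my reconstruction, not your exact code):

function sumsc_branchy(θ::AbstractVector{Float64}, cedg::AbstractVector{Float64})
    s, c = 0.0, 0.0
    @inbounds for l in eachindex(θ, cedg)
        cedg[l] == 0 && continue     # this early-out is what blocks SIMD:
        sinθₗ, cosθₗ = sincos(θ[l])  # iterations no longer do uniform work
        s += cedg[l] * sinθₗ
        c += cedg[l] * cosθₗ
    end
    s, c
end

If most entries of cedg are zero, the branch skips most of the expensive sincos calls, which is exactly why the serial version with the check can beat the branch-free vectorized one.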
Also, note that LoopVectorization is actually using basically the exact same code as SLEEF. The problem with SLEEF is that it does not inline the sincos_fast call, so the loop is stuck making a series of independent, one-at-a-time calls to sincos_fast; you can see this in the SLEEF IR above, where the loop body contains an opaque call void @julia_sincos_fast_19234(...) that the compiler cannot widen across iterations. Otherwise, they're actually running the same implementation.