This post was temporarily hidden by the community for possibly being off-topic, inappropriate, or spammy.
Are you only running @time
once? If so, then youβre also timing the entire compilation process: see this section of the docs.
Or, to put it another way, if you run
@time for x in 1:100000
calcprojsimdi(pa, N)
end
@time for x in 1:100000
calcprojsimdi(pa, N)
end
are both results the same? Edit: or just use BenchmarkTools.jl and avoid the issue altogether.
Also, I donβt think itβs necessarily a good idea to @inline
all of your functions. Does removing those annotations actually hurt your performance?
Also, thereβs no need to time your function manually running it 100,000 times. Just use https://github.com/JuliaCI/BenchmarkTools.jl :
using BenchmarkTools
@benchmark calcprojsimdi($pa, $N)
BenchmarkTools takes care of ensuring the function is compiled before it is benchmarked and runs it enough times to get a reliable estimate of the execution time.
Unpack the values before the loop and see if it helps the optimizer.
Hello rdeits
thank you for the suggestions.
Yes, I run them several times (also looked at the section you mention - seems most to be ok), but for 100β000 iterations the code seems to be stable - I receive consistent timings independent of the number of runs.
The lack of @inline actually reduces the performance by about 5% - seems the code is still in the cache.
About BenchmarkTools - yes - I tried that one too, but a loop of some 100000 iterations gives me 1 to 1 comparison with a similar loop in Java after a prewarmed JVM. (There we have also something similar - called JMH, I know about at least several issues for not using such tools, but my case seems to me yet pretty simple to have just iterations).
Hello Chris
not sure what you mean with upacking them - the main structure is a struct of arrays. For each iteration I have to get a value of some array, do some arithmetic using values at address T, T-1 or T+1 from other arrays in the struct or the external parameter arrays, and put the value back. As the values from the other arrays are used mostly once per calculation step, unpacking (if I understand correctly) would involve actually copying that values to local variables on the stack and performing them math on them. But as they are used only once, would that not produce actually additional copying without a desired effect around simd/caching?
Would be happy if you elaborate.
With best regards
Plamen
Exactly.
@inline function csi_MarketSpreadAdjustedSpot(s::Proj, N::Int32)
MarketSpreadAdjustedSpot = s.MarketSpreadAdjustedSpot
MarketSpreadAdjustedSpot[1] = zero(Float64)
@simd for T = 2:N
@inbounds MarketSpreadAdjustedSpot[T] = s.Spot[T] + s.CreditSpread[T] + MarketSpread
end
end
Etc. Iβve seen the optimizer mess this up before (see @dextoriousβs issue for an example of this)
What Julia version are you using? And are you starting Julia with the -O3
flag?
If that helps - here is some output from the @code_xxx directives for an example method. I hope there is a hint about something I fundamentally miss. (For example, I see that some SIMD instructions are used, but the setup/cleaup of the method is immense - I would assume that 2 indexes in 2 registers wiht some indirect indexing at least for a non-SIMD version would bring down the conde to may be 2 dozen CPU instructions - here I see much more and not sure why.
I use Julia 1.0 if that matters on a Core I7 CPU from 2015 (macOS) with the following features:
sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
And here the actual function:
@inline function csi_MonthlyInterpolatedZCBAdj(s::Proj, N::Int64)
s.MonthlyInterpolatedZCBAdj[1] = one(Float64)
@simd for T = 2:N
@inbounds s.MonthlyInterpolatedZCBAdj[T] = (s.MonthlyInterpolatedZCBAdj[T-1] / (s.MonthlyRateEachYear2[s.Yr[T]+1] + 1))
end
nothing
end
@code_lowered csi_MonthlyInterpolatedZCBAdj(pa, 125)
CodeInfo(
1 β nothing β
238 β %2 = (Main.one)(Main.Float64) β
β %3 = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj) β
β (Base.setindex!)(%3, %2, 1) β
239 β r#402 = 2:N ββ» macro expansion
β %6 = Base.simd_outer_range ββ
β %7 = (%6)(r#402) ββ
β #temp# = (Base.iterate)(%7) ββ
β %9 = #temp# === nothing ββ
β %10 = (Base.not_int)(%9) ββ
βββ goto #9 if not %10 ββ
2 β %12 = #temp# ββ
β i#403 = (Core.getfield)(%12, 1) ββ
β %14 = (Core.getfield)(%12, 2) ββ
β Core.NewvarNode(:(T@_7)) ββ
β %16 = Base.simd_inner_length ββ
β %17 = r#402 ββ
β n#404 = (%16)(%17, i#403) ββ
β %19 = (Main.zero)(n#404) ββ
β %20 = %19 < n#404 ββ
βββ goto #7 if not %20 ββ
3 β i#405 = (Main.zero)(n#404) ββ
4 β %23 = i#405 < n#404 ββ
βββ goto #6 if not %23 ββ
5 β %25 = Base.simd_index ββ
β %26 = r#402 ββ
β %27 = i#403 ββ
β T@_11 = (%25)(%26, %27, i#405) ββ
β $(Expr(:inbounds, true)) βββ» macro expansion
β %30 = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj) βββ
β %31 = T@_11 - 1 βββ
β %32 = (Base.getindex)(%30, %31) βββ
β %33 = (Base.getproperty)(s, :MonthlyRateEachYear2) βββ
β %34 = (Base.getproperty)(s, :Yr) βββ
β %35 = (Base.getindex)(%34, T@_11) βββ
β %36 = %35 + 1 βββ
β %37 = (Base.getindex)(%33, %36) βββ
β %38 = %37 + 1 βββ
β %39 = %32 / %38 βββ
β %40 = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj) βββ
β (Base.setindex!)(%40, %39, T@_11) βββ
β val = %39 βββ
β $(Expr(:inbounds, :pop)) βββ
β val βββ
β i#405 = i#405 + 1 βββ
β $(Expr(:simdloop, false)) ββ» macro expansion
βββ goto #4 ββ
6 β T@_7 = (Main.last)(r#402) ββ
7 β #temp# = (Base.iterate)(%7, %14) ββ
β %50 = #temp# === nothing ββ
β %51 = (Base.not_int)(%50) ββ
βββ goto #9 if not %51 ββ
8 β goto #2 ββ
9 β Main.nothing ββ
βββ return Main.nothing ββ
)
@code_llvm csi_MonthlyInterpolatedZCBAdj(pa, 125)
; Function csi_MonthlyInterpolatedZCBAdj
; Location: /Users/ps/git/try-julia-01/try07.jl:238
define void @julia_csi_MonthlyInterpolatedZCBAdj_37744(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(272), i64) {
top:
%gcframe = alloca %jl_value_t addrspace(10)*, i32 3
%2 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
call void @llvm.memset.p0i8.i32(i8* %2, i8 0, i32 24, i32 0, i1 false)
%3 = call %jl_value_t*** inttoptr (i64 4364046112 to %jl_value_t*** ()*)() #5
; Function getproperty; {
; Location: sysimg.jl:18
%4 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
%5 = bitcast %jl_value_t addrspace(10)** %4 to i64*
store i64 2, i64* %5
%6 = getelementptr %jl_value_t**, %jl_value_t*** %3, i32 0
%7 = load %jl_value_t**, %jl_value_t*** %6
%8 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
%9 = bitcast %jl_value_t addrspace(10)** %8 to %jl_value_t***
store %jl_value_t** %7, %jl_value_t*** %9
%10 = bitcast %jl_value_t*** %6 to %jl_value_t addrspace(10)***
store %jl_value_t addrspace(10)** %gcframe, %jl_value_t addrspace(10)*** %10
%11 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
%12 = bitcast %jl_value_t addrspace(11)* %11 to i8 addrspace(11)*
%13 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 88
%14 = bitcast i8 addrspace(11)* %13 to %jl_value_t addrspace(10)* addrspace(11)*
%15 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %14, align 8
;}
; Function setindex!; {
; Location: array.jl:769
%16 = addrspacecast %jl_value_t addrspace(10)* %15 to %jl_value_t addrspace(11)*
%17 = bitcast %jl_value_t addrspace(11)* %16 to %jl_array_t addrspace(11)*
%18 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %17, i64 0, i32 1
%19 = load i64, i64 addrspace(11)* %18, align 8
%20 = icmp eq i64 %19, 0
br i1 %20, label %oob, label %idxend
L28: ; preds = %idxend
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
call void @julia_throw_overflowerr_binaryop_24265(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4489405088 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64%32, i64 1)
call void @llvm.trap()
unreachable
L84: ; preds = %L39.us27, %L33.lr.ph
;}}}
; Location: simdloop.jl:65
%21 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
%22 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %21
%23 = getelementptr %jl_value_t**, %jl_value_t*** %3, i32 0
%24 = bitcast %jl_value_t*** %23 to %jl_value_t addrspace(10)**
store %jl_value_t addrspace(10)* %22, %jl_value_t addrspace(10)** %24
ret void
oob: ; preds = %top
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:238
; Function setindex!; {
; Location: array.jl:769
%25 = alloca i64, align 8
store i64 1, i64* %25, align 8
%26 = addrspacecast %jl_value_t addrspace(10)* %15 to %jl_value_t addrspace(12)*
%27 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
store %jl_value_t addrspace(10)* %15, %jl_value_t addrspace(10)** %27
call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %26, i64* nonnull %25, i64 1)
unreachable
idxend: ; preds = %top
%28 = bitcast %jl_value_t addrspace(11)* %16 to double addrspace(13)* addrspace(11)*
%29 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %28, align 8
store double 1.000000e+00, double addrspace(13)* %29, align 8
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:65
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:255
; Function unitrange_last; {
; Location: range.jl:260
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
%30 = icmp sgt i64 %1, 1
;}}
%31 = select i1 %30, i64 %1, i64 1
;}}}
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_sub; {
; Location: checked.jl:226
; Function sub_with_overflow; {
; Location: checked.jl:198
%32 = add nsw i64 %31, -2
;}}
; Function checked_add; {
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
%33 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %32, i64 1)
%34 = extractvalue { i64, i1 } %33, 1
;}
; Location: checked.jl:170
br i1 %34, label %L28, label %L33.lr.ph
L33.lr.ph: ; preds = %idxend
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
%35 = extractvalue { i64, i1 } %33, 0
%36 = icmp slt i64 %35, 1
;}
; Location: checked.jl:170
br i1 %36, label %L84, label %L39.lr.ph.us26
L39.lr.ph.us26: ; preds = %L33.lr.ph
%37 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %14, align 8
%38 = addrspacecast %jl_value_t addrspace(10)* %37 to %jl_value_t addrspace(11)*
%39 = bitcast %jl_value_t addrspace(11)* %38 to double addrspace(13)* addrspace(11)*
%40 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 64
%41 = bitcast i8 addrspace(11)* %40 to %jl_value_t addrspace(10)* addrspace(11)*
%42 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %41, align 8
%43 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 8
%44 = bitcast i8 addrspace(11)* %43 to %jl_value_t addrspace(10)* addrspace(11)*
%45 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %44, align 8
%46 = addrspacecast %jl_value_t addrspace(10)* %45 to %jl_value_t addrspace(11)*
%47 = bitcast %jl_value_t addrspace(11)* %46 to i64 addrspace(13)* addrspace(11)*
%48 = addrspacecast %jl_value_t addrspace(10)* %42 to %jl_value_t addrspace(11)*
%49 = bitcast %jl_value_t addrspace(11)* %48 to double addrspace(13)* addrspace(11)*
%50 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %39, align 8
%51 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %47, align 8
%52 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %49, align 8
;}}}
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: /Users/ps/git/try-julia-01/try07.jl:240
; Function getindex; {
; Location: array.jl:731
%.pre = load double, double addrspace(13)* %50, align 8
;}}
; Location: simdloop.jl:71
br label %L39.us27
L39.us27: ; preds = %L39.us27, %L39.lr.ph.us26
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: /Users/ps/git/try-julia-01/try07.jl:240
; Function getindex; {
; Location: array.jl:731
%53 = phi double [ %.pre, %L39.lr.ph.us26 ], [ %60, %L39.us27 ]
%value_phi421.us28 = phi i64 [ 0, %L39.lr.ph.us26 ], [ %54, %L39.us27 ]
%54 = add nuw nsw i64 %value_phi421.us28, 1
%55 = getelementptr inbounds i64, i64 addrspace(13)* %51, i64 %54
%56 = load i64, i64 addrspace(13)* %55, align 8
%57 = getelementptr inbounds double, double addrspace(13)* %52, i64 %56
%58 = load double, double addrspace(13)* %57, align 8
;}
; Function +; {
; Location: promotion.jl:313
; Function +; {
; Location: float.jl:395
%59 = fadd double %58, 1.000000e+00
;}}
; Function /; {
; Location: float.jl:401
%60 = fdiv double %53, %59
;}
; Function setindex!; {
; Location: array.jl:769
%61 = getelementptr inbounds double, double addrspace(13)* %50, i64 %54
store double %60, double addrspace(13)* %61, align 8
;}}
; Location: simdloop.jl:71
; Function <; {
; Location: int.jl:49
%62 = icmp ult i64 %54, %35
;}
br i1 %62, label %L39.us27, label %L84
;}
}
And the actual assembler produced:
@code_warntype csi_MonthlyInterpolatedZCBAdj(pa, 125)
Body::Nothing
238 1 ββ %1 = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1} ββ» getproperty
β (Base.arrayset)(true, %1, 1.0, 1) ββ» setindex!
239 β %3 = (Base.sle_int)(2, N)::Bool ββ»β·β·β·β·β· macro expansion
β (Base.sub_int)(N, 2) βββ» Colon
β %5 = (Base.ifelse)(%3, N, 1)::Int64 βββββ Type
β %6 = %new(UnitRange{Int64}, 2, %5)::UnitRange{Int64} ββββ
β (Base.ifelse)(true, 0, -1) βββ»β·β·β· simd_outer_range
β %8 = (Base.slt_int)(0, 0)::Bool ββββ»β·β· isempty
ββββ goto #3 if not %8 βββ
2 ββ goto #4 βββ
3 ββ goto #4 βββ
4 ββ %12 = Ο (#2 => true, #3 => false)::Bool ββ
β %13 = Ο (#3 => 0)::Int64 ββ
β %14 = (Base.not_int)(%12)::Bool ββ
ββββ goto #34 if not %14 ββ
5 ββ %16 = Ο (#4 => %13, #33 => %79)::Int64 ββ
β %17 = (Base.Checked.checked_ssub_int)(%5, 2)::Tuple{Int64,Bool} βββ»β·β·β· simd_inner_length
β %18 = (Base.getfield)(%17, 1, true)::Int64 ββββ»β· length
β %19 = (Base.getfield)(%17, 2, true)::Bool βββββ» checked_sub
ββββ goto #7 if not %19 βββββ» checked_sub
6 ββ invoke Base.Checked.throw_overflowerr_binaryop(:-::Symbol, %5::Int64, 2::Int64) βββββ
ββββ $(Expr(:unreachable)) βββββ
7 ββ goto #8 βββββ
8 ββ %24 = (Base.Checked.checked_sadd_int)(%18, 1)::Tuple{Int64,Bool} ββββββ» add_with_overflow
β %25 = (Base.getfield)(%24, 1, true)::Int64 βββββββ»β· indexed_iterate
β %26 = (Base.getfield)(%24, 2, true)::Bool βββββββ» getindex
ββββ goto #10 if not %26 βββββ» checked_add
9 ββ invoke Base.Checked.throw_overflowerr_binaryop(:+::Symbol, %18::Int64, 1::Int64) βββββ
ββββ $(Expr(:unreachable)) βββββ
10 β goto #11 βββββ
11 β goto #12 ββββ
12 β goto #13 βββ
13 β %33 = (Base.slt_int)(0, %25)::Bool βββ» <
ββββ goto #29 if not %33 ββ
14 β nothing β
15 β %36 = Ο (#14 => 0, #27 => %70)::Int64 ββ
β %37 = (Base.slt_int)(%36, %25)::Bool βββ» <
ββββ goto #28 if not %37 ββ
16 β %39 = (Base.add_int)(%36, 1)::Int64 βββ»β· simd_index
β %40 = (Base.add_int)(2, %39)::Int64 ββββ» getindex
β %41 = (Base.sub_int)(%40, 1)::Int64 βββββ» -
ββββ goto #25 if not false ββββ» getindex
17 β %43 = (Base.slt_int)(0, %39)::Bool βββββ»β· _in_unit_range
ββββ goto #21 if not %43 βββββ
18 β %45 = (Base.sle_int)(%41, %5)::Bool ββββββ» <=
ββββ goto #20 if not %45 βββββ
19 β %47 = (Base.sle_int)(2, %41)::Bool βββββββ» <=
ββββ goto #22 βββββ
20 β goto #22 βββββ
21 β goto #22 βββββ
22 β %51 = Ο (#19 => %47, #20 => false, #21 => false)::Bool ββββ
ββββ goto #24 if not %51 ββββ
23 β goto #25 ββββ
24 β invoke Base.throw_boundserror(%6::UnitRange{Int64}, %39::Int64) ββββ
ββββ $(Expr(:unreachable)) ββββ
25 β goto #26 ββββ
26 β goto #27 βββ» simd_index
27 β %58 = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1} βββ»β· macro expansion
β %59 = (Base.sub_int)(%41, 1)::Int64 ββββ» -
β %60 = (Base.arrayref)(false, %58, %59)::Float64 ββββ» getindex
β %61 = (Base.getfield)(s, :MonthlyRateEachYear2)::Array{Float64,1} ββββ» getproperty
β %62 = (Base.getfield)(s, :Yr)::Array{Int64,1} ββββ
β %63 = (Base.arrayref)(false, %62, %41)::Int64 ββββ» getindex
β %64 = (Base.add_int)(%63, 1)::Int64 ββββ» +
β %65 = (Base.arrayref)(false, %61, %64)::Float64 ββββ» getindex
β %66 = (Base.add_float)(%65, 1.0)::Float64 βββββ» +
β %67 = (Base.div_float)(%60, %66)::Float64 ββββ» /
β %68 = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1} ββββ» getproperty
β (Base.arrayset)(false, %68, %67, %41) ββββ» setindex!
β %70 = (Base.add_int)(%36, 1)::Int64 ββββ» +
β $(Expr(:simdloop, false)) ββ» macro expansion
ββββ goto #15 ββ
28 β nothing β
29 β %74 = (%16 === 0)::Bool βββ»β· iterate
ββββ goto #31 if not %74 βββ
30 β goto #32 βββ
31 β %77 = (Base.add_int)(%16, 1)::Int64 ββββ» +
ββββ goto #32 βββ» iterate
32 β %79 = Ο (#31 => %77)::Int64 ββ
β %80 = Ο (#30 => true, #31 => false)::Bool ββ
β %81 = (Base.not_int)(%80)::Bool ββ
ββββ goto #34 if not %81 ββ
33 β goto #5 ββ
34 β return Main.nothing
@code_native csi_MonthlyInterpolatedZCBAdj(pa, 125)
.section __TEXT,__text,regular,pure_instructions
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: try07.jl:238
pushl %ebp
decl %eax
movl %esp, %ebp
incl %ecx
pushl %esi
pushl %ebx
decl %eax
subl $32, %esp
decl %ecx
movl %esi, %esi
decl %eax
movl %edi, %ebx
vxorpd %xmm0, %xmm0, %xmm0
vmovapd %xmm0, -48(%ebp)
decl %eax
movl $0, -32(%ebp)
decl %eax
movl $69078816, %eax ## imm = 0x41E0F20
addl %eax, (%eax)
addb %al, (%eax)
calll *%eax
; Function getproperty; {
; Location: sysimg.jl:18
decl %eax
movl $2, -48(%ebp)
decl %eax
movl (%eax), %ecx
decl %eax
movl %ecx, -40(%ebp)
decl %eax
leal -48(%ebp), %ecx
decl %eax
movl %ecx, (%eax)
decl %eax
movl 88(%ebx), %edi
;}
; Function setindex!; {
; Location: array.jl:769
decl %eax
cmpl $0, 8(%edi)
je L221
decl %eax
movl (%edi), %ecx
decl %eax
movl $0, %edx
addb %al, (%eax)
lock
aas
decl %eax
movl %edx, (%ecx)
;}
; Location: try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:65
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:255
; Function unitrange_last; {
; Location: range.jl:260
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
decl %ebp
testl %esi, %esi
movl $1, %esi
;}}
decl %ecx
cmovgl %esi, %esi
;}}}}
; Function macro expansion; {
; Location: checked.jl:198
decl %eax
addl $-2, %esi
;}
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: checked.jl:136
decl %ecx
movl %esi, %eax
decl %ecx
incl %eax
;}
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
jo L260
decl %ebp
testl %eax, %eax
; Location: checked.jl:170
jle L205
;}}}}
; Function macro expansion; {
; Location: checked.jl
decl %eax
movl 88(%ebx), %edx
decl %eax
movl 8(%ebx), %esi
decl %eax
movl 64(%ebx), %edi
decl %eax
movl (%edx), %edx
decl %eax
movl (%esi), %esi
decl %eax
movl (%edi), %edi
;}
; Function macro expansion; {
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: try07.jl:240
; Function getindex; {
; Location: array.jl:731
vmovsd (%edx), %xmm0 ## xmm0 = mem[0],zero
xorl %ebx, %ebx
decl %eax
movl $677370832, %ecx ## imm = 0x285FDBD0
addl %eax, (%eax)
addb %al, (%eax)
vmovsd (%ecx), %xmm1 ## xmm1 = mem[0],zero
nopl (%eax)
L176:
decl %eax
movl 8(%esi,%ebx,8), %ecx
;}}
; Function macro expansion; {
; Location: float.jl:395
vaddsd (%edi,%ecx,8), %xmm1, %xmm2
;}
; Function macro expansion; {
; Location: try07.jl:240
; Function /; {
; Location: float.jl:401
vdivsd %xmm2, %xmm0, %xmm0
;}
; Function setindex!; {
; Location: array.jl:769
vmovsd %xmm0, 8(%edx,%ebx,8)
;}}
; Function macro expansion; {
; Location: array.jl:731
decl %eax
leal 1(%ebx), %ebx
;}}
; Function macro expansion; {
; Location: int.jl:49
decl %esp
cmpl %eax, %ebx
;}}
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: simdloop.jl:71
jb L176
; Location: simdloop.jl:65
L205:
decl %eax
movl -40(%ebp), %ecx
decl %eax
movl %ecx, (%eax)
decl %eax
leal -16(%ebp), %esp
popl %ebx
incl %ecx
popl %esi
popl %ebp
retl
;}
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: try07.jl:238
; Function setindex!; {
; Location: array.jl:769
L221:
decl %eax
movl %esp, %eax
decl %eax
leal -16(%eax), %esi
decl %eax
movl %esi, %esp
decl %eax
movl $1, -16(%eax)
decl %eax
movl %edi, -32(%ebp)
decl %eax
movl $69208848, %eax ## imm = 0x4200B10
addl %eax, (%eax)
addb %al, (%eax)
movl $1, %edx
calll *%eax
;}
; Location: try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
L260:
decl %eax
movl $209932240, %eax ## imm = 0xC834FD0
addl %eax, (%eax)
addb %al, (%eax)
decl %eax
movl $194437792, %edi ## imm = 0xB96E2A0
addl %eax, (%eax)
addb %al, (%eax)
movl $1, %edx
calll *%eax
ud2
nopw %cs:(%eax,%eax)
;}}}}}
v 1.0 with -O3 turned on.
Ok, thatβs good. Versions prior to 0.7 required you to build the system image yourself to get all the SIMD optimizations, but thatβs no longer the case.
I think improperly handing Int32
is causing much of your issues. Consider this test case:
function test1(v)
@inbounds for i β 1:eltype(v)(length(v))
v[i] += 1 + i
end
end
v32 = collect(Int32(1):Int32(10^3)); # Array{Int32,1}
v64 = collect(1:10^3); # Array{Int64,1}
Nowβ¦
julia> @benchmark test1($v32)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.406 ΞΌs (0.00% GC)
median time: 2.429 ΞΌs (0.00% GC)
mean time: 2.447 ΞΌs (0.00% GC)
maximum time: 4.851 ΞΌs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark test1($v64)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 207.042 ns (0.00% GC)
median time: 210.025 ns (0.00% GC)
mean time: 216.906 ns (0.00% GC)
maximum time: 693.890 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 565
Over 10x faster!
Inspecting the @code_typed
and @code_native
of each version gives you a clue.
Running
julia> @code_typed test1(v32)
CodeInfo(
2 1 ββ %1 = (Base.arraylen)(v)::Int64 ββ» length
β %2 = (Core.trunc_int)(Int32, %1)::Int32 βββ»β· toInt32
β %3 = (Core.sext_int)(Int64, %2)::Int64 ββββ checked_trunc_sint
β %4 = (Core.eq_int)(%1, %3)::Bool ββββ
ββββ goto #3 if not %4 ββββ
2 ββ goto #4 ββββ
3 ββ invoke Core.throw_inexacterror(:trunc::Symbol, Int32::Any, %1::Int64)::Union{}
ββββ $(Expr(:unreachable))::Union{} ββββ
4 ββ goto #5 βββ
5 ββ goto #6 ββ
6 ββ %11 = (Core.sext_int)(Core.Int64, %2)::Int64 βββ»β·β·β·β· promote
β %12 = (Base.sle_int)(1, %11)::Bool ββββ»β·β·β· Type
β (Base.sub_int)(%11, 1)::Int64 βββββ» unitrange_last
β %14 = (Base.sub_int)(1, 1)::Int64 ββββββ -
β %15 = (Base.ifelse)(%12, %11, %14)::Int64 βββββ
β %16 = (Base.slt_int)(%15, 1)::Bool βββ»β·β· isempty
ββββ goto #8 if not %16 ββ
7 ββ goto #9 ββ
8 ββ goto #9 ββ
9 ββ %20 = Ο (#7 => true, #8 => false)::Bool β
β %21 = Ο (#8 => 1)::Int64 β
β %22 = Ο (#8 => 1)::Int64 β
β %23 = (Base.not_int)(%20)::Bool β
ββββ goto #22 if not %23 β
10 β %25 = Ο (#9 => %21, #21 => %48)::Int64 β
β %26 = Ο (#9 => %22, #21 => %49)::Int64 β
3 β %27 = (Base.arrayref)(false, v, %25)::Int32 ββ» getindex
β %28 = (Base.add_int)(1, %25)::Int64 ββ» +
β %29 = (Base.sext_int)(Int64, %27)::Int64 βββ» rem
β %30 = (Base.add_int)(%29, %28)::Int64 βββ» +
β %31 = (Core.trunc_int)(Int32, %30)::Int32 βββ»β·β·β· convert
β %32 = (Core.sext_int)(Int64, %31)::Int64 ββββββ Type
β %33 = (Core.eq_int)(%30, %32)::Bool ββββββ toInt32
ββββ goto #12 if not %33 ββββββ checked_trunc_sint
11 β goto #13 ββββββ
12 β invoke Core.throw_inexacterror(:trunc::Symbol, Int32::Any, %30::Int64)::Union{}
ββββ $(Expr(:unreachable))::Union{} ββββββ
13 β goto #14 βββββ
14 β goto #15 ββββ
15 β goto #16 βββ
16 β (Base.arrayset)(false, v, %31, %25)::Array{Int32,1}
ββββ goto #17 ββ
17 β %43 = (%26 === %15)::Bool βββ» ==
ββββ goto #19 if not %43 ββ
18 β goto #20 ββ
19 β %46 = (Base.add_int)(%26, 1)::Int64 βββ» +
ββββ goto #20 ββ» iterate
20 β %48 = Ο (#19 => %46)::Int64 β
β %49 = Ο (#19 => %46)::Int64 β
β %50 = Ο (#18 => true, #19 => false)::Bool β
β %51 = (Base.not_int)(%50)::Bool β
ββββ goto #22 if not %51 β
21 β goto #10 β
22 β return β
) => Nothing
Specifically, see the large mix of Int32 and Int64s? The code is constantly converting between the two types, widening and narrowing.
Rather than the full @code_native
of both, here is a highly from @code_native test1(v64)
:
; Location: REPL[14]:2
L40:
movq %rax, %r8
andq $-16, %r8
leaq 1(%r8), %rdx
movabsq $140319310561344, %rsi # imm = 0x7F9EA2A94040
vmovdqa (%rsi), %ymm0
leaq 96(%rcx), %rdi
movabsq $140319310561376, %rsi # imm = 0x7F9EA2A94060
vpbroadcastq (%rsi), %ymm1
movabsq $140319310561384, %rsi # imm = 0x7F9EA2A94068
vpbroadcastq (%rsi), %ymm2
movabsq $140319310561392, %rsi # imm = 0x7F9EA2A94070
vpbroadcastq (%rsi), %ymm3
vpcmpeqd %ymm4, %ymm4, %ymm4
movabsq $140319310561400, %rsi # imm = 0x7F9EA2A94078
vpbroadcastq (%rsi), %ymm5
movq %r8, %rsi
nopl (%rax,%rax)
; Location: REPL[14]:3
; Function +; {
; Location: int.jl:53
L144:
vpaddq -96(%rdi), %ymm0, %ymm6
vpaddq -64(%rdi), %ymm0, %ymm7
vpsubq %ymm4, %ymm6, %ymm6
vpaddq %ymm1, %ymm7, %ymm7
vpaddq -32(%rdi), %ymm0, %ymm8
vpaddq %ymm2, %ymm8, %ymm8
vpaddq (%rdi), %ymm0, %ymm9
vpaddq %ymm3, %ymm9, %ymm9
;}
; Function setindex!; {
; Location: array.jl:769
vmovdqu %ymm6, -96(%rdi)
vmovdqu %ymm7, -64(%rdi)
vmovdqu %ymm8, -32(%rdi)
vmovdqu %ymm9, (%rdi)
vpaddq %ymm5, %ymm0, %ymm0
subq $-128, %rdi
addq $-16, %rsi
jne L144
;}}
Here, you see the code was vectorized (and also partially unrolling so more instructions can be queued at a given time), even though I didnβt even use @simd
!
This is a big pile of vector instructions (ymm
registers are 256 bits). If you instead do @code_native test1(v32)
, you will see no such vectorization or partial unrolling.
Remember that on a 64 bit system, integers default to Int64
. Therefore every single 1
you have is an Int64
. If you want to stick with Int32
for cache reasons, youβll have to go through and manually replace every instance of an integer type and explicitly specify them as an Int32. Eg,
const N=Int32(125)
...
@inline function csi_T(s::Proj, N::Int32)
@simd for T = Int32(1):N
@inbounds s.T[T] = T
end
end
@inline function csi_Yr(s::Proj, N::Int32)
s.Yr[1] = zero(Int32)
@simd for T = Int32(2):N
# Pick depending on your stance on Unicode.
# @inbounds s.Yr[T] = div(T, Int32(12)) + Int32(1)
@inbounds s.Yr[T] = T Γ· Int32(12) + Int32(1)
end
end
etc. Use @code_typed
to make sure your code is free of Int64
.
Or convert entirely to Int64. Just stay consistent.
Adding on to that, youβre also doing a lot of unnecessary conversion with Integer literals to Floats. Take a look at just this Integer division:
julia> @code_warntype 1/5
Body::Float64
59 1 β %1 = (Base.sitofp)(Float64, x)::Float64
β %2 = (Base.sitofp)(Float64, y)::Float64
β %3 = (Base.div_float)(%1, %2)::Float64
βββ return %3
# compared with:
julia> @code_warntype 1.0/5.0
Body::Float64
401 1 β %1 = (Base.div_float)(x, y)::Float64
βββ return %1
On its own there might not be a lot lost, but if itβs done a lot it absolutely will hurt performance. In general its best to write float literals when you already know there are only going to be floats involved.
In your example, this
@inline function csi_YearlyZCBMRAdjusted(s::Proj, N::Int32)
s.YearlyZCBMRAdjusted[1] = one(Float64)
@simd for T = 2:N
@inbounds s.YearlyZCBMRAdjusted[T] = 1 / ((1 + s.MarketSpreadAdjustedSpot[T]) ^ (T-1))
end
end
should look more like this:
@inline function csi_YearlyZCBMRAdjusted(s::Proj, N::Int32)
s.YearlyZCBMRAdjusted[1] = one(Float64)
@simd for T = 2:N
@inbounds s.YearlyZCBMRAdjusted[T] = 1.0 / ((1.0 + s.MarketSpreadAdjustedSpot[T]) ^ (T-1.0))
end
end
That function is pretty big - I donβt think you want to inline it. You may be leaving compiler optimisations on the table.
Have you tried the steps from the Profiling section? Without looking at your current code, itβs getting hard to tell where potential bottlenecks may be. Specifically how you call calcproj
, as that may be just as relevant as the function itself. Youβre also referencing a bunch of (what seems to be) non-constant global variables - donβt do that (Yield_GBP
, Spreadbands_GBP_Bond
, Marketspread
,β¦) (although that might not do anything for you, because Arrays are a mutable data structure and declaring them constant does little for the variables contained within).
I did find some unnecessary casts though:
p.MonthlyRateEachYear = if p.T == 1
0.0
else
((s[p.T-1].YearlyZCB / p.YearlyZCB) ^ OT) - 1
end
should be
p.MonthlyRateEachYear = if p.T == 1
0.0
else
((s[p.T-1].YearlyZCB / p.YearlyZCB) ^ OT) - 1.0
end
and likewise for p.MonthlyRateEachYear2
.
Hello Sukera
I tought I spotted all casts, but thank you for pointing that there are still some not fixed! Will reread the code again.
The seemingly global variables - I read about it and already before posting my results they were already constants (but global) in the form of n-dimensional array of either Int64 or Real64 or a const struct value with typed slots (again only Int64 or Real64).
Removing the @inline even in front of the big function actually makes it slightly slower - no idea why. But only slighlty, so - probably this is not the main problem of the speed difference.
Yes the function you mention is big, this is in the scenario where for each T variables are calculated one by one and the storage is an array of structs. Surprisingly this one is faster (and equivalent to the Java version). I havenβt extracted the formulas for the about 30+ variables in separate functions as the whole method is THE hot code, so any analysis without crossing method boundaries the compiler can do I assume would do only good. The opposite solution is with extracted formulas (but also individual loops) in the first posting and this one is slower).
What I see in the @code_llvm are a lot of bounds checks (which at least in the first solution are tried to be suppressed with @inbounds, but even after that there is a lot of checking. So I tried to use in the meantime StaticArrays.jl and FastArrays.jl for the first, column-wise solution, but without success for faster runtime. Looking currently at the SIMD.jl to see if I can do some simding manually, but there are problems with it in the domain itself, so I hoped that the compiler could automate that work better than me But will try.
Hope there are further suggestions.
With best regards
What are your conclusions from using the provided profiling capabilities? Is that function alone the bottleneck or are there others? How are you calling that function? Is it called more often than not as an initialization? Maybe a mutating version (of a Proj
with the push!
later on) would be better instead of always creating (and thus allocating) a fresh Proj
object. It may also be possible to create a bunch of new Proj
and push!
them in one call - thus only needing one resize for the Array instead of however many Proj
you create.
From what the code looks like, I think you are calling that function in a loop outside and pushing new Proj
objects onto your βhistoryβ. Is this correct?
As for the boundchecks, you can try putting in more @inbounds
, but I donβt think theyβre the majority of the holdup - itβs impossible to tell without knowing how this function is called.
If I understand correctly, there isnβt much to SIMD here since each run depends on the previous run.
In julia, by convention functions mutating one of their arguments have a !
appended to make it clear to a user that theyβre modifying one of their arguments - this does not have an impact on performance though, just makes the intent clearer.
As for the benchmark, BenchmarkTools allows interpolating of outer variables with $
(like @btime f($x)
) - this reduces overhead of retrieving the object. @time
in contrast includes compilation time, while @btime doesnβt. Also, you might want to use a setup for the benchmark, not to throw it off by increasing the size of the array evermore (like this: @btime calcprojouter(x, 125) setup=(x = copy($pa))
). Be sure to check the documentation of BenchmarkTools for more information.
One idea you could still try without modifying too much is precalculating all the variables before constructing the new Proj
and just giving those values to the constructor directly instead of creating a new Proj
with all zeros and then mutating it.
I donβt think this is correct. Perhaps @code_warntype
shows more operations, but when doing actual benchmarks, I get no difference between integer and float division. In fact, I always advise people to use integer literals because itβs more general, and you get better use of type promotion. For example 2*x
will promote correctly for a large range of different types of x
, while 2.0*x
will tend to promote to Float64
.
Examples:
julia> ai, bi = rand(1:1000, 10^6), rand(1:1000, 10^6);
julia> af, bf = rand(10^6), rand(10^6);
julia> @btime $ai ./ $bi;
2.863 ms (2 allocations: 7.63 MiB)
julia> @btime $af ./ $bf;
2.826 ms (2 allocations: 7.63 MiB)
julia> ai, bi = 1, 5
(1, 5)
julia> af, bf = 1.0, 5.0
(1.0, 5.0)
julia> @btime $ai / $bi
0.022 ns (0 allocations: 0 bytes)
0.2
julia> @btime $af / $bf
0.022 ns (0 allocations: 0 bytes)
0.2