How to optimise and be faster than Java?

rdeits · September 3, 2018, 1:26am

Are you only running @time once? If so, then you’re also timing the entire compilation process: see this section of the docs.

Or, to put it another way, if you run

@time for x in 1:100000
    calcprojsimdi(pa, N)
end
@time for x in 1:100000
    calcprojsimdi(pa, N)
end

are both results the same? Edit: or just use BenchmarkTools.jl and avoid the issue altogether.

Also, I don’t think it’s necessarily a good idea to @inline all of your functions. Does removing those annotations actually hurt your performance?

rdeits · September 3, 2018, 1:36am

Also, there’s no need to time your function manually running it 100,000 times. Just use GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language :

using BenchmarkTools

@benchmark calcprojsimdi($pa, $N)

BenchmarkTools takes care of ensuring the function is compiled before it is benchmarked and runs it enough times to get a reliable estimate of the execution time.

ChrisRackauckas · September 3, 2018, 1:42am

Unpack the values before the loop and see if it helps the optimizer.

Plamen · September 3, 2018, 2:38am

Hello rdeits

thank you for the suggestions.

Yes, I run them several times (also looked at the section you mention - seems most to be ok), but for 100’000 iterations the code seems to be stable - I receive consistent timings independent of the number of runs.

The lack of @inline actually reduces the performance by about 5% - seems the code is still in the cache.

Plamen · September 3, 2018, 2:41am

About BenchmarkTools - yes - I tried that one too, but a loop of some 100000 iterations gives me 1 to 1 comparison with a similar loop in Java after a prewarmed JVM. (There we have also something similar - called JMH, I know about at least several issues for not using such tools, but my case seems to me yet pretty simple to have just iterations).

Plamen · September 3, 2018, 2:47am

Hello Chris

not sure what you mean with upacking them - the main structure is a struct of arrays. For each iteration I have to get a value of some array, do some arithmetic using values at address T, T-1 or T+1 from other arrays in the struct or the external parameter arrays, and put the value back. As the values from the other arrays are used mostly once per calculation step, unpacking (if I understand correctly) would involve actually copying that values to local variables on the stack and performing them math on them. But as they are used only once, would that not produce actually additional copying without a desired effect around simd/caching?

Would be happy if you elaborate.

With best regards
Plamen

ChrisRackauckas · September 3, 2018, 2:54am

Exactly.

@inline function csi_MarketSpreadAdjustedSpot(s::Proj, N::Int32)
    MarketSpreadAdjustedSpot = s.MarketSpreadAdjustedSpot
    MarketSpreadAdjustedSpot[1] = zero(Float64)
    @simd for T = 2:N
        @inbounds MarketSpreadAdjustedSpot[T] = s.Spot[T] + s.CreditSpread[T] + MarketSpread
    end
end

Etc. I’ve seen the optimizer mess this up before (see @dextorious’s issue for an example of this)

rdeits · September 3, 2018, 2:57am

What Julia version are you using? And are you starting Julia with the -O3 flag?

Plamen · September 3, 2018, 3:01am

If that helps - here is some output from the @code_xxx directives for an example method. I hope there is a hint about something I fundamentally miss. (For example, I see that some SIMD instructions are used, but the setup/cleaup of the method is immense - I would assume that 2 indexes in 2 registers wiht some indirect indexing at least for a non-SIMD version would bring down the conde to may be 2 dozen CPU instructions - here I see much more and not sure why.
I use Julia 1.0 if that matters on a Core I7 CPU from 2015 (macOS) with the following features:

sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

And here the actual function:

@inline function csi_MonthlyInterpolatedZCBAdj(s::Proj, N::Int64)
    s.MonthlyInterpolatedZCBAdj[1] = one(Float64)
    @simd for T = 2:N
        @inbounds s.MonthlyInterpolatedZCBAdj[T] = (s.MonthlyInterpolatedZCBAdj[T-1] / (s.MonthlyRateEachYear2[s.Yr[T]+1] + 1))
    end
    nothing
end

@code_lowered csi_MonthlyInterpolatedZCBAdj(pa, 125)
CodeInfo(
    1 ─       nothing                                                                                                                                             │
238 │   %2  = (Main.one)(Main.Float64)                                                                                                                            │
    │   %3  = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj)                                                                                                   │
    │         (Base.setindex!)(%3, %2, 1)                                                                                                                         │
239 │         r#402 = 2:N                                                                                                                                         │╻  macro expansion
    │   %6  = Base.simd_outer_range                                                                                                                               ││
    │   %7  = (%6)(r#402)                                                                                                                                         ││
    │         #temp# = (Base.iterate)(%7)                                                                                                                         ││
    │   %9  = #temp# === nothing                                                                                                                                  ││
    │   %10 = (Base.not_int)(%9)                                                                                                                                  ││
    └──       goto #9 if not %10                                                                                                                                  ││
    2 ┄ %12 = #temp#                                                                                                                                              ││
    │         i#403 = (Core.getfield)(%12, 1)                                                                                                                     ││
    │   %14 = (Core.getfield)(%12, 2)                                                                                                                             ││
    │         Core.NewvarNode(:(T@_7))                                                                                                                            ││
    │   %16 = Base.simd_inner_length                                                                                                                              ││
    │   %17 = r#402                                                                                                                                               ││
    │         n#404 = (%16)(%17, i#403)                                                                                                                           ││
    │   %19 = (Main.zero)(n#404)                                                                                                                                  ││
    │   %20 = %19 < n#404                                                                                                                                         ││
    └──       goto #7 if not %20                                                                                                                                  ││
    3 ─       i#405 = (Main.zero)(n#404)                                                                                                                          ││
    4 ┄ %23 = i#405 < n#404                                                                                                                                       ││
    └──       goto #6 if not %23                                                                                                                                  ││
    5 ─ %25 = Base.simd_index                                                                                                                                     ││
    │   %26 = r#402                                                                                                                                               ││
    │   %27 = i#403                                                                                                                                               ││
    │         T@_11 = (%25)(%26, %27, i#405)                                                                                                                      ││
    │         $(Expr(:inbounds, true))                                                                                                                            ││╻  macro expansion
    │   %30 = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj)                                                                                                   │││
    │   %31 = T@_11 - 1                                                                                                                                           │││
    │   %32 = (Base.getindex)(%30, %31)                                                                                                                           │││
    │   %33 = (Base.getproperty)(s, :MonthlyRateEachYear2)                                                                                                        │││
    │   %34 = (Base.getproperty)(s, :Yr)                                                                                                                          │││
    │   %35 = (Base.getindex)(%34, T@_11)                                                                                                                         │││
    │   %36 = %35 + 1                                                                                                                                             │││
    │   %37 = (Base.getindex)(%33, %36)                                                                                                                           │││
    │   %38 = %37 + 1                                                                                                                                             │││
    │   %39 = %32 / %38                                                                                                                                           │││
    │   %40 = (Base.getproperty)(s, :MonthlyInterpolatedZCBAdj)                                                                                                   │││
    │         (Base.setindex!)(%40, %39, T@_11)                                                                                                                   │││
    │         val = %39                                                                                                                                           │││
    │         $(Expr(:inbounds, :pop))                                                                                                                            │││
    │         val                                                                                                                                                 │││
    │         i#405 = i#405 + 1                                                                                                                                   │││
    │         $(Expr(:simdloop, false))                                                                                                                           │╻  macro expansion
    └──       goto #4                                                                                                                                             ││
    6 ─       T@_7 = (Main.last)(r#402)                                                                                                                           ││
    7 ─       #temp# = (Base.iterate)(%7, %14)                                                                                                                    ││
    │   %50 = #temp# === nothing                                                                                                                                  ││
    │   %51 = (Base.not_int)(%50)                                                                                                                                 ││
    └──       goto #9 if not %51                                                                                                                                  ││
    8 ─       goto #2                                                                                                                                             ││
    9 ─       Main.nothing                                                                                                                                        ││
    └──       return Main.nothing                                                                                                                                 ││
)



@code_llvm csi_MonthlyInterpolatedZCBAdj(pa, 125)

; Function csi_MonthlyInterpolatedZCBAdj
; Location: /Users/ps/git/try-julia-01/try07.jl:238
define void @julia_csi_MonthlyInterpolatedZCBAdj_37744(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(272), i64) {
top:
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
  %2 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %2, i8 0, i32 24, i32 0, i1 false)
  %3 = call %jl_value_t*** inttoptr (i64 4364046112 to %jl_value_t*** ()*)() #5
; Function getproperty; {
; Location: sysimg.jl:18
  %4 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %5 = bitcast %jl_value_t addrspace(10)** %4 to i64*
  store i64 2, i64* %5
  %6 = getelementptr %jl_value_t**, %jl_value_t*** %3, i32 0
  %7 = load %jl_value_t**, %jl_value_t*** %6
  %8 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %9 = bitcast %jl_value_t addrspace(10)** %8 to %jl_value_t***
  store %jl_value_t** %7, %jl_value_t*** %9
  %10 = bitcast %jl_value_t*** %6 to %jl_value_t addrspace(10)***
  store %jl_value_t addrspace(10)** %gcframe, %jl_value_t addrspace(10)*** %10
  %11 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %12 = bitcast %jl_value_t addrspace(11)* %11 to i8 addrspace(11)*
  %13 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 88
  %14 = bitcast i8 addrspace(11)* %13 to %jl_value_t addrspace(10)* addrspace(11)*
  %15 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %14, align 8
;}
; Function setindex!; {
; Location: array.jl:769
  %16 = addrspacecast %jl_value_t addrspace(10)* %15 to %jl_value_t addrspace(11)*
  %17 = bitcast %jl_value_t addrspace(11)* %16 to %jl_array_t addrspace(11)*
  %18 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %17, i64 0, i32 1
  %19 = load i64, i64 addrspace(11)* %18, align 8
  %20 = icmp eq i64 %19, 0
  br i1 %20, label %oob, label %idxend

L28:                                              ; preds = %idxend
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
  call void @julia_throw_overflowerr_binaryop_24265(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4489405088 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64%32, i64 1)
  call void @llvm.trap()
  unreachable

L84:                                              ; preds = %L39.us27, %L33.lr.ph
;}}}
; Location: simdloop.jl:65
  %21 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %22 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %21
  %23 = getelementptr %jl_value_t**, %jl_value_t*** %3, i32 0
  %24 = bitcast %jl_value_t*** %23 to %jl_value_t addrspace(10)**
  store %jl_value_t addrspace(10)* %22, %jl_value_t addrspace(10)** %24
  ret void

oob:                                              ; preds = %top
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:238
; Function setindex!; {
; Location: array.jl:769
  %25 = alloca i64, align 8
  store i64 1, i64* %25, align 8
  %26 = addrspacecast %jl_value_t addrspace(10)* %15 to %jl_value_t addrspace(12)*
  %27 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
  store %jl_value_t addrspace(10)* %15, %jl_value_t addrspace(10)** %27
  call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %26, i64* nonnull %25, i64 1)
  unreachable

idxend:                                           ; preds = %top
  %28 = bitcast %jl_value_t addrspace(11)* %16 to double addrspace(13)* addrspace(11)*
  %29 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %28, align 8
  store double 1.000000e+00, double addrspace(13)* %29, align 8
;}
; Location: /Users/ps/git/try-julia-01/try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:65
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:255
; Function unitrange_last; {
; Location: range.jl:260
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
  %30 = icmp sgt i64 %1, 1
;}}
  %31 = select i1 %30, i64 %1, i64 1
;}}}
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_sub; {
; Location: checked.jl:226
; Function sub_with_overflow; {
; Location: checked.jl:198
  %32 = add nsw i64 %31, -2
;}}
; Function checked_add; {
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %33 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %32, i64 1)
  %34 = extractvalue { i64, i1 } %33, 1
;}
; Location: checked.jl:170
  br i1 %34, label %L28, label %L33.lr.ph

L33.lr.ph:                                        ; preds = %idxend
; Location: checked.jl:169
; Function add_with_overflow; {
; Location: checked.jl:136
  %35 = extractvalue { i64, i1 } %33, 0
  %36 = icmp slt i64 %35, 1
;}
; Location: checked.jl:170
  br i1 %36, label %L84, label %L39.lr.ph.us26

L39.lr.ph.us26:                                   ; preds = %L33.lr.ph
  %37 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %14, align 8
  %38 = addrspacecast %jl_value_t addrspace(10)* %37 to %jl_value_t addrspace(11)*
  %39 = bitcast %jl_value_t addrspace(11)* %38 to double addrspace(13)* addrspace(11)*
  %40 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 64
  %41 = bitcast i8 addrspace(11)* %40 to %jl_value_t addrspace(10)* addrspace(11)*
  %42 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %41, align 8
  %43 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 8
  %44 = bitcast i8 addrspace(11)* %43 to %jl_value_t addrspace(10)* addrspace(11)*
  %45 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %44, align 8
  %46 = addrspacecast %jl_value_t addrspace(10)* %45 to %jl_value_t addrspace(11)*
  %47 = bitcast %jl_value_t addrspace(11)* %46 to i64 addrspace(13)* addrspace(11)*
  %48 = addrspacecast %jl_value_t addrspace(10)* %42 to %jl_value_t addrspace(11)*
  %49 = bitcast %jl_value_t addrspace(11)* %48 to double addrspace(13)* addrspace(11)*
  %50 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %39, align 8
  %51 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %47, align 8
  %52 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %49, align 8
;}}}
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: /Users/ps/git/try-julia-01/try07.jl:240
; Function getindex; {
; Location: array.jl:731
  %.pre = load double, double addrspace(13)* %50, align 8
;}}
; Location: simdloop.jl:71
  br label %L39.us27

L39.us27:                                         ; preds = %L39.us27, %L39.lr.ph.us26
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: /Users/ps/git/try-julia-01/try07.jl:240
; Function getindex; {
; Location: array.jl:731
  %53 = phi double [ %.pre, %L39.lr.ph.us26 ], [ %60, %L39.us27 ]
  %value_phi421.us28 = phi i64 [ 0, %L39.lr.ph.us26 ], [ %54, %L39.us27 ]
  %54 = add nuw nsw i64 %value_phi421.us28, 1
  %55 = getelementptr inbounds i64, i64 addrspace(13)* %51, i64 %54
  %56 = load i64, i64 addrspace(13)* %55, align 8
  %57 = getelementptr inbounds double, double addrspace(13)* %52, i64 %56
  %58 = load double, double addrspace(13)* %57, align 8
;}
; Function +; {
; Location: promotion.jl:313
; Function +; {
; Location: float.jl:395
  %59 = fadd double %58, 1.000000e+00
;}}
; Function /; {
; Location: float.jl:401
  %60 = fdiv double %53, %59
;}
; Function setindex!; {
; Location: array.jl:769
  %61 = getelementptr inbounds double, double addrspace(13)* %50, i64 %54
  store double %60, double addrspace(13)* %61, align 8
;}}
; Location: simdloop.jl:71
; Function <; {
; Location: int.jl:49
  %62 = icmp ult i64 %54, %35
;}
  br i1 %62, label %L39.us27, label %L84
;}
}

Plamen · September 3, 2018, 3:02am

And the actual assembler produced:

@code_warntype csi_MonthlyInterpolatedZCBAdj(pa, 125)
Body::Nothing
238 1 ── %1  = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1}                                                                         │╻       getproperty
    │          (Base.arrayset)(true, %1, 1.0, 1)                                                                                                        │╻       setindex!
239 │    %3  = (Base.sle_int)(2, N)::Bool                                                                                                               │╻╷╷╷╷╷  macro expansion
    │          (Base.sub_int)(N, 2)                                                                                                                     ││╻       Colon
    │    %5  = (Base.ifelse)(%3, N, 1)::Int64                                                                                                           │││┃│      Type
    │    %6  = %new(UnitRange{Int64}, 2, %5)::UnitRange{Int64}                                                                                          ││││
    │          (Base.ifelse)(true, 0, -1)                                                                                                               ││╻╷╷╷    simd_outer_range
    │    %8  = (Base.slt_int)(0, 0)::Bool                                                                                                               │││╻╷╷     isempty
    └───       goto #3 if not %8                                                                                                                        │││
    2 ──       goto #4                                                                                                                                  │││
    3 ──       goto #4                                                                                                                                  │││
    4 ┄─ %12 = φ (#2 => true, #3 => false)::Bool                                                                                                        ││
    │    %13 = φ (#3 => 0)::Int64                                                                                                                       ││
    │    %14 = (Base.not_int)(%12)::Bool                                                                                                                ││
    └───       goto #34 if not %14                                                                                                                      ││
    5 ┄─ %16 = φ (#4 => %13, #33 => %79)::Int64                                                                                                         ││
    │    %17 = (Base.Checked.checked_ssub_int)(%5, 2)::Tuple{Int64,Bool}                                                                                ││╻╷╷╷    simd_inner_length
    │    %18 = (Base.getfield)(%17, 1, true)::Int64                                                                                                     │││╻╷      length
    │    %19 = (Base.getfield)(%17, 2, true)::Bool                                                                                                      ││││╻       checked_sub
    └───       goto #7 if not %19                                                                                                                       ││││╻       checked_sub
    6 ──       invoke Base.Checked.throw_overflowerr_binaryop(:-::Symbol, %5::Int64, 2::Int64)                                                          │││││
    └───       $(Expr(:unreachable))                                                                                                                    │││││
    7 ──       goto #8                                                                                                                                  │││││
    8 ── %24 = (Base.Checked.checked_sadd_int)(%18, 1)::Tuple{Int64,Bool}                                                                               │││││╻       add_with_overflow
    │    %25 = (Base.getfield)(%24, 1, true)::Int64                                                                                                     ││││││╻╷      indexed_iterate
    │    %26 = (Base.getfield)(%24, 2, true)::Bool                                                                                                      ││││││╻       getindex
    └───       goto #10 if not %26                                                                                                                      ││││╻       checked_add
    9 ──       invoke Base.Checked.throw_overflowerr_binaryop(:+::Symbol, %18::Int64, 1::Int64)                                                         │││││
    └───       $(Expr(:unreachable))                                                                                                                    │││││
    10 ─       goto #11                                                                                                                                 │││││
    11 ─       goto #12                                                                                                                                 ││││
    12 ─       goto #13                                                                                                                                 │││
    13 ─ %33 = (Base.slt_int)(0, %25)::Bool                                                                                                             ││╻       <
    └───       goto #29 if not %33                                                                                                                      ││
    14 ─       nothing                                                                                                                                  │
    15 ┄ %36 = φ (#14 => 0, #27 => %70)::Int64                                                                                                          ││
    │    %37 = (Base.slt_int)(%36, %25)::Bool                                                                                                           ││╻       <
    └───       goto #28 if not %37                                                                                                                      ││
    16 ─ %39 = (Base.add_int)(%36, 1)::Int64                                                                                                            ││╻╷      simd_index
    │    %40 = (Base.add_int)(2, %39)::Int64                                                                                                            │││╻       getindex
    │    %41 = (Base.sub_int)(%40, 1)::Int64                                                                                                            ││││╻       -
    └───       goto #25 if not false                                                                                                                    │││╻       getindex
    17 ─ %43 = (Base.slt_int)(0, %39)::Bool                                                                                                             ││││╻╷      _in_unit_range
    └───       goto #21 if not %43                                                                                                                      │││││
    18 ─ %45 = (Base.sle_int)(%41, %5)::Bool                                                                                                            │││││╻       <=
    └───       goto #20 if not %45                                                                                                                      │││││
    19 ─ %47 = (Base.sle_int)(2, %41)::Bool                                                                                                             ││││││╻       <=
    └───       goto #22                                                                                                                                 │││││
    20 ─       goto #22                                                                                                                                 │││││
    21 ─       goto #22                                                                                                                                 │││││
    22 ┄ %51 = φ (#19 => %47, #20 => false, #21 => false)::Bool                                                                                         ││││
    └───       goto #24 if not %51                                                                                                                      ││││
    23 ─       goto #25                                                                                                                                 ││││
    24 ─       invoke Base.throw_boundserror(%6::UnitRange{Int64}, %39::Int64)                                                                          ││││
    └───       $(Expr(:unreachable))                                                                                                                    ││││
    25 ┄       goto #26                                                                                                                                 ││││
    26 ─       goto #27                                                                                                                                 ││╻       simd_index
    27 ─ %58 = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1}                                                                         ││╻╷      macro expansion
    │    %59 = (Base.sub_int)(%41, 1)::Int64                                                                                                            │││╻       -
    │    %60 = (Base.arrayref)(false, %58, %59)::Float64                                                                                                │││╻       getindex
    │    %61 = (Base.getfield)(s, :MonthlyRateEachYear2)::Array{Float64,1}                                                                              │││╻       getproperty
    │    %62 = (Base.getfield)(s, :Yr)::Array{Int64,1}                                                                                                  ││││
    │    %63 = (Base.arrayref)(false, %62, %41)::Int64                                                                                                  │││╻       getindex
    │    %64 = (Base.add_int)(%63, 1)::Int64                                                                                                            │││╻       +
    │    %65 = (Base.arrayref)(false, %61, %64)::Float64                                                                                                │││╻       getindex
    │    %66 = (Base.add_float)(%65, 1.0)::Float64                                                                                                      ││││╻       +
    │    %67 = (Base.div_float)(%60, %66)::Float64                                                                                                      │││╻       /
    │    %68 = (Base.getfield)(s, :MonthlyInterpolatedZCBAdj)::Array{Float64,1}                                                                         │││╻       getproperty
    │          (Base.arrayset)(false, %68, %67, %41)                                                                                                    │││╻       setindex!
    │    %70 = (Base.add_int)(%36, 1)::Int64                                                                                                            │││╻       +
    │          $(Expr(:simdloop, false))                                                                                                                │╻       macro expansion
    └───       goto #15                                                                                                                                 ││
    28 ─       nothing                                                                                                                                  │
    29 ─ %74 = (%16 === 0)::Bool                                                                                                                        ││╻╷      iterate
    └───       goto #31 if not %74                                                                                                                      │││
    30 ─       goto #32                                                                                                                                 │││
    31 ─ %77 = (Base.add_int)(%16, 1)::Int64                                                                                                            │││╻       +
    └───       goto #32                                                                                                                                 ││╻       iterate
    32 ┄ %79 = φ (#31 => %77)::Int64                                                                                                                    ││
    │    %80 = φ (#30 => true, #31 => false)::Bool                                                                                                      ││
    │    %81 = (Base.not_int)(%80)::Bool                                                                                                                ││
    └───       goto #34 if not %81                                                                                                                      ││
    33 ─       goto #5                                                                                                                                  ││
    34 ─       return Main.nothing


@code_native csi_MonthlyInterpolatedZCBAdj(pa, 125)
    .section    __TEXT,__text,regular,pure_instructions
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: try07.jl:238
    pushl   %ebp
    decl    %eax
    movl    %esp, %ebp
    incl    %ecx
    pushl   %esi
    pushl   %ebx
    decl    %eax
    subl    $32, %esp
    decl    %ecx
    movl    %esi, %esi
    decl    %eax
    movl    %edi, %ebx
    vxorpd  %xmm0, %xmm0, %xmm0
    vmovapd %xmm0, -48(%ebp)
    decl    %eax
    movl    $0, -32(%ebp)
    decl    %eax
    movl    $69078816, %eax         ## imm = 0x41E0F20
    addl    %eax, (%eax)
    addb    %al, (%eax)
    calll   *%eax
; Function getproperty; {
; Location: sysimg.jl:18
    decl    %eax
    movl    $2, -48(%ebp)
    decl    %eax
    movl    (%eax), %ecx
    decl    %eax
    movl    %ecx, -40(%ebp)
    decl    %eax
    leal    -48(%ebp), %ecx
    decl    %eax
    movl    %ecx, (%eax)
    decl    %eax
    movl    88(%ebx), %edi
;}
; Function setindex!; {
; Location: array.jl:769
    decl    %eax
    cmpl    $0, 8(%edi)
    je  L221
    decl    %eax
    movl    (%edi), %ecx
    decl    %eax
    movl    $0, %edx
    addb    %al, (%eax)
    lock
    aas
    decl    %eax
    movl    %edx, (%ecx)
;}
; Location: try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:65
; Function Colon; {
; Location: range.jl:5
; Function Type; {
; Location: range.jl:255
; Function unitrange_last; {
; Location: range.jl:260
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
    decl    %ebp
    testl   %esi, %esi
    movl    $1, %esi
;}}
    decl    %ecx
    cmovgl  %esi, %esi
;}}}}
; Function macro expansion; {
; Location: checked.jl:198
    decl    %eax
    addl    $-2, %esi
;}
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: checked.jl:136
    decl    %ecx
    movl    %esi, %eax
    decl    %ecx
    incl    %eax
;}
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
    jo  L260
    decl    %ebp
    testl   %eax, %eax
; Location: checked.jl:170
    jle L205
;}}}}
; Function macro expansion; {
; Location: checked.jl
    decl    %eax
    movl    88(%ebx), %edx
    decl    %eax
    movl    8(%ebx), %esi
    decl    %eax
    movl    64(%ebx), %edi
    decl    %eax
    movl    (%edx), %edx
    decl    %eax
    movl    (%esi), %esi
    decl    %eax
    movl    (%edi), %edi
;}
; Function macro expansion; {
; Location: simdloop.jl:73
; Function macro expansion; {
; Location: try07.jl:240
; Function getindex; {
; Location: array.jl:731
    vmovsd  (%edx), %xmm0           ## xmm0 = mem[0],zero
    xorl    %ebx, %ebx
    decl    %eax
    movl    $677370832, %ecx        ## imm = 0x285FDBD0
    addl    %eax, (%eax)
    addb    %al, (%eax)
    vmovsd  (%ecx), %xmm1           ## xmm1 = mem[0],zero
    nopl    (%eax)
L176:
    decl    %eax
    movl    8(%esi,%ebx,8), %ecx
;}}
; Function macro expansion; {
; Location: float.jl:395
    vaddsd  (%edi,%ecx,8), %xmm1, %xmm2
;}
; Function macro expansion; {
; Location: try07.jl:240
; Function /; {
; Location: float.jl:401
    vdivsd  %xmm2, %xmm0, %xmm0
;}
; Function setindex!; {
; Location: array.jl:769
    vmovsd  %xmm0, 8(%edx,%ebx,8)
;}}
; Function macro expansion; {
; Location: array.jl:731
    decl    %eax
    leal    1(%ebx), %ebx
;}}
; Function macro expansion; {
; Location: int.jl:49
    decl    %esp
    cmpl    %eax, %ebx
;}}
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: simdloop.jl:71
    jb  L176
; Location: simdloop.jl:65
L205:
    decl    %eax
    movl    -40(%ebp), %ecx
    decl    %eax
    movl    %ecx, (%eax)
    decl    %eax
    leal    -16(%ebp), %esp
    popl    %ebx
    incl    %ecx
    popl    %esi
    popl    %ebp
    retl
;}
; Function csi_MonthlyInterpolatedZCBAdj {
; Location: try07.jl:238
; Function setindex!; {
; Location: array.jl:769
L221:
    decl    %eax
    movl    %esp, %eax
    decl    %eax
    leal    -16(%eax), %esi
    decl    %eax
    movl    %esi, %esp
    decl    %eax
    movl    $1, -16(%eax)
    decl    %eax
    movl    %edi, -32(%ebp)
    decl    %eax
    movl    $69208848, %eax         ## imm = 0x4200B10
    addl    %eax, (%eax)
    addb    %al, (%eax)
    movl    $1, %edx
    calll   *%eax
;}
; Location: try07.jl:239
; Function macro expansion; {
; Location: simdloop.jl:67
; Function simd_inner_length; {
; Location: simdloop.jl:47
; Function length; {
; Location: range.jl:521
; Function checked_add; {
; Location: checked.jl:170
L260:
    decl    %eax
    movl    $209932240, %eax        ## imm = 0xC834FD0
    addl    %eax, (%eax)
    addb    %al, (%eax)
    decl    %eax
    movl    $194437792, %edi        ## imm = 0xB96E2A0
    addl    %eax, (%eax)
    addb    %al, (%eax)
    movl    $1, %edx
    calll   *%eax
    ud2
    nopw    %cs:(%eax,%eax)
;}}}}}

Plamen · September 3, 2018, 3:20am

v 1.0 with -O3 turned on.

rdeits · September 3, 2018, 3:27am

Ok, that’s good. Versions prior to 0.7 required you to build the system image yourself to get all the SIMD optimizations, but that’s no longer the case.

Elrod · September 3, 2018, 5:23am

I think improperly handing Int32 is causing much of your issues. Consider this test case:

function test1(v)
    @inbounds for i ∈ 1:eltype(v)(length(v))
        v[i] += 1 + i
    end
end
v32 = collect(Int32(1):Int32(10^3)); # Array{Int32,1}
v64 = collect(1:10^3); # Array{Int64,1}

Now…

julia> @benchmark test1($v32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.406 μs (0.00% GC)
  median time:      2.429 μs (0.00% GC)
  mean time:        2.447 μs (0.00% GC)
  maximum time:     4.851 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark test1($v64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     207.042 ns (0.00% GC)
  median time:      210.025 ns (0.00% GC)
  mean time:        216.906 ns (0.00% GC)
  maximum time:     693.890 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     565

Over 10x faster!

Inspecting the @code_typed and @code_native of each version gives you a clue.
Running

julia> @code_typed test1(v32)
CodeInfo(
2 1 ── %1  = (Base.arraylen)(v)::Int64                 │╻      length
  │    %2  = (Core.trunc_int)(Int32, %1)::Int32        ││╻╷     toInt32
  │    %3  = (Core.sext_int)(Int64, %2)::Int64         │││┃      checked_trunc_sint
  │    %4  = (Core.eq_int)(%1, %3)::Bool               ││││   
  └───       goto #3 if not %4                         ││││   
  2 ──       goto #4                                   ││││   
  3 ──       invoke Core.throw_inexacterror(:trunc::Symbol, Int32::Any, %1::Int64)::Union{}
  └───       $(Expr(:unreachable))::Union{}            ││││   
  4 ┄─       goto #5                                   │││    
  5 ──       goto #6                                   ││     
  6 ── %11 = (Core.sext_int)(Core.Int64, %2)::Int64    ││╻╷╷╷╷  promote
  │    %12 = (Base.sle_int)(1, %11)::Bool              │││╻╷╷╷   Type
  │          (Base.sub_int)(%11, 1)::Int64             ││││╻      unitrange_last
  │    %14 = (Base.sub_int)(1, 1)::Int64               │││││┃      -
  │    %15 = (Base.ifelse)(%12, %11, %14)::Int64       │││││  
  │    %16 = (Base.slt_int)(%15, 1)::Bool              ││╻╷╷    isempty
  └───       goto #8 if not %16                        ││     
  7 ──       goto #9                                   ││     
  8 ──       goto #9                                   ││     
  9 ┄─ %20 = φ (#7 => true, #8 => false)::Bool         │      
  │    %21 = φ (#8 => 1)::Int64                        │      
  │    %22 = φ (#8 => 1)::Int64                        │      
  │    %23 = (Base.not_int)(%20)::Bool                 │      
  └───       goto #22 if not %23                       │      
  10 ┄ %25 = φ (#9 => %21, #21 => %48)::Int64          │      
  │    %26 = φ (#9 => %22, #21 => %49)::Int64          │      
3 │    %27 = (Base.arrayref)(false, v, %25)::Int32     │╻      getindex
  │    %28 = (Base.add_int)(1, %25)::Int64             │╻      +
  │    %29 = (Base.sext_int)(Int64, %27)::Int64        ││╻      rem
  │    %30 = (Base.add_int)(%29, %28)::Int64           ││╻      +
  │    %31 = (Core.trunc_int)(Int32, %30)::Int32       ││╻╷╷╷   convert
  │    %32 = (Core.sext_int)(Int64, %31)::Int64        │││┃││    Type
  │    %33 = (Core.eq_int)(%30, %32)::Bool             ││││┃│     toInt32
  └───       goto #12 if not %33                       │││││┃      checked_trunc_sint
  11 ─       goto #13                                  ││││││ 
  12 ─       invoke Core.throw_inexacterror(:trunc::Symbol, Int32::Any, %30::Int64)::Union{}
  └───       $(Expr(:unreachable))::Union{}            ││││││ 
  13 ┄       goto #14                                  │││││  
  14 ─       goto #15                                  ││││   
  15 ─       goto #16                                  │││    
  16 ─       (Base.arrayset)(false, v, %31, %25)::Array{Int32,1}
  └───       goto #17                                  ││     
  17 ─ %43 = (%26 === %15)::Bool                       ││╻      ==
  └───       goto #19 if not %43                       ││     
  18 ─       goto #20                                  ││     
  19 ─ %46 = (Base.add_int)(%26, 1)::Int64             ││╻      +
  └───       goto #20                                  │╻      iterate
  20 ┄ %48 = φ (#19 => %46)::Int64                     │      
  │    %49 = φ (#19 => %46)::Int64                     │      
  │    %50 = φ (#18 => true, #19 => false)::Bool       │      
  │    %51 = (Base.not_int)(%50)::Bool                 │      
  └───       goto #22 if not %51                       │      
  21 ─       goto #10                                  │      
  22 ─       return                                    │      
) => Nothing

Specifically, see the large mix of Int32 and Int64s? The code is constantly converting between the two types, widening and narrowing.

Rather than the full @code_native of both, here is a highly from @code_native test1(v64):

; Location: REPL[14]:2
L40:
	movq	%rax, %r8
	andq	$-16, %r8
	leaq	1(%r8), %rdx
	movabsq	$140319310561344, %rsi  # imm = 0x7F9EA2A94040
	vmovdqa	(%rsi), %ymm0
	leaq	96(%rcx), %rdi
	movabsq	$140319310561376, %rsi  # imm = 0x7F9EA2A94060
	vpbroadcastq	(%rsi), %ymm1
	movabsq	$140319310561384, %rsi  # imm = 0x7F9EA2A94068
	vpbroadcastq	(%rsi), %ymm2
	movabsq	$140319310561392, %rsi  # imm = 0x7F9EA2A94070
	vpbroadcastq	(%rsi), %ymm3
	vpcmpeqd	%ymm4, %ymm4, %ymm4
	movabsq	$140319310561400, %rsi  # imm = 0x7F9EA2A94078
	vpbroadcastq	(%rsi), %ymm5
	movq	%r8, %rsi
	nopl	(%rax,%rax)
; Location: REPL[14]:3
; Function +; {
; Location: int.jl:53
L144:
	vpaddq	-96(%rdi), %ymm0, %ymm6
	vpaddq	-64(%rdi), %ymm0, %ymm7
	vpsubq	%ymm4, %ymm6, %ymm6
	vpaddq	%ymm1, %ymm7, %ymm7
	vpaddq	-32(%rdi), %ymm0, %ymm8
	vpaddq	%ymm2, %ymm8, %ymm8
	vpaddq	(%rdi), %ymm0, %ymm9
	vpaddq	%ymm3, %ymm9, %ymm9
;}
; Function setindex!; {
; Location: array.jl:769
	vmovdqu	%ymm6, -96(%rdi)
	vmovdqu	%ymm7, -64(%rdi)
	vmovdqu	%ymm8, -32(%rdi)
	vmovdqu	%ymm9, (%rdi)
	vpaddq	%ymm5, %ymm0, %ymm0
	subq	$-128, %rdi
	addq	$-16, %rsi
	jne	L144
;}}

Here, you see the code was vectorized (and also partially unrolling so more instructions can be queued at a given time), even though I didn’t even use @simd!
This is a big pile of vector instructions (ymm registers are 256 bits). If you instead do @code_native test1(v32), you will see no such vectorization or partial unrolling.

Remember that on a 64 bit system, integers default to Int64. Therefore every single 1 you have is an Int64. If you want to stick with Int32 for cache reasons, you’ll have to go through and manually replace every instance of an integer type and explicitly specify them as an Int32. Eg,

const N=Int32(125)

...

@inline function csi_T(s::Proj, N::Int32)
    @simd for T = Int32(1):N
        @inbounds s.T[T] = T
    end
end

@inline function csi_Yr(s::Proj, N::Int32)
    s.Yr[1] = zero(Int32)
    @simd for T = Int32(2):N
        # Pick depending on your stance on Unicode.
        # @inbounds s.Yr[T] = div(T, Int32(12)) + Int32(1)
        @inbounds s.Yr[T] = T ÷ Int32(12) + Int32(1)
    end
end

etc. Use @code_typed to make sure your code is free of Int64.

Or convert entirely to Int64. Just stay consistent.

Sukera · September 3, 2018, 7:51am

Adding on to that, you’re also doing a lot of unnecessary conversion with Integer literals to Floats. Take a look at just this Integer division:

julia> @code_warntype 1/5
Body::Float64
59 1 ─ %1 = (Base.sitofp)(Float64, x)::Float64
   │   %2 = (Base.sitofp)(Float64, y)::Float64
   │   %3 = (Base.div_float)(%1, %2)::Float64 
   └──      return %3

# compared with:

julia> @code_warntype 1.0/5.0
Body::Float64
401 1 ─ %1 = (Base.div_float)(x, y)::Float64
    └──      return %1

On its own there might not be a lot lost, but if it’s done a lot it absolutely will hurt performance. In general its best to write float literals when you already know there are only going to be floats involved.

In your example, this

@inline function csi_YearlyZCBMRAdjusted(s::Proj, N::Int32)
    s.YearlyZCBMRAdjusted[1] = one(Float64)
    @simd for T = 2:N
        @inbounds s.YearlyZCBMRAdjusted[T] = 1 / ((1 + s.MarketSpreadAdjustedSpot[T]) ^ (T-1))
    end
end

should look more like this:

@inline function csi_YearlyZCBMRAdjusted(s::Proj, N::Int32)
    s.YearlyZCBMRAdjusted[1] = one(Float64)
    @simd for T = 2:N 
        @inbounds s.YearlyZCBMRAdjusted[T] = 1.0 / ((1.0 + s.MarketSpreadAdjustedSpot[T]) ^ (T-1.0))
    end
end

Sukera · September 3, 2018, 7:11pm

That function is pretty big - I don’t think you want to inline it. You may be leaving compiler optimisations on the table.

Have you tried the steps from the Profiling section? Without looking at your current code, it’s getting hard to tell where potential bottlenecks may be. Specifically how you call calcproj, as that may be just as relevant as the function itself. You’re also referencing a bunch of (what seems to be) non-constant global variables - don’t do that (Yield_GBP, Spreadbands_GBP_Bond, Marketspread,…) (although that might not do anything for you, because Arrays are a mutable data structure and declaring them constant does little for the variables contained within).

I did find some unnecessary casts though:

p.MonthlyRateEachYear = if p.T == 1 
                           0.0 
                        else 
                           ((s[p.T-1].YearlyZCB / p.YearlyZCB) ^ OT) - 1 
                        end

should be

p.MonthlyRateEachYear = if p.T == 1 
                           0.0 
                        else 
                           ((s[p.T-1].YearlyZCB / p.YearlyZCB) ^ OT) - 1.0
                        end

and likewise for p.MonthlyRateEachYear2.

Plamen · September 3, 2018, 8:11pm

Hello Sukera

I tought I spotted all casts, but thank you for pointing that there are still some not fixed! Will reread the code again.

The seemingly global variables - I read about it and already before posting my results they were already constants (but global) in the form of n-dimensional array of either Int64 or Real64 or a const struct value with typed slots (again only Int64 or Real64).

Removing the @inline even in front of the big function actually makes it slightly slower - no idea why. But only slighlty, so - probably this is not the main problem of the speed difference.

Yes the function you mention is big, this is in the scenario where for each T variables are calculated one by one and the storage is an array of structs. Surprisingly this one is faster (and equivalent to the Java version). I haven’t extracted the formulas for the about 30+ variables in separate functions as the whole method is THE hot code, so any analysis without crossing method boundaries the compiler can do I assume would do only good. The opposite solution is with extracted formulas (but also individual loops) in the first posting and this one is slower).

What I see in the @code_llvm are a lot of bounds checks (which at least in the first solution are tried to be suppressed with @inbounds, but even after that there is a lot of checking. So I tried to use in the meantime StaticArrays.jl and FastArrays.jl for the first, column-wise solution, but without success for faster runtime. Looking currently at the SIMD.jl to see if I can do some simding manually, but there are problems with it in the domain itself, so I hoped that the compiler could automate that work better than me But will try.

Hope there are further suggestions.

With best regards

Sukera · September 3, 2018, 8:20pm

What are your conclusions from using the provided profiling capabilities? Is that function alone the bottleneck or are there others? How are you calling that function? Is it called more often than not as an initialization? Maybe a mutating version (of a Proj with the push! later on) would be better instead of always creating (and thus allocating) a fresh Proj object. It may also be possible to create a bunch of new Proj and push! them in one call - thus only needing one resize for the Array instead of however many Proj you create.

From what the code looks like, I think you are calling that function in a loop outside and pushing new Proj objects onto your “history”. Is this correct?

As for the boundchecks, you can try putting in more @inbounds, but I don’t think they’re the majority of the holdup - it’s impossible to tell without knowing how this function is called.

Sukera · September 3, 2018, 9:23pm

If I understand correctly, there isn’t much to SIMD here since each run depends on the previous run.

In julia, by convention functions mutating one of their arguments have a ! appended to make it clear to a user that they’re modifying one of their arguments - this does not have an impact on performance though, just makes the intent clearer.

As for the benchmark, BenchmarkTools allows interpolating of outer variables with $ (like @btime f($x)) - this reduces overhead of retrieving the object. @time in contrast includes compilation time, while @btime doesn’t. Also, you might want to use a setup for the benchmark, not to throw it off by increasing the size of the array evermore (like this: @btime calcprojouter(x, 125) setup=(x = copy($pa))). Be sure to check the documentation of BenchmarkTools for more information.

One idea you could still try without modifying too much is precalculating all the variables before constructing the new Proj and just giving those values to the constructor directly instead of creating a new Proj with all zeros and then mutating it.

DNF · September 3, 2018, 9:26pm

I don’t think this is correct. Perhaps @code_warntype shows more operations, but when doing actual benchmarks, I get no difference between integer and float division. In fact, I always advise people to use integer literals because it’s more general, and you get better use of type promotion. For example 2*x will promote correctly for a large range of different types of x, while 2.0*x will tend to promote to Float64.

Examples:

julia> ai, bi = rand(1:1000, 10^6), rand(1:1000, 10^6);

julia> af, bf = rand(10^6), rand(10^6);

julia> @btime $ai ./ $bi;
  2.863 ms (2 allocations: 7.63 MiB)

julia> @btime $af ./ $bf;
  2.826 ms (2 allocations: 7.63 MiB)

julia> ai, bi = 1, 5
(1, 5)

julia> af, bf = 1.0, 5.0
(1.0, 5.0)

julia> @btime $ai / $bi
  0.022 ns (0 allocations: 0 bytes)
0.2

julia> @btime $af / $bf
  0.022 ns (0 allocations: 0 bytes)
0.2

Topic		Replies	Views
Reducing allocations to optimize (to match Java speed) Performance	22	1339	December 10, 2018
Why does this Python code performs three times faster than Julia? Performance performance	22	1365	October 13, 2023
Understanding Julia performance in simple finite difference code New to Julia	38	4447	June 7, 2018
Julia slower than Matlab & Python? No Performance economics , tensorflow , matlab , pytorch	120	16862	March 16, 2021
Optimizing Calculation in Julia compared to C (New to Julia) Performance	25	2531	January 1, 2020

How to optimise and be faster than Java?

Related topics