 # Repeatedly multiplying a Float64 is fastest in chunks of <51 terms

I’ve written a fast Julia-only permutation (n P r) function intended for permutations small enough to fit into a Float64 (at the expense of some precision). Thus, as long as a permutation doesn’t exceed 1.7976931348623157e308, one can avoid slow BigInts.

However, much of its speed relies on a magic threshold whose universality I’m unsure of. On my machine (with Julia 1.4), computing a permutation of length r > 50 is faster when split into chunks of length 50 or less. For instance, (1000 P 100) = 1000⋅999⋅998⋅…⋅901, which exceeds typemax(Int128), so I store the running result in a Float64 and multiply it by each Int64 term one by one in a loop. Curiously, it’s faster to use multiple shorter loops than one long one; e.g. it’s >50x faster to compute (1000 P 50)⋅(950 P 50).

If r=102, using three chunks of sizes 50, 50, and 2 is much faster than two chunks of 51, because (n P 50) is much faster than (n P 51) for some reason.
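The chunking relies on the identity (n P r) = (n P k) ⋅ ((n−k) P (r−k)), which can be sanity-checked in exact arithmetic (a quick sketch using BigInt; `perm` is just a throwaway name here):

```julia
# Exact n P r via BigInt: n·(n-1)·…·(n-r+1)
perm(n, r) = prod(big(n - r + 1):big(n))

# 1000 P 100 split into two chunks of 50
@assert perm(1000, 100) == perm(1000, 50) * perm(950, 50)

# and into chunks of 50, 50, 2 for r = 102
@assert perm(1000, 102) == perm(1000, 50) * perm(950, 50) * perm(900, 2)
```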

My question: is 50 the magic number on all machines, or just mine? I’ll share the code and let you @btime it on your machines if you’re curious. I suggest trying floatperm(500,109). My machine reports 1.7ns, which is the 2nd-shortest time it ever reports for any function (since it seemingly can’t report anything between that and 0.01ns). If you get something like 8ns or higher, that probably means 50 isn’t universal. If you feel like it, try reducing the chunk size until you find your machine’s magic number. If it varies, perhaps I can achieve optimal speed on all machines simply by reducing the chunk size.

I’ve only tried permutations (sequential products), so I don’t know if the magic 50 applies to all products.

```julia
@inline function floatperm(n::Int64, r::Int64)
    #n<171 && return floatfact(n) / floatfact(n-r)
    nrdiff = n - r
    if r > 50
        nbreak = nrdiff + 51
        perms = floatperm(n, 50)
        n -= 50
        n < nbreak && return perms * floatperm(n, n - nrdiff)
        while true
            perms *= floatperm(n, 50)
            n -= 50
            n < nbreak && return perms * floatperm(n, n - nrdiff)
        end
    else
        perms = Float64(nrdiff + 1)
        for j in (nrdiff + 2):n
            perms *= j
        end
    end
    return perms
end
```
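As a quick precision sanity check (not part of the code above), the Float64 running product can be compared against exact BigInt arithmetic; a sketch with a simplified single-loop kernel (`floatprod` is a made-up name, ignoring the chunking):

```julia
# Simplified single-loop kernel (made-up name; no chunking)
function floatprod(n, r)
    p = 1.0
    for j in (n - r + 1):n
        p *= j
    end
    return p
end

# Exact 1000 P 100 via BigInt, rounded to Float64 at the end
exact = Float64(prod(big(901):big(1000)))

# ~100 multiplications lose at most a few ulps in total
@assert abs(floatprod(1000, 100) - exact) / exact < 1e-12
```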

Note 1: if anyone wants to uncomment the 1st line for a conditional speedup, I’ll paste the floatfact() code, which is just a hardcoded list of return values for each factorial up to 170 (the max for Float64).

Note 2: by trial and error, I found that it’s significantly slower to start with perms=1.0 and put all the work inside the loop, which is why I don’t do that above.

Btw, any criticism of my code is always welcome, be it on-topic or not.


If you want to avoid loss of precision, `maxintfloat` is useful.

• the largest factorial <= maxintfloat(Float64) is factorial(18)
• the largest factorial such that BigInt(Float64(factorial)) == factorial is factorial(22)
```julia
julia> maxintfloat(Float64)
9.007199254740992e15

julia> Int64(maxintfloat(Float64))
9007199254740992
```
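Both boundaries can also be recovered empirically with a brute-force search (a sketch; the function names are made up):

```julia
# Largest n with factorial(n) <= maxintfloat(Float64)
function max_fact_below_maxint()
    m = 1
    while factorial(big(m + 1)) <= maxintfloat(Float64)
        m += 1
    end
    return m
end

# Largest n whose factorial survives a Float64 round-trip exactly
function max_fact_roundtrip()
    n = 1
    while BigInt(Float64(factorial(big(n + 1)))) == factorial(big(n + 1))
        n += 1
    end
    return n
end
```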

I do not understand what happens here, but here are a few notes.

First, let me say that any benchmark time on the order of 1ns is suspiciously low: there is no way a CPU (currently running at something like 2 to 4GHz) could perform several dozen FP operations in such a short time. This usually indicates that, somewhere in the code optimization process, an O(n) algorithm was transformed into an O(1) one.

Here is the smallest benchmark I could construct exhibiting the behavior you describe:

```julia
function floatperm(n::Int64, r::Int64)
    nrdiff = n - r
    perms = 1.0
    for j in (nrdiff + 1):n
        perms *= j
    end
    return perms
end
```

For `r>=50` we see, as expected, a roughly linear time complexity:

```julia
julia> using BenchmarkTools

julia> @btime floatperm(200, 150)
171.834 ns (0 allocations: 0 bytes)
Inf

julia> @btime floatperm(200, 100)
101.068 ns (0 allocations: 0 bytes)
8.450550186924625e216

julia> @btime floatperm(200, 50)
43.554 ns (0 allocations: 0 bytes)
1.3803691006536055e112
``````

However, for lower values of `r`, the computing time suddenly drops, and the global time complexity becomes constant:

```julia
julia> @btime floatperm(200, 49)
1.881 ns (0 allocations: 0 bytes)
9.141517222871563e109

julia> @btime floatperm(200, 48)
1.882 ns (0 allocations: 0 bytes)
6.014156067678662e107

julia> @btime floatperm(200, 25)
1.882 ns (0 allocations: 0 bytes)
7.013724258999707e56

julia> @btime floatperm(200, 10)
1.882 ns (0 allocations: 0 bytes)
8.147020443654738e22
``````

This could hint at some clever heuristic performed by the compiler, except that the compiler generates very straightforward code in this simple case:

```julia
julia> @code_native floatperm(200, 10)
```

Complete output:

```
        .text
; ┌ @ REPL:2 within `floatperm'
; │┌ @ REPL:2 within `-'
movq    %rdi, %rax
subq    %rsi, %rax
; │└
; │ @ REPL:4 within `floatperm'
; │┌ @ int.jl:53 within `+'
leaq    1(%rax), %rcx
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:277 within `UnitRange'
; │││┌ @ range.jl:282 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:410 within `<='
cmpq    %rdi, %rcx
; ││││└└
cmovleq %rdi, %rax
movabsq $140429506324992, %rdx  # imm = 0x7FB84AD72A00
; │└└└
; │┌ @ range.jl:593 within `iterate'
; ││┌ @ range.jl:477 within `isempty'
; │││┌ @ operators.jl:294 within `>'
; ││││┌ @ int.jl:49 within `<'
cmpq    %rcx, %rax
; │└└└└
jge     L37
vmovsd  (%rdx), %xmm0           # xmm0 = mem,zero
; │ @ REPL:7 within `floatperm'
retq
; │ @ REPL:5 within `floatperm'
; │┌ @ promotion.jl:312 within `*'
; ││┌ @ promotion.jl:282 within `promote'
; │││┌ @ promotion.jl:259 within `_promote'
; ││││┌ @ number.jl:7 within `convert'
; │││││┌ @ float.jl:60 within `Float64'
L37:
negq    %rax
vmovsd  (%rdx), %xmm0           # xmm0 = mem,zero
nopl    (%rax)
L48:
vcvtsi2sdq      %rcx, %xmm2, %xmm1
; ││└└└└
; ││ @ promotion.jl:312 within `*' @ float.jl:405
vmulsd  %xmm1, %xmm0, %xmm0
; │└
; │┌ @ range.jl:597 within `iterate'
; ││┌ @ promotion.jl:398 within `=='
leaq    (%rax,%rcx), %rdx
; ││└
; ││┌ @ promotion.jl:398 within `=='
cmpq    $1, %rdx
; │└└
jne     L48
; │ @ REPL:7 within `floatperm'
retq
nopl    (%rax)
; └
``````

So I’m puzzled, because I don’t see how such a code transformation could occur, except maybe within the CPU itself (the results above were generated on a Core i5-6200U, in case a hardware expert comes by).


When the `@code_native` output can’t account for the pattern we see in the benchmarks, that’s a sign that LLVM is defeating the benchmark. When that happens, the `Ref` trick often helps:

```julia
julia> @btime floatperm(200, $(Ref(10))[])
5.617 ns (0 allocations: 0 bytes)
8.147020443654738e22

julia> @btime floatperm(200, $(Ref(25))[])
14.022 ns (0 allocations: 0 bytes)
7.013724258999707e56

julia> @btime floatperm(200, $(Ref(48))[])
29.919 ns (0 allocations: 0 bytes)
6.014156067678662e107

julia> @btime floatperm(200, $(Ref(49))[])
30.503 ns (0 allocations: 0 bytes)
9.141517222871563e109

julia> @btime floatperm(200, $(Ref(50))[])
31.091 ns (0 allocations: 0 bytes)
1.3803691006536055e112

julia> @btime floatperm(200, $(Ref(100))[])
74.816 ns (0 allocations: 0 bytes)
8.450550186924625e216

julia> @btime floatperm(200, $(Ref(150))[])
119.684 ns (0 allocations: 0 bytes)
Inf
```

EDIT:
But there definitely are magic numbers for some operations.
x86_64 CPUs with the AVX instruction set but without AVX512F have 16 floating-point registers that are each 256 bits wide, meaning they can hold 4 `Float64` each, for a total of 64 `Float64`s.
To minimize moving numbers into and out of registers, many optimized algorithms will hold as many numbers as possible in the registers and try to repeatedly use them.
But if in writing the code, you end up trying to hold too many, the registers will start “spilling” (to the stack), and you’ll end up with a huge number of move instructions moving memory into and out of the registers.


Interesting, yes: the magic speed jump from 51 to 50 disappears when I use Ref. Rather than 1.71ns, floatperm(500, 50) takes 85ns and r=51 takes 86ns.

I’m not familiar with Ref, so I’m not sure how to interpret that. My first guess is that the takeaway is the linearity of the relationship, so without Ref, when @btime tells me t(51)=56ns and t(50)=1.7ns, I should figure that the real t(50) is about 55ns. Would that be correct, or should I take the Ref results literally and figure those r’s take 85 and 86 ns without Ref too?

What is real? If the compiler is optimizing for 50, let it optimize for 50… What is your ultimate goal? To write the program so it performs well? Or to write the program so it can’t be optimized too well?

If you can do your work in chunks of 50, that sounds like the best performance you are going to get. Wasn’t that the goal?


Yes, but I don’t know what Ref does or tells me. Is it telling me that @btime was lying to me, or something else? (I guess you’re saying @btime wasn’t lying and that 1.71ns was real?)

Edit - oh, I guess Ref avoids the compiler optimization?

Ref is basically a pointer (that can’t be null) to a value rather than the value itself… which apparently causes LLVM not to short-circuit the computations. So @btime isn’t “lying”; LLVM created code that, when a value is less than 50, does a lookup or something.
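Roughly: `Ref` is a mutable, pointer-like cell, and interpolating `$(Ref(x))[]` into `@btime` makes the argument opaque to the compiler, so the computation can’t be constant-folded away at benchmark-compile time. A minimal illustration of the cell itself:

```julia
# A Ref is a mutable container holding one value; [] reads/writes it
r = Ref(50)
@assert r[] == 50

r[] = 51          # the cell can be updated in place
@assert r[] == 51

# Concretely, Ref(50) is a Base.RefValue{Int}
@assert r isa Base.RefValue{Int}
```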

To me this raises the question of why @code_native doesn’t show the “short-circuited” assembly… I do find that kind of disturbing.


Ah ok thanks.

Hm, well, the optimization seems not to be universal: for @ffevotte the magic limit was 49 rather than 50.

I guess I’ll tweak the code to use chunks of 40 or 45 to hopefully capture the optimization on all machines.

Edit: weird, I added a “chunk” parameter defaulted to 50, so it’s floatperm(n::Int64, r::Int64, chunk=50). Now
floatperm(500, 109) takes 190ns, while
floatperm(500, 109, 50) takes 2ns.

The compiler appears to be fickle!

Re-edit: for those who might use it, here it is with a hardcoded chunk size of 40, along with the accompanying factorial functions. I also wrote a size-check function, permfitsfloat(), so that one can know in advance whether floatperm() will return a finite value. (I went through that effort when floatperm() was slower; now, permfitsfloat() doesn’t save a ton of time.)

```julia
@inline function floatperm(n::Int64, r::Int64)
    n < 171 && return floatfact(n) / floatfact(n - r)
    nrdiff = n - r
    if r > 40
        nbreak = nrdiff + 41
        perms = floatperm(n, 40)
        n -= 40
        n < nbreak && return perms * floatperm(n, n - nrdiff)
        while true
            perms *= floatperm(n, 40)
            n -= 40
            n < nbreak && return perms * floatperm(n, n - nrdiff)
        end
    else
        perms = Float64(nrdiff + 1)
        for j in (nrdiff + 2):n
            perms *= j
        end
    end
    return perms
end

@inline function fact(n::Int64)
(n==1 || n==0) && return 1
n==2 && return 2
n==3 && return 6
n==4 && return 24
n==5 && return 120
n==6 && return 720
n==7 && return 5040
n==8 && return 40320
n==9 && return 362880
n==10 && return 3628800
n==11 && return 39916800
n==12 && return 479001600
n==13 && return 6227020800
n==14 && return 87178291200
n==15 && return 1307674368000
n==16 && return 20922789888000
n==17 && return 355687428096000
n==18 && return 6402373705728000
n==19 && return 121645100408832000
n==20 && return 2432902008176640000
n>20 && throw(DomainError(n,"factorial exceeds Int64"))
throw(DomainError(n,"n cannot be negative"))
end
@inline function fact128(n::Int64)
n<21 && return Int128(fact(n))
n==21 && return 51090942171709440000
n==22 && return 1124000727777607680000
n==23 && return 25852016738884976640000
n==24 && return 620448401733239439360000
n==25 && return 15511210043330985984000000
n==26 && return 403291461126605635584000000
n==27 && return 10888869450418352160768000000
n==28 && return 304888344611713860501504000000
n==29 && return 8841761993739701954543616000000
n==30 && return 265252859812191058636308480000000
n==31 && return 8222838654177922817725562880000000
n==32 && return 263130836933693530167218012160000000
n==33 && return 8683317618811886495518194401280000000
throw(DomainError(n,"factorial exceeds Int128"))
end
@inline function floatfact(n::Int64)
n<34 && return Float64(fact128(n))
n==34 && return 2.9523279903960416e38
n==35 && return 1.0333147966386145e40
n==36 && return 3.7199332678990125e41
n==37 && return 1.3763753091226346e43
n==38 && return 5.230226174666011e44
n==39 && return 2.0397882081197444e46
n==40 && return 8.159152832478977e47
n==41 && return 3.345252661316381e49
n==42 && return 1.40500611775288e51
n==43 && return 6.041526306337383e52
n==44 && return 2.658271574788449e54
n==45 && return 1.1962222086548019e56
n==46 && return 5.502622159812089e57
n==47 && return 2.5862324151116818e59
n==48 && return 1.2413915592536073e61
n==49 && return 6.082818640342675e62
n==50 && return 3.0414093201713376e64
n==51 && return 1.5511187532873822e66
n==52 && return 8.065817517094388e67
n==53 && return 4.2748832840600255e69
n==54 && return 2.308436973392414e71
n==55 && return 1.2696403353658276e73
n==56 && return 7.109985878048635e74
n==57 && return 4.0526919504877214e76
n==58 && return 2.3505613312828785e78
n==59 && return 1.3868311854568984e80
n==60 && return 8.32098711274139e81
n==61 && return 5.075802138772248e83
n==62 && return 3.146997326038794e85
n==63 && return 1.98260831540444e87
n==64 && return 1.2688693218588417e89
n==65 && return 8.247650592082472e90
n==66 && return 5.443449390774431e92
n==67 && return 3.647111091818868e94
n==68 && return 2.4800355424368305e96
n==69 && return 1.711224524281413e98
n==70 && return 1.1978571669969892e100
n==71 && return 8.504785885678623e101
n==72 && return 6.1234458376886085e103
n==73 && return 4.4701154615126844e105
n==74 && return 3.307885441519386e107
n==75 && return 2.48091408113954e109
n==76 && return 1.8854947016660504e111
n==77 && return 1.4518309202828587e113
n==78 && return 1.1324281178206297e115
n==79 && return 8.946182130782976e116
n==80 && return 7.156945704626381e118
n==81 && return 5.797126020747368e120
n==82 && return 4.753643337012842e122
n==83 && return 3.945523969720659e124
n==84 && return 3.314240134565353e126
n==85 && return 2.81710411438055e128
n==86 && return 2.4227095383672734e130
n==87 && return 2.107757298379528e132
n==88 && return 1.8548264225739844e134
n==89 && return 1.650795516090846e136
n==90 && return 1.4857159644817615e138
n==91 && return 1.352001527678403e140
n==92 && return 1.2438414054641308e142
n==93 && return 1.1567725070816416e144
n==94 && return 1.087366156656743e146
n==95 && return 1.032997848823906e148
n==96 && return 9.916779348709496e149
n==97 && return 9.619275968248212e151
n==98 && return 9.426890448883248e153
n==99 && return 9.332621544394415e155
n==100 && return 9.332621544394415e157
n==101 && return 9.42594775983836e159
n==102 && return 9.614466715035127e161
n==103 && return 9.90290071648618e163
n==104 && return 1.0299016745145628e166
n==105 && return 1.081396758240291e168
n==106 && return 1.1462805637347084e170
n==107 && return 1.226520203196138e172
n==108 && return 1.324641819451829e174
n==109 && return 1.4438595832024937e176
n==110 && return 1.588245541522743e178
n==111 && return 1.7629525510902446e180
n==112 && return 1.974506857221074e182
n==113 && return 2.2311927486598138e184
n==114 && return 2.5435597334721877e186
n==115 && return 2.925093693493016e188
n==116 && return 3.393108684451898e190
n==117 && return 3.969937160808721e192
n==118 && return 4.684525849754291e194
n==119 && return 5.574585761207606e196
n==120 && return 6.689502913449127e198
n==121 && return 8.094298525273444e200
n==122 && return 9.875044200833601e202
n==123 && return 1.214630436702533e205
n==124 && return 1.506141741511141e207
n==125 && return 1.882677176888926e209
n==126 && return 2.372173242880047e211
n==127 && return 3.0126600184576594e213
n==128 && return 3.856204823625804e215
n==129 && return 4.974504222477287e217
n==130 && return 6.466855489220474e219
n==131 && return 8.47158069087882e221
n==132 && return 1.1182486511960043e224
n==133 && return 1.4872707060906857e226
n==134 && return 1.9929427461615188e228
n==135 && return 2.6904727073180504e230
n==136 && return 3.659042881952549e232
n==137 && return 5.012888748274992e234
n==138 && return 6.917786472619489e236
n==139 && return 9.615723196941089e238
n==140 && return 1.3462012475717526e241
n==141 && return 1.898143759076171e243
n==142 && return 2.695364137888163e245
n==143 && return 3.854370717180073e247
n==144 && return 5.5502938327393044e249
n==145 && return 8.047926057471992e251
n==146 && return 1.1749972043909107e254
n==147 && return 1.727245890454639e256
n==148 && return 2.5563239178728654e258
n==149 && return 3.80892263763057e260
n==150 && return 5.713383956445855e262
n==151 && return 8.62720977423324e264
n==152 && return 1.3113358856834524e267
n==153 && return 2.0063439050956823e269
n==154 && return 3.0897696138473508e271
n==155 && return 4.789142901463394e273
n==156 && return 7.471062926282894e275
n==157 && return 1.1729568794264145e278
n==158 && return 1.853271869493735e280
n==159 && return 2.9467022724950384e282
n==160 && return 4.7147236359920616e284
n==161 && return 7.590705053947219e286
n==162 && return 1.2296942187394494e289
n==163 && return 2.0044015765453026e291
n==164 && return 3.287218585534296e293
n==165 && return 5.423910666131589e295
n==166 && return 9.003691705778438e297
n==167 && return 1.503616514864999e300
n==168 && return 2.5260757449731984e302
n==169 && return 4.269068009004705e304
n==170 && return 7.257415615307999e306
throw(DomainError(n,"factorial exceeds Float64"))
end

@inline function permfitsfloat(n::Int64, r::Int64)
r>170 && return false
r>167 && return n<171
r>165 && return n<172
r==165 && return n<173
r==164 && return n<174
r>161 && return n<175
r==161 && return n<177
r==160 && return n<178
r==159 && return n<179
r==158 && return n<181
r==157 && return n<182
r==156 && return n<184
r==155 && return n<186
r==154 && return n<188
r==153 && return n<190
r==152 && return n<192
r==151 && return n<194
r==150 && return n<197
r==149 && return n<200
r==148 && return n<203
r==147 && return n<206
r==146 && return n<209
r==145 && return n<213
r==144 && return n<217
r==143 && return n<221
r==142 && return n<225
r==141 && return n<229
r==140 && return n<234
r==139 && return n<239
r==138 && return n<245
r==137 && return n<251
r==136 && return n<257
r==135 && return n<264
r==134 && return n<270
r==133 && return n<278
r==132 && return n<286
r==131 && return n<294
r==130 && return n<303
r==129 && return n<313
r==128 && return n<323
r==127 && return n<333
r==126 && return n<345
r==125 && return n<357
r==124 && return n<370
r==123 && return n<384
r==122 && return n<399
r==121 && return n<415
r==120 && return n<432
r==119 && return n<450
r==118 && return n<470
r==117 && return n<491
r==116 && return n<514
r==115 && return n<538
r==114 && return n<564
r==113 && return n<592
r==112 && return n<622
r==111 && return n<655
r==110 && return n<690
r==109 && return n<728
r==108 && return n<770
r==107 && return n<814
r==106 && return n<863
r==105 && return n<916
r==104 && return n<973
r==103 && return n<1035
r==102 && return n<1104
r==101 && return n<1178
r==100 && return n<1260
r==99 && return n<1349
r==98 && return n<1447
r==97 && return n<1555
r==96 && return n<1674
r==95 && return n<1805
r==94 && return n<1950
r==93 && return n<2110
r==92 && return n<2288
r==91 && return n<2486
r==90 && return n<2706
r==89 && return n<2952
r==88 && return n<3228
r==87 && return n<3536
r==86 && return n<3883
r==85 && return n<4274
r==84 && return n<4716
r==83 && return n<5217
r==82 && return n<5785
r==81 && return n<6432
r==80 && return n<7172
r==79 && return n<8019
r==78 && return n<8992
r==77 && return n<10115
r==76 && return n<11414
r==75 && return n<12922
r==74 && return n<14679
r==73 && return n<16735
r==72 && return n<19148
r==71 && return n<21995
r==70 && return n<25365
r==69 && return n<29374
r==68 && return n<34166
r==67 && return n<39919
r==66 && return n<46863
r==65 && return n<55289
r==64 && return n<65568
r==63 && return n<78182
r==62 && return n<93755
r==61 && return n<113103
r==60 && return n<137301
r==59 && return n<167777
r==58 && return n<206442
r==57 && return n<255874
r==56 && return n<319585
r==55 && return n<402403
r==54 && return n<511027
r==53 && return n<654854
r==52 && return n<847205
r==51 && return n<1107185
r==50 && return n<1462520
r==49 && return n<1953971
r==48 && return n<2642270
r==47 && return n<3619210
r==46 && return n<5025641
r==45 && return n<7081189
r==44 && return n<10134211
r==43 && return n<14747383
r==42 && return n<21847332
r==41 && return n<32991949
r==40 && return n<50859028
r==39 && return n<80161747
r==38 && return n<129409531
r==37 && return n<214391995
r==36 && return n<365284302
r==35 && return n<641619566
r==34 && return n<1164971149
r==33 && return n<2193067454
r==32 && return n<4294967312
r==31 && return n<8784165360
r==30 && return n<18843309684
r==29 && return n<42606214599
r==28 && return n<102116749996
r==27 && return n<261120709467
r==26 && return n<717712645928
r==25 && return n<2138890849000
r==24 && return n<6981463658344
r==23 && return n<25256835601068
r==22 && return n<102701782189099
r==21 && return n<477304994735077
r==20 && return n<2586638741762884
r==19 && return n<16746821753625199
r==18 && return n<133432608295961812
r==17 && return n<1357157737484444817
return true
end
``````
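For reference, the big hardcoded permfitsfloat() table could in principle be replaced by a log-based overflow check (a sketch with a made-up name; it is O(r) rather than O(1), and rounding in the log sum makes it unreliable within a few ulps of the boundary, so the table is still the safe choice right at the edge):

```julia
# Table-free alternative (hypothetical name): n P r fits in a Float64
# iff the base-2 log of the product stays below the exponent limit.
fitsfloat(n, r) = sum(log2(Float64(j)) for j in (n - r + 1):n) < 1024
```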

I would suspect that @btime is lying. (Not lying in terms of the micro-benchmark, but misrepresenting what you would see in general use.)

I happened to read some of this post just before yours:
PSA: Microbenchmarks remember branch history

I suspect that the small code size of this very simplistic micro-benchmark, which calls the same function with the same inputs the whole time, is skewing what you’re seeing.

Granted, there might be some value in the argument based on the number of available registers, which might make chunks below a certain cut-off better.

I would be more inclined to believe a test where you:

• create an array of 1000 acceptable inputs
• randomly permute this list
• create a function with a for loop that runs through this list of inputs
• time or even @btime that function

Hopefully that should provide enough variability to prevent unrealistic tricks and give a more realistic representation of what you would see in practice when using it on different inputs for every call.
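That protocol might look something like this (a sketch with made-up names; `permy` stands in for whatever kernel is under test):

```julia
using Random  # stdlib, for shuffle!

# Simple sequential-product kernel to benchmark
function permy(n, r)
    p = 1.0
    for j in (n - r + 1):n
        p *= j
    end
    return p
end

# 1000 randomized (n, r) inputs, shuffled so the branch predictor
# can't memorize a single call pattern
inputs = [(rand(100:10_000), rand(35:61)) for _ in 1:1000]
shuffle!(inputs)

# Summing the results forces every call to actually run;
# @btime on this, divided by length(inputs), gives a per-call estimate
checksum(xs) = sum(permy(n, r) for (n, r) in xs)
```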


I think you’re right. Btime is honest but my processor is cheating on the test.

```julia
@inline randperm1() = permy(rand(100:10000), rand(51:61))
@inline randperm2() = permy(rand(100:10000), rand(35:45))

@inline function permy(n::Int64, r::Int64)
    nrdiff = n - r
    perms = Float64(nrdiff + 1)
    for j in (nrdiff + 2):n
        perms *= j
    end
    return perms
end
```

The time difference between randperm1() and randperm2() is only 27ns, not the 54ns difference I get between permy(n,51) and permy(n,50). The average r-values are 56 and 40, so 16 apart. If I test permy() with r=76 and r=60, the time difference is about 28ns. So an r of 56 is to 40 as 76 is to 60, meaning r<51 is not special. At least, it’s not special when the processor no longer has access to the answer key.

This sucks. Lesson learned: from now on I’ll use randomness in every nano-benchmark. (It would probably also be wise for me to spend less time attempting nano-optimizations.)

That may be overkill. Instead, I would recommend micro-optimizations in the context of a larger problem, when you have already identified them as important (e.g. roughly x% of lines of code taking up 10x% of runtime). The larger context will make sure you are benchmarking what is relevant for your purposes.
