I’ve written a fast Julia-only permutation (n P r) function intended for perms small enough to fit into a Float64 (at the expense of some precision). Thus, if a permutation doesn’t exceed 1.7976931348623157e308, one can avoid slow BigInts.
However, much of its speed relies on a magic threshold whose universality I’m unsure of. On my machine (with Julia 1.4), computing a permutation of length r>50 is faster when split into chunks of length 50 or less. For instance, (1000 P 100) = 1000⋅999⋅998⋅…⋅901, which exceeds typemax(Int128). I define a Float64 to store/update the result when multiplied by each Int64 term one-by-one in a loop. Curiously, it’s faster to use multiple shorter loops than one long one, eg it’s >50x faster to do (1000 P 50)⋅(950 P 50).
If r=102, using three chunks of size 50,50,2 is much faster than 51,51 because (n P 50) is much faster than (n P 51) for some reason.
My question: is 50 the magic number on all machines or just mine? I’ll share the code and let you guys @btime it on your machines if curious. I suggest trying floatperm(500,109). My machine reports 1.7ns, which is the 2nd-shortest time it ever reports for any function (since it seemingly can’t report anything in-between that and 0.01ns). If you get something like 8ns or higher, that probably means 50 isn’t universal. If you feel like it, try reducing the chunk size until you find your machine’s magic number. If it varies, perhaps I can achieve optimal speed on all machines simply by reducing the size.
I’ve only tried permutations (sequential products), so idk if the magic 50 applies to all products.
@inline function floatperm(n::Int64, r::Int64)
#n<171 && return floatfact(n) / floatfact(n-r)
nrdiff = n-r
if r>50
nbreak = nrdiff+51
perms = floatperm(n,50)
n -= 50
n<nbreak && return perms*floatperm(n, n-nrdiff)
while true
perms *= floatperm(n,50)
n -= 50
n<nbreak && return perms*floatperm(n, n-nrdiff)
end
else
perms = Float64(nrdiff+1)
for j=(nrdiff+2):n begin perms*=j end end
end
return perms
end
Note 1: if anyone wants to uncomment the 1st line for a conditional speedup, I’ll paste the floatfact() code, which is just a hardcoded list of return values for each factorial up to 170 (the max for Float64).
Note 2: by trial and error, I found that it’s significantly slower to start with perms=1.0 and put all the work inside the loop, which is why I don’t do that above.
Btw, any criticism of my code is always welcome, be it on-topic or not.
Thanks in advance!