ANN: Anyone need my 2.7x faster factorial (or gamma) function?

I was looking into faster code generation (not factorial specifically, but started with that function).

It took a lot longer than it seemed it would at first. Anyway, the Julia PR was closed, but I have more recent code that hopefully doesn’t segfault (lines 1-24 in my fork):

[EDIT: the code in the PR below is old, but the speedup reported in the comment there (the second link) applies to both the new and the old code]
https://github.com/JuliaLang/julia/pull/30685#issuecomment-456373425

https://github.com/JuliaLang/julia/pull/30685#issuecomment-456114578

Does anyone have good real life code that uses factorial that I could use to benchmark with?

I know how to debug/prevent segfaults with julia --check-bounds=no (which is what I would have thought CI uses).


It is a primitive function; you should be able to benchmark it as is.

From the discussion of the PR, it is unclear to me how this is an improvement. Can you explain what you changed?

I would have thought so, but it was surprisingly difficult. I’m not 100% sure why a simple benchmark didn’t show a speedup, but I think it has to do with inlining. I do, however, have tests that show improvements, so I asked for a real-world use case to check whether it is faster there. I don’t really need factorial to be faster for myself; I just thought it would help Julia, so take it or leave it: use my code and/or point me to code that uses factorial, and I may make another PR.

It is possible that you are measuring something else. Hard to say without the actual code.

Again, I still don’t understand what the modification is. Unless you explain what you think you improved and why you think this should result in a speedup, this is unlikely to be used.

The modification is explained here (that old PR is linked from my newer one): Please CLOSE this one in favor of other PR: Code for faster factorial (and gamma) by PallHaraldsson · Pull Request #30665 · JuliaLang/julia · GitHub

Basically I was trying to get inlining and one not-taken jump on the fast path, instead of the three taken jumps before. Part of the optimization is shifting the lookup table (which I do not expect any compiler to do) to get rid of one jump; potentially a better compiler could have figured out the rest.

Can you summarise somewhere exactly what’s faster and what isn’t? Like a list of @btime examples? What I can glean from the PR discussion is that you’re claiming an improvement in things like factorial(Int128(21)) (which would overflow in Int64).
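
For reference, the boundary in question, shown with plain Base calls (exact error wording may vary by Julia version):

julia> factorial(21)            # 21! overflows Int64, so the lookup refuses it
ERROR: OverflowError: 21 is too large to look up in the table

julia> factorial(Int128(21))    # Int128 can represent 21!, and Base has a wider table for it
51090942171709440000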

The last time I wanted factorial(21) it was soon going to be divided by something, so working immediately with floats (and gamma functions) made sense. How does this compare?
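
For example, with SpecialFunctions loaded, the same quantity can be obtained directly as a float:

julia> using SpecialFunctions

julia> gamma(22.0)   # Γ(n+1) == n!, so this is 21! as a Float64 (≈ 5.10909e19), no integer overflow to worry about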


And, most of the time, with logs, e.g. SpecialFunctions.lfactorial and lgamma.
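
For example (assuming SpecialFunctions is loaded):

julia> SpecialFunctions.lfactorial(21)   # log(21!) ≈ 45.38; products of factorials become sums, divisions become subtractions

julia> SpecialFunctions.lgamma(22.0)     # the same value via the log-gamma function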


You may be over-estimating the sophistication of my code here! But thanks, I didn’t know about these.

Good idea! Indeed there seems to be potential for a faster implementation. It probably won’t make much of a difference in practice (and if it did, you can probably solve the problem in a better way than optimizing factorial), but still. I get the impression that your PR could perhaps be made in a cleaner way and/or more similar to the current implementation? Faster is not always better. Here’s an idea (proof of concept):

const _shifted_fact_table64 = [1; cumprod(1:Int64(20))]   # index n+1 holds n!, so 0! needs no special case

function faster_fact(n::Integer)
    # error paths extracted into inner functions, keeping the fast path small
    negative(n) = throw(DomainError(n, "`n` must not be negative."))
    overflow(n) = throw(OverflowError(string(n, " is too large to look up in the table")))
    # one range check on the fast path; pick the appropriate error otherwise
    0 ≤ n ≤ 20 || n |> (n < 0 ? negative : overflow)
    @inbounds f = _shifted_fact_table64[n+1]   # shifted table: n == 0 maps to index 1
    return oftype(n, f)
end

Sample timings, using the extreme case of all values being either 0 or 1 (very unrealistic, but proof that there’s potential for a speedup):

julia> r = rand(0:1, 100000);

julia> s = similar(r);

julia> @btime broadcast!(factorial, $s, $r);
  786.381 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(faster_fact, $s, $r);
  177.465 μs (0 allocations: 0 bytes)

I see. Better than just 0 & 1: it’s more than twice as fast on the whole range 0:20:

julia> r = rand(0:20, 1000);

julia> s = similar(r);

julia> @btime map!(factorial, $s, $r);
  3.673 μs (0 allocations: 0 bytes)

julia> @btime map!(faster_fact, $s, $r);
  1.334 μs (0 allocations: 0 bytes)

Yet @btime on a single factorial call is unable to see this… presumably because the CPU predicts the branches after seeing them many times?

Here’s the core of the Base factorial implementation. I think the point of faster_fact is to have fewer branches on the path to returning the answer, in the useful case of no errors, right?

function factorial_lookup(n::Integer, table, lim)
    n < 0 && throw(DomainError(n, "`n` must not be negative."))
    n > lim && throw(OverflowError(string(n, " is too large to look up in the table")))
    n == 0 && return one(n)
    @inbounds f = table[n]
    return oftype(n, f)
end

Agreed, this sounds worth having. It would be worth figuring out how to write this as much like the Base implementation as possible… my 5-minute attempt is slow again.

Edit:

I get another factor of 2 if I make the lookup table a tuple instead of a vector:

const _shifted_fact_table64_tup = Tuple(_shifted_fact_table64)
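
(The faster_fact_tup used below isn’t shown in the post; presumably it is identical to faster_fact above except for the table, roughly:)

function faster_fact_tup(n::Integer)
    negative(n) = throw(DomainError(n, "`n` must not be negative."))
    overflow(n) = throw(OverflowError(string(n, " is too large to look up in the table")))
    0 ≤ n ≤ 20 || n |> (n < 0 ? negative : overflow)
    @inbounds f = _shifted_fact_table64_tup[n+1]   # tuple lookup instead of the Vector
    return oftype(n, f)
end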

julia> @btime map!(faster_fact, $s, $r);
  1.315 μs (0 allocations: 0 bytes)

julia> @btime map!(faster_fact_tup, $s, $r);
  557.941 ns (0 allocations: 0 bytes)

The 2.6x speedup I calculated is for Int64 (for any number from 0 to 20), just as it says in the comment, with the testing code I added at the top. At first I was only speeding up that common case, and then it took way too many days to get good assembly for all types. It helps that Julia is generic by default, except when you’re trying to optimize all cases…

My optimized code for gamma uses the same principle, and should probably show even more of a speedup. It passes CI, and could be merged as is, AFAIK (I don’t remember ever having a segfault there personally, but see the last comment in the PR).

OK, gamma for integers clearly ought to be a conversion of the same lookup… that makes sense.

For factorials, this two-line change currently gives me a 5.6x speedup (on the same map! benchmark above, over all of 0:20). I don’t quite understand why… but this would seemingly be a hard-to-refuse PR, since it changes so little?

julia> const _fact_table64_tup = Tuple(Base._fact_table64)

julia> Base.factorial(n::Union{Int64,UInt64}) = factorial_lookup(n, _fact_table64_tup, 20)

Edit: Using this in gamma() too gets me a factor of 2.1 on the same data.

function SpecialFunctions.gamma(n::Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64})
    n < 0 && throw(DomainError(n, "`n` must not be negative."))
    n == 0 && return Inf
    n <= 2 && return 1.0
    n > 20 && return gamma(Float64(n))
    @inbounds return Float64(_fact_table64_tup[n-1])   # Γ(n) == (n-1)!
end

Actually, my code seems to be faster, at least on my old laptop. Possibly you just see a higher factor on better hardware? Please try my code; see the new links at the top.


julia> @btime broadcast!(factorial, $s, $r);  # my factorial
  252.554 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(faster_fact, $s, $r);
  373.523 μs (0 allocations: 0 bytes)

julia> @btime test(1);   # my test for your code vs. 245.989 for my code
  366.049 μs (0 allocations: 0 bytes)

Right, the point was not to check for unlikely errors first. I’m puzzled why a tuple should be faster. Is that for sure, or is something else happening? You also had different sizes, which may matter.

Clearly this needs to work not just on my machine, and on some more test cases. I will try a bit more later. My guess is that some magic of constant propagation or something is making it ignore the branches in the tuple case? The lookup table is an immutable constant of known, fixed length when it’s a tuple.

But if it does hold up, then I suggest proposing the minimal possible change: just take the existing _fact_table64, make a tuple starting with 1, and call that in the existing functions (with [n+1]). And write up a really easy-to-read set of benchmarks.
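
(A rough sketch of that minimal change, with made-up names rather than the actual PR, assuming Base._fact_table64 is the existing 20-entry table:)

const _fact_table64_shifted = (Int64(1), Base._fact_table64...)   # prepend 0! and store as a tuple

function factorial_lookup_shifted(n::Integer, table, lim)
    n < 0 && throw(DomainError(n, "`n` must not be negative."))
    n > lim && throw(OverflowError(string(n, " is too large to look up in the table")))
    @inbounds f = table[n+1]   # shifted index: table[1] == 0!, so no n == 0 branch
    return oftype(n, f)
end

my_factorial(n::Union{Int64,UInt64}) = factorial_lookup_shifted(n, _fact_table64_shifted, 20)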

Thanks for making an effort to help! PRs to Base are a PITA. If possible, and it should be in this case, it’s often much easier to put it in a package on GitHub (or wherever). After it’s solid and the benefit is clear, you can make a PR to Base or a stdlib, etc. In the repo, you can include tests, benchmarks, whatever. People can then very easily try it out and make suggestions, forks, etc. I understand you may not want to take the time to do it. We’re busy, and not paid to help. But if you do want to spend some time on it, it’s probably better spent on a package.

And yes, the gamma function has enormously wide applicability. There are almost certainly some cases where a faster implementation would be useful.


Note, my gamma PR will only make it faster for integer values, so maybe it’s not as helpful in general? Gamma, however, is by now in a package, so that PR could be merged immediately (or at least reviewed).

Right, as in the combinatorics formulas I’m aware of. With lfactorial, because of the log, the divide turns into a subtraction, so it seems better since it’s faster. But is that for sure? I assume you’ll then later have to do an exponential, so that gain is lost, as it’s slower than the divide you eliminated?

I wasn’t even familiar with those until recently (and gamma only a little earlier). So even if they’re faster, I’m not sure everyone would use them. Do you know where, or in what situations, they are used, other than the one above?

I suppose I could optimize those in the same way, just curious to know first if it would really help people.

Would it make sense to special case the float version of gamma too, to make it first check the 171 integer values and then fall back to the fractional ones?

Not an expert, but I thought the usual reason for such things was more to avoid underflow issues than for speed: if you have a lot of numbers in [0,1], or worse adding up to 1, then it’s a better use of floating point precision to take logs.

This is a good thought; maybe there should be a look-up table for these too. I see that gamma(::Int) only looks up the first 20, but 171 isn’t many to store. And checking rem(x,1) == 0 appears to be almost free compared to gamma(22.0), so I’d say that should be looked up too. But perhaps people have thought about this – maybe open an issue on SpecialFunctions to ask first?
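
(A rough sketch of that idea, with hypothetical names, not the SpecialFunctions implementation:)

using SpecialFunctions

const _float_fact_table = cumprod(1.0:170.0)   # _float_fact_table[k] == k! as a Float64; 170! is still finite

function gamma_with_int_fastpath(x::Float64)
    if 3 <= x <= 171 && rem(x, 1) == 0
        @inbounds return _float_fact_table[Int(x) - 1]   # Γ(n) == (n-1)! for integer n
    end
    return gamma(x)   # fall back to the general routine for everything else
end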

Perhaps you could post a minimal self-contained example? To make sure we’re testing the same code. If I found the right code of yours, it’s decorated with @inline which is why it’s faster. If I add @inline in front of faster_fact, it’s in fact a lot faster than your version, on my machine and using this simple test:

r = rand(0:20, 100000); s = similar(r);
@btime broadcast!(faster_fact, $s, $r);

Actually, in my testing, the biggest gain is not from having fewer branches, but from avoiding the allocation of the gcframe (check the native or LLVM code of the original factorial method to see; it’s tons of code). This is caused by the throw statements, and I avoided it in the code above by extracting them into methods (negative and overflow).
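
(For anyone wanting to verify this, in a fresh session before redefining Base.factorial as above, the standard introspection macros show the difference being described:)

julia> @code_llvm factorial(20)    # note the gc-frame setup that the throw paths force into the prologue

julia> @code_llvm faster_fact(20)  # the throws live in the extracted negative/overflow methods instead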

Another huge gain can be achieved by inlining the method. It’s correct that using tuples here is slightly faster than an array, but at least on my system the difference is only 0.1 ns per call. The tuple code however also results in the method being inlined, which with the above test results in a ~4x faster benchmark. We can @inline the array method for a similar speedup.

After these optimizations, we have the removal of the branching, which has a smaller (but still very much noticeable) effect in my tests.

At least on my system, this is largely caused by the method being inlined in the case of using tuples, but not arrays. Try putting @inline and @noinline in front of both versions to see what the difference is.
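
For instance, redefining the array version with a forced inline hint (same body as faster_fact above, only the annotation added and a hypothetical name) makes the comparison easy:

@inline function faster_fact_inlined(n::Integer)
    negative(n) = throw(DomainError(n, "`n` must not be negative."))
    overflow(n) = throw(OverflowError(string(n, " is too large to look up in the table")))
    0 ≤ n ≤ 20 || n |> (n < 0 ? negative : overflow)
    @inbounds f = _shifted_fact_table64[n+1]   # Vector table, but now inlined into the caller
    return oftype(n, f)
end
# then compare: @btime map!(faster_fact_inlined, $s, $r);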

There IS indeed a difference between tuples and arrays, but it shouldn’t be very big. The array version incurs one additional level of indirection (memory access) during the indexing. Compare the native code for the array lookup (my comments added):

movabs	rax, 4785672208           # rax = pointer to array object
mov	rax, qword ptr [rax]          # rax = pointer to raw data
mov	rax, qword ptr [rax + 8*rdi]  # rax = looked up entry

With tuple:

movabs	rax, 4930829632           # rax = pointer to tuple / raw data
mov	rax, qword ptr [rax + 8*rdi]  # rax = looked up entry

It’s a bit silly that the compiler is unable to figure this out, given that the array is const.
