Potential solution to the overflow problem; 64-bit considered harmful, 32- or 21-bit better

No, because even if you end up at Int64, subsequent use of it will still be checked (unlike the status quo). But yes, you want to avoid going that far, for both safety AND performance reasons: with Int21 you have a budget of three multiplies before any slowdown (or more, as explained with my polynomial example), or 43 additions. No matter the size of the resulting type you end up with, it will be checked. As I just explained, we could go to Int128 next, but you likely want to go to BigInt directly (and possibly it will have a fast path for values that would have fit into Int64 or Int128).
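
A minimal sketch of how the width-tracking could look in Julia (the type name IntN and the Val(:unsafe) constructor are just illustrative, not the actual proposal): each value carries its bit budget as a type parameter, arithmetic only widens the budget, and the sole runtime check is at construction; a real implementation would switch to Int128 or BigInt once the budget passes 64 bits.

```julia
struct IntN{N}
    x::Int64   # always stored in a full register; only N bits are "promised"
    IntN{N}(x::Int64, ::Val{:unsafe}) where {N} = new{N}(x)   # internal, unchecked
end

# Checked construction from an ordinary integer: the only place a check happens.
function IntN{N}(x::Integer) where {N}
    lo, hi = -(Int64(1) << (N - 1)), (Int64(1) << (N - 1)) - 1
    lo <= x <= hi || throw(OverflowError("$x does not fit in $N bits"))
    return IntN{N}(Int64(x), Val(:unsafe))
end

# An add can grow the result by at most one bit; a multiply needs at most the
# sum of the operands' widths.  As long as the new width stays <= 64, the plain
# Int64 machine add/mul cannot wrap, so no runtime check is emitted at all.
Base.:+(a::IntN{N}, b::IntN{M}) where {N,M} = IntN{max(N, M) + 1}(a.x + b.x, Val(:unsafe))
Base.:*(a::IntN{N}, b::IntN{M}) where {N,M} = IntN{N + M}(a.x * b.x, Val(:unsafe))

Base.show(io::IO, a::IntN{N}) where {N} = print(io, "IntN{", N, "}(", a.x, ")")
```

With that, IntN{21}(1_000_000) * IntN{21}(1_000_000) * IntN{21}(1_000_000) comes out as an IntN{63}: still one register, and never checked along the way.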

Yes, it will be faster, because there will be no check as long as the result fits in one register, i.e. in the common case (for Int63 inputs to addition, and seemingly Int64 too; for Int64 multiply, Base.checked_mul seems much worse than my idea). With my scheme you get three bytes of assembly, just an add (addq or similar), which is as fast as an unchecked 32-bit add (and that seems to be shorter code than the 64-bit add, which gets me “lea rax, [rdi + rsi]”).
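
For anyone who wants to reproduce the asm comparison, it's easy to check from the REPL (the exact output depends on CPU and Julia version, but the checked variants all carry an extra overflow branch and error path):

```julia
julia> @code_native debuginfo=:none (+)(Int32(1), Int32(2))   # plain 32-bit add
julia> @code_native debuginfo=:none (+)(1, 2)                 # 64-bit add; here it comes out as lea rax, [rdi + rsi]
julia> @code_native debuginfo=:none Base.checked_add(1, 2)    # add plus an overflow branch to the throw path
julia> @code_native debuginfo=:none Base.checked_mul(1, 2)    # noticeably more code than checked_add
```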

The kicker is that it will work with SIMD code too (and provide 2x the memory bandwidth), unlike all other overflow-check schemes:

> No flags are touched by the underlying PADDD instruction. So to test this, you have to write additional code, depending on what you want to do.
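
To see the SIMD point in isolation, compare a reduction over narrow inputs widened into a 64-bit accumulator (which LLVM happily vectorizes, and which also reads half the bytes per element) against the same loop written with Base.checked_add, where the per-element overflow branch keeps it scalar. The function names are just illustrative:

```julia
# Inputs known by their type to fit in 32 bits can be accumulated into Int64
# with plain adds: no flags test is needed, because the sum cannot wrap until
# you have on the order of 2^32 elements, so LLVM emits a paddq/vpaddq loop.
function sum_widened(xs::Vector{Int32})
    s = zero(Int64)
    @inbounds @simd for x in xs
        s += Int64(x)
    end
    return s
end

# The same reduction with a per-element check: the conditional throw after
# every add blocks vectorization.
function sum_checked(xs::Vector{Int64})
    s = zero(Int64)
    @inbounds for x in xs
        s = Base.checked_add(s, x)
    end
    return s
end

# Compare the generated code with:
#   @code_llvm debuginfo=:none sum_widened(rand(Int32, 1000))
#   @code_llvm debuginfo=:none sum_checked(rand(Int64, 1000))
```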

There’s a reason they’re not the default, and as you show, the code expansion is huge. Non-taken branches are cheap, so I can believe 15% overhead or less; that statistic seems rather high if the branches were the only added code. But the really bad thing about those functions is the huge L1 instruction-cache pollution from all that extra code, and also that most people don’t use them or even know of them. Also, 1-letter operators are much better… or 0 letters, as in 2x.
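
On that last point, the ergonomic gap is easy to see side by side (2x being Julia's implicit multiplication):

```julia
x = 3
y1 = Base.checked_add(Base.checked_mul(2, x), 1)   # spelled-out checked calls
y2 = 2 * x + 1                                     # 1-letter operators
y3 = 2x + 1                                        # 0 letters for the multiply
```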