Potential solution to the overflow problem; 64-bit considered harmful, 32- or 21-bit better

We can implement FastIntegers.jl, faster than the current default Int64, and eliminate overflows from the language by leveraging Julia’s type system; no hardware support is needed. As a package it of course wouldn’t change the language itself; it would be a prototype (actually very useful as such), but even better if integrated into the language. That could happen for Julia 2.0, but not if people are against the idea.

If we consider even very simple functions:

double(x) = 2x

then it’s not safe for integers, i.e. they can overflow (strictly speaking, modular arithmetic is used, which not everyone wants or is familiar with). There was a recent discussion about this explaining the pros and cons of overflow checking, with it incurring maybe a 15% slowdown (and worse for SIMD-vectorized code), but there need be no overhead at all, even for SIMD code. We can have our cake and eat it too, potentially even 2x faster or more.
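For example, with the default Int64 the doubling silently wraps around:

julia> double(x) = 2x

julia> double(typemax(Int64))  # modular arithmetic: the true value 2^64 - 2 wraps to -2
-2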

[You can already get the bandwidth saving, 2x or 4x faster, by opting into smaller integer types, but people shy away from them, the more so the smaller they get, because the overflow potential increases. Under this scheme overflow would be eliminated for all types, so there would be no need for that fear. Currently all of Julia’s integer types can overflow except BigInt (and BigFloat).]

We need overflow checking to be fast for the most common operations, such as addition and multiplication, and the fastest way is to avoid it (entirely).

It’s best to explain (unsigned) multiplication first, then addition/subtraction.

If you multiply two numbers, the number of bits/digits potentially doubles (at worst; in reality usually only a few bits are added), e.g. 32-bit × 32-bit = 64-bit result (but currently Julia, or the CPU, drops the high 32 bits; hopefully, for unsigned numbers, they are all zeros). This uses the imull x86 assembly instruction, while 64-bit multiplication uses imulq, which is a bit slower; the saving, though, would mainly come from using half as much memory bandwidth.

Redefining multiplication this way (and similarly for all the integer types and operators) means there’s never any potential overflow to check for:

julia> import Base: *

julia> *(x::Int32, y::Int32) = Int64(x) * Int64(y)  # widen, so the product can never overflow
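With that method in place (it replaces the one in Base, i.e. type piracy, which is fine for a prototype):

julia> typemax(Int32) * Int32(2)  # the default Int32 method would wrap to -2
4294967294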

But this means you must be willing to use Int32, and you might complain that in Julia Int is 64-bit, but that’s not always the case, nor is it needed/wanted. On (still supported) 32-bit systems Int is already Int32, so if you want correct portable code you must test for that possibility anyway. You might argue we need Int64 to address more memory on 64-bit systems, but there you’re also wrong.

Very recent CPUs have 57-bit addressing; otherwise only 48 bits or fewer are available. So indexes into square arrays need only 23-27 bits for the theoretical maximum array sizes of Float64s, 256 TB to 128 PB (e.g. 2^57 bytes / 8 bytes per Float64 = 2^54 elements, and a square array of that size has sides of 2^27, i.e. 27-bit indexes per dimension).

The upside of using 32-bit integers or smaller is huge: 2x higher effective memory bandwidth and free overflow checks. In many cases you can go even lower, e.g. to 16-bit for 4x the memory bandwidth.

Now for addition: adding two numbers adds only 1 bit to the result, and additions are the most common arithmetic operation. So an Int33 would be useful, but a bit awkward, as you can’t store it in the same space (also bad if you subsequently need to multiply). With that type, or even just using Int64 as the result type for Int32 addition, we also avoid overflow for addition and subtraction. And if we have additional (non-power-of-2) types, we can do better.
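The same one-line trick works for addition (again overriding the method in Base, for demonstration only):

julia> import Base: +

julia> +(x::Int32, y::Int32) = Int64(x) + Int64(y)  # the sum gains at most 1 bit; Int64 is ample

julia> typemax(Int32) + Int32(1)  # 2^31, instead of wrapping to typemin(Int32)
2147483648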

And we actually do have Int16 and Int8, but it seems a bit excessive to go that low for a default type; I think that would need to be opt-in.

So what is bad about this? We of course have less range: only 65536 values for UInt16, which is restrictive, but 2097152 values for [U]Int21, i.e. plus or minus a million for signed integers. That seems like a useful potential default, if not Int32 or Int31.

I’m not suggesting the new types need to be bit-aligned in memory; that could be a later optimization, e.g. three Int21 values would fit in each 64-bit memory location (or register).
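To make the idea concrete, here is a minimal sketch of a width-tracking type; FastInt and its Bits parameter are hypothetical names of mine, not an existing package, and values are simply stored in a 64-bit word while the type parameter bounds their range:

import Base: +, -, *

struct FastInt{Bits}  # Bits = number of value bits, e.g. FastInt{21} plays the role of Int21
    x::Int64          # stored in a full machine word; Bits only bounds the range
end

# addition/subtraction grow the bit budget by one; multiplication adds the budgets
+(a::FastInt{N}, b::FastInt{M}) where {N,M} = FastInt{max(N,M)+1}(a.x + b.x)
-(a::FastInt{N}, b::FastInt{M}) where {N,M} = FastInt{max(N,M)+1}(a.x - b.x)
*(a::FastInt{N}, b::FastInt{M}) where {N,M} = FastInt{N+M}(a.x * b.x)

No operation above needs an overflow check as long as the resulting budget stays at 64 bits or below; a check (or widening) is only needed once it grows past that.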

Let’s take an example:

polynomial(x) = x^3 + 2x^2 - 1

For Int21 input we get Int21*Int21*Int21 = Int63, plus 2*Int21*Int21 = Int43, minus Int21; i.e. Int63 + Int43 + Int21.

If you evaluate this (polynomials) from left to right, then you get Int64 and then Int65, which is bad; from right to left, you get Int44 first and then Int64. This is something you might optimize in your program by rearranging, or the compiler could, because unlike for floating point, it’s allowed to.
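With the FastInt sketch from above, the bit budgets work out just as described (treating the literal 2 as a 1-bit budget, since multiplying by 2 adds exactly one bit, and the literal 1 as Int21):

julia> x = FastInt{21}(1_000_000)
FastInt{21}(1000000)

julia> x*x*x + FastInt{1}(2)*x*x - FastInt{21}(1)    # left to right: budget hits 65 bits
FastInt{65}(1000001999999999999)

julia> x*x*x + (FastInt{1}(2)*x*x - FastInt{21}(1))  # right to left: stays within 64 bits
FastInt{64}(1000001999999999999)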

Note also that 2x^2 need not evaluate as Int21*Int21*Int21, since the literal 2 needs to add only one bit (like addition), though the literals 3 and 4 would need to add 2 bits. For the power operator ^ Julia does a similar compile-time optimization, which it seems could be applied here too, with: @inline literal_pow(f, x, ::Val{p}) where {p} = f(x,p)
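Since Julia lowers x^2 to Base.literal_pow(^, x, Val(2)) at parse time, the hypothetical FastInt sketch above could hook into the same mechanism to return exactly-sized results for small literal powers:

# sketch: a literal square or cube just multiplies the bit budget, with no check needed
Base.literal_pow(::typeof(^), a::FastInt{N}, ::Val{2}) where {N} = FastInt{2N}(a.x * a.x)
Base.literal_pow(::typeof(^), a::FastInt{N}, ::Val{3}) where {N} = FastInt{3N}(a.x * a.x * a.x)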

Most calculations do not involve powers (which I actually think should give you a float in general, except for Unsigned or positive literal integer exponents, which can also be handled in a good way); I’m just showing that in many common cases they are not a problem either.

Most calculations are additions/subtractions, which only add 1 bit at a time (or, for multiplies, double the number of bits). If at any point you want to store results back to memory, then you have a choice: that location needs to be as large as the type you ended up with (which may actually not be larger, since the expanded type can be a bit of an illusion), or you (or the compiler) need to reduce the number of bits. The extra bits are most likely to be all zeros anyway (or all ones for negative numbers; neither is a problem). And if at any point you did e.g. a division converting to float (or any other operation converting to float, a rational, or BigInt), then you also do not have a problem. E.g.

dist = sqrt(x^2 + y^2)

That function would fail for all of Julia’s integer types (except BigInt) when you overflow, but not with my scheme for any integer type (even if Int64 is used for the inputs), because you always get a larger type before that happens; for highest speed you want to make sure first that your inputs are no larger than Int32.
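For instance, with the status quo, squaring two Int32 operands stays in Int32 and wraps already at 46341:

julia> x = Int32(46341); x * x  # 46341^2 = 2147488281 > typemax(Int32), so it wraps
-2147479015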

Julia already has an Int128 type (and Int256 up to Int1024 are available in a package), but defining Int64*Int64 = Int128 is going to be slower, which is why it’s important for the inputs to multiplication to be Int32 (or smaller), and to addition/subtraction Int63 or smaller. You can go higher, it only gets gradually slower, and then you can end with BigInt, or better, its much faster replacement ArbNumerics.jl (or potentially floats, if wanted).


If I’m understanding your proposal, it’s to have *(::Int32, ::Int32) give an Int64 and also to use Int32 by default (and more complicated variants of that kind of thing). I think the basic issue there is that as soon as you do one *, the result is an Int64, and you’re back where we started.


On X86, overflow checks should be cheap already because the carry and overflow flags will be set.
See MUL — Unsigned Multiply or IMUL — Signed Multiply for example.
Promoting to a larger integer size and then checking will not be faster.

You can confirm that Base.checked_add’s generated code is just an add followed by jo (jump if overflow flag is set):

# julia> @code_native debuginfo=:none syntax=:intel Base.checked_add(typemax(Int), 2)
        .text
        .file   "checked_add"
        .globl  julia_checked_add_5462          # -- Begin function julia_checked_add_5462
        .p2align        4, 0x90
        .type   julia_checked_add_5462,@function
julia_checked_add_5462:                 # @julia_checked_add_5462
        .cfi_startproc
# %bb.0:                                # %top
        mov     rdx, rsi
        mov     rax, rdi
        add     rax, rsi
        jo      .LBB0_1
# %bb.2:                                # %L7
        ret
.LBB0_1:                                # %L5
        sub     rsp, 8
        .cfi_def_cfa_offset 16
        mov     rsi, rdi
        movabs  rax, offset j_throw_overflowerr_binaryop_5464
        movabs  rdi, 140041153683800
        call    rax
        ud2
.Lfunc_end0:
        .size   julia_checked_add_5462, .Lfunc_end0-julia_checked_add_5462
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

Base.checked_add, Base.checked_mul, etc. all already exist and are going to be about as efficient as it gets.
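E.g. at the REPL (on a 64-bit system):

julia> Base.checked_add(typemax(Int), 1)
ERROR: OverflowError: 9223372036854775807 + 1 overflowed for type Int64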


No, because even if you get to Int64 and use it subsequently, it will be checked (unlike the status quo); but yes, you want to avoid going that far, for both safety AND performance reasons (with Int21 you have a budget of three multiplies before any slowdown, or more as explained with my polynomial example, or 43 additions). No matter the size of the resulting type you end up with, it will be checked. As I just explained, we could go to Int128 next, but likely you want to go to BigInt directly (and possibly it will have a fast path for values that would have fit into Int64 or Int128).

Yes, it will be faster, because there will be no check (as long as the result fits in one register), i.e. in the common case (for Int63 inputs to addition, and seemingly Int64 too; for Int64 multiplication, Base.checked_mul seems much worse than my idea). You get three bytes of assembly with my scheme, just an add (or addq or similar), which is as fast as an unchecked 32-bit add (and seems to be shorter code than the 64-bit add, which gets me “lea rax, [rdi + rsi]”).

The kicker is that it will work with SIMD code too (and provide the 2x memory bandwidth), unlike every other overflow-checking scheme:

No flags are touched by the underlying PADDD instruction.

So to test this, you have to write additional code, depending on what you want to do.

There’s some reason they’re not the default, and as you show the code expansion is huge. Non-taken branches are cheap; I can believe 15% overhead or less, and that statistic seems rather high if the branches were the only added code, but the bad thing about those functions is the huge L1 D-cache pollution, and also most people don’t use them or know of them. Also, 1-letter operators are much better... or 0 letters, as in 2x.

In order to compile an integer computation which completely avoids any runtime overflow checks, the compiler at the very least needs to be able to bound the size of the outputs in terms of what it knows about the inputs so that it can select a correctly sized integer type for the result.

For simple computations like a*b this is easy enough, but bounding the size of the output of some arbitrary piece of code is very hard. Therefore avoiding runtime checks is very hard.

The alternative is to rely on the user knowing such a bound in advance (using their domain knowledge of the particular computation), which if you think about it is how Julia presently behaves with Int arithmetic: if the user opted to use Int then the user must be sure the result cannot overflow (or doesn’t care if it does).

You can’t rely on that (and users don’t really opt in to Int or Int64 so much; I think they use them out of habit, as that’s the default, which is my main message):

julia> typemax(Int)
9223372036854775807

On 32-bit it’s “only” 2147483647, but yes, that’s already very generous (while more likely to overflow; e.g. intermediate calculations like the distance calculation I showed would overflow on 32-bit for x or y as high as 46341), which is why I suggested Int21 as a happy medium, with 1048576 as a maximum for input data; for output (and all intermediate numbers) the sky is the limit. If I learned anything from Steve Jobs/macOS, it’s that sensible defaults are good. I do not rule out smaller types being used, just not as defaults, if you need to choose one.

EDIT: I assumed ^ would give you Int32 on 32-bit, but actually Int32 input to it, or to *, widens to Int64 output (though Int8, Int16 and Int64 do not widen), at least on 64-bit; so my idea is already partially implemented, while it can still overflow for Int64 input, and most likely will for Int16 or smaller.

Functions are not a problem, but loops are... I explain a partial or full solution under B. And you don’t need to know or calculate the bound yourself; Julia gives you the types, you would only need maybe a debug mode to see them, or a linter (the overflow/safety problem turns into an optimization problem that compilers or linters can be good at, i.e. avoiding, or alerting you, when your code goes to a problematically larger/slower type or to BigInt).

A.
There are only 4 elementary integer operations to consider, and Julia (can) take care of those, with them simply redefined as explained. They are *, which you consider simple; + and -, which are equivalent to each other and simpler, adding only one bit; and then there’s the pesky power operator ^.

Julia already special-cases some literal powers, e.g. x^2 and x^3, making them simple, basically repeated *, which is already simple. For an arbitrary power x^p I argue it returns a rational, mathematically, either v/1 or 1/v where v can be arbitrarily large. This case can be handled with floating point, which takes care of the overflow checking for you for free (but it’s not modular arithmetic, which isn’t clearly better and so would need to be opt-in; then again, none of the other operators are modular any longer either).

Division already returns Float64, and power could do the same, consistent with it (though I’m rather considering Float32). Just as with the current division, this is a compromise over rationals and e.g. DecFP.
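E.g. how / behaves today:

julia> 7 / 2  # integer inputs, but the result already escapes the integer domain
3.5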

Every larger piece of code is built up from these, and Julia’s type inference takes care of the rest. Except:

At some point you need to put your value into some memory location (array) of a potentially different type than your inputs, and then the runtime check could be done, amortized over a long chain of calculations; or, if you actually want modular arithmetic, it can also be postponed until then (if you did promote to floating point, it’s already an error to try that, but x^p could continue to work if p is a UInt, in case BigInt is returned, as is appropriate in that case).

B.
Now for loops, if you do:

julia> s = Int21(0)  # hypothetical Int21 type; this could even start at typemax(Int21)

julia> for i in 1:8_000_000_000_000
           global s += 1  # and the increment could even be typemax(Int21)
       end

It looks like s would overflow, but it wouldn’t (necessarily). Yes, I proposed Int21 so it could fit in a 32-bit location (and register), but on modern CPUs the registers are 64-bit, and then you could loop for over a day on a 5 GHz CPU and still not overflow the register. Then, finally, you could do the split-nanosecond runtime check at the end of the loop, much beating the (possibly 15%, at least worst-case) overhead you would get with Base.checked_add on every increment. It’s just an example; you lose this advantage with the larger types you might go to. But the principle is the same: you could unroll the loop a few times and check every 16 iterations, and it would be safe for arbitrary loop counts, given types up to Int47. Everything larger could go to BigInt.
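A sketch of that amortized pattern (the function name and the 47-bit budget are illustrative only, not an existing API; Int32 inputs stand in for the hypothetical Int21): accumulate in a full 64-bit register and validate the bit budget once per 16-iteration block instead of on every add.

function sum_amortized(v::Vector{Int32})
    acc = Int64(0)
    budget = Int64(1) << 47  # illustrative bit budget; 47 bits plus 16 adds still fit in 64
    i = firstindex(v)
    while i + 15 <= lastindex(v)
        @inbounds for j in i:i+15  # 16 unchecked adds; acc cannot overflow Int64 here
            acc += v[j]
        end
        # one range check per block, instead of 16 calls to Base.checked_add
        abs(acc) <= budget || throw(OverflowError("bit budget exceeded; widen to Int128/BigInt"))
        i += 16
    end
    @inbounds for j in i:lastindex(v)  # remainder of fewer than 16 elements
        acc += v[j]
    end
    return acc
end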