Converting UInt8 Array to BigInt, and back


#1

I am writing functions that would convert Array{UInt8,1} to BigInt and vice versa.

  1. From BigInt to Array I use to do as follows, through an hexadecimal string.
function int2bytes(x::Integer)
    hex = string(x, base=16)
    if mod(length(hex), 2) != 0
        hex = string("0", hex)
    end
    return hex2bytes(hex)
end

@gdkrmr suggested the following faster implementation which I am now using.

function int2bytes(x::BigInt)
    result = Array{UInt8}(undef, x.size * sizeof(eltype(x.d)))
    unsafe_copyto!(convert(Ptr{eltype(x.d)}, pointer(result)), x.d, x.size)
    if ENDIAN_BOM == 0x04030201
        result = result[end:-1:1]
    end
    i = findfirst(x -> x != 0x00, result)
    result[i:end]
end
  1. From Array to BigInt, I am also going through hexadecimal string at the moment.
function bytes2int(x::Array{UInt8,1})
    hex = bytes2hex(x)
    return parse(BigInt, hex, base=16)
end

Based on @gdkrmr suggestion I am now trying to implement this conversion without string but can’t make it work… I came up with the below code which works when broken down in REPL but crashes when running it in a function.

function bytes2int_new(x::Array{UInt8,1})
    l = length(x)
    if Sys.WORD_SIZE == 64
        T = UInt64
        xsize = cld(l, 8)
    elseif Sys.WORD_SIZE == 32
        T = UInt32
        xsize = cld(l, 4)
    end
    x = cat(fill(UInt8(0), xsize * 8 - l), x; dims=1)
    if ENDIAN_BOM == 0x04030201
        reverse!(x)
    end
    if xsize == 1
        return reinterpret(T,x)
    else
        result = big(0)
        unsafe_copyto!(result.d, convert(Ptr{T}, pointer(x)), xsize)
        result.size = xsize
        result.alloc = xsize + 1
        return result
    end
end

Clearly, there is something I do not understand about pointers and BigInt implementation…

First of all, is it at all possible to create a BigInt by assigning value to its fields?
How to properly assign those value?
What does the alloc field represent?

I may also be on a wrong path for this function, so any suggestions to avoid string representation is welcome for this function


#2

If x is a vector of UInt8, then the preferred way to convert it to a vector of BigInt should be BigInt.(x) (I guess this ends up calling some C-function in GMP, probably after converting from UInt8 to Int.)

Is that not fast enough?


#3

I think the question is about converting an array of UInt8 to a single BigInt, not to an array of BigInts.


#4

Oh. I see. Carry on.


#5

Yes, but proper initialization must be taken care of.
The problem in your implementation is likely that you are writing with unsafe_copyto! in memory regions which have not be allocated.
To create a BigInt which can represent a number with n bits, you can use a = Base.GMP.MPZ.realloc2(n). Then you don’t need to touch the .alloc field of a, and you can write n bits at a.d (or cld(n, 64) UInt64 words if the architecture is 64 bits); you also need to manually update the .size field of a.

Note also that this is internal API and will break eventually.


#6

This one is even faster:

function int2bytes3(x::BigInt)
    n_bytes_with_zeros = x.size * sizeof(Sys.WORD_SIZE)
    uint8_ptr = convert(Ptr{UInt8}, x.d)

    if ENDIAN_BOM == 0x04030201
        # the minimum should be 1, else the result array will be of
        # length 0
        n_bytes_without_zeros = 1
        for i in n_bytes_with_zeros:-1:1
            if unsafe_load(uint8_ptr, i) != 0x00
                n_bytes_without_zeros = i
                break
            end
        end

        result = Array{UInt8}(undef, n_bytes_without_zeros)

        for i in 1:n_bytes_without_zeros
            @inbounds result[n_bytes_without_zeros + 1 - i] = unsafe_load(uint8_ptr, i)
        end

        return result
    else
        error("not implemented... yet")
    end

end

#7

If this is calling the gmp library function I think it is calling, beware with the size, as n is limited by the library to INT_MAX. This caught me off guard a while back trying to compute A(4,3) of a two argument version of the Ackermann function… So much for “arbitrary size” :confused:

I have plans to create a (hopefully) pure Julia arbitrary size package, but it’s not even in it’s infancy…


#8

Thanks, it is working as a charm, 3 order of magnitude faster than the original code :slight_smile:
I did remove the step where it was filling Array with leading zero, proven useless and resources hungry.

Final code for reference:

function bytes2big(x::Array{UInt8,1})
    l = length(x)
    if Sys.WORD_SIZE == 64
        T = UInt64
        xsize = cld(l, 8)
    elseif Sys.WORD_SIZE == 32
        T = UInt32
        xsize = cld(l, 4)
    end
    if ENDIAN_BOM == 0x04030201
        reverse!(x)
    end
    result = Base.GMP.MPZ.realloc2(xsize * Sys.WORD_SIZE)
    result.size = xsize
    unsafe_copyto!(result.d, convert(Ptr{T}, pointer(x)), xsize)
    return result
end

Noted for future compatibility “issues”, I’ll keep an eye on it


#9

Good to know!

That would come handy, especially if the GMP API beaks at some point :slight_smile:


#10

Are there in-place operations in libgmp and are they exposed to julia somewhere?


#11

Yes, in the Base.GMP.MPZ module. Not all functions are exposed, you have to lookup that source file to know which ones are. In-place operations end with the ! suffix as is usual in Julia.

@roshii Also, I didn’t think about that earlier, but you probably should use Base.GMP.BITS_PER_LIMB and Base.GMP.Limb to refer to the architecture-dependant number of bits per “limb” and “limb” type, it may not correspond to Sys.WORD_SIZE on some architectures (Windows?).


#12

Lovely, thank again :slight_smile:
I assume the case for small endians would look as follows.

...
    else
        n_bytes_without_zeros = 1
        for i in 1:n_bytes_with_zeros
            if unsafe_load(uint8_ptr, i) != 0x00
                n_bytes_without_zeros = i
                break
            end
        end

        result = Array{UInt8}(undef, n_bytes_without_zeros)

        for i in 1:n_bytes_without_zeros
            @inbounds result[i] = unsafe_load(uint8_ptr, i)
        end
    end
    return result
end

#13

Windows…
Point taken, code looks even cleaner with it.

function bytes2big(x::Array{UInt8,1})
    xsize = cld(length(x), Base.GMP.BITS_PER_LIMB / 8)
    if ENDIAN_BOM == 0x04030201
        reverse!(x)
    end
    result = Base.GMP.MPZ.realloc2(xsize * Base.GMP.BITS_PER_LIMB)
    result.size = xsize
    unsafe_copyto!(result.d, convert(Ptr{Base.GMP.Limb}, pointer(x)), xsize)
    return result
end

#14

If you decide to go with the realloc2 approach, check the function in Base for error checking and thrown errors - libgmp will just signal abort if the new allocated size is too big, killing the calling process, so make sure that doesn’t happen! I don’t know off the top of my head if that particular function checks for problems before calling into libgmp.


#15

I indeed notice some strange behavior:

julia> x = [rand(UInt8)]
1-element Array{UInt8,1}:
 0x7d

julia> for i in 1:10
       println(bytes2big(x))
       end
125
125
125
125
125
125
125
125
125
125

julia> for i in 1:10
       println(bytes2big(int2bytes(125)))
       end
139767316545661
139767292756093
18446744073709486205
139767289413757
139767316545661
139767312023677
139766825746557
125
139767316545661
139767292756093

I assume I’d need to free pointer reference at some point in time but I’m just struggling here… Would anyone have a clue on what is causing result to vary like that?
How can I ensure a stable result?


#16

Experiencing same randomness with the following:

julia> for i in 1:10
              println(bytes2big([0x02]))
              end
139654643580930
18446744073709486082
139654643580930
2
18446744073709486082
139654697844738
2
18446744073709486082
139654697844738
139654643580930

#17

Some time ago I started something like this, but it turned out to require more work than I was willing to put into it: