Need for a Method hex2bytes(Vector{UInt8})

sambitdash · August 3, 2017, 6:07pm

Hi All,

Hex-encoded bytes are often received as part of IO or network operations and need to be converted to raw bytes for further processing. The current hex2bytes method requires the data to be converted to some form of AbstractString first.

julia> b=[UInt8('a'),UInt8('b'),UInt8('c'),UInt8('d')]
4-element Array{UInt8,1}:
 0x61
 0x62
 0x63
 0x64

julia> b |> String |> hex2bytes
2-element Array{UInt8,1}:
 0xab
 0xcd

julia> b |> hex2bytes
ERROR: MethodError: no method matching hex2bytes(::Array{UInt8,1})
Closest candidates are:
  hex2bytes(::AbstractString) at strings/util.jl:445
Stacktrace:
 [1] |>(::Array{UInt8,1}, ::Base.#hex2bytes) at ./operators.jl:904

I feel the third form (calling hex2bytes directly on the byte vector) would be helpful: data IO can be substantial when files are large, and the String conversion is an unnecessary step.
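To make the request concrete, here is a minimal sketch of what decoding straight from a byte vector could look like, without going through String. The names nibble and hex2bytes_vec are hypothetical (they are not part of Base), and the sketch is written for Julia 1.x syntax rather than the 0.6 release shown in the REPL session above.

```julia
# Hypothetical helper: map one ASCII hex digit to its 4-bit value.
# Every branch returns a UInt8, so the function is type-stable.
function nibble(c::UInt8)::UInt8
    if UInt8('0') <= c <= UInt8('9')
        return c - UInt8('0')
    elseif UInt8('a') <= c <= UInt8('f')
        return c - UInt8('a') + 0x0a
    elseif UInt8('A') <= c <= UInt8('F')
        return c - UInt8('A') + 0x0a
    else
        throw(ArgumentError("byte $(repr(c)) is not a hexadecimal digit"))
    end
end

# Hypothetical name: decode pairs of ASCII hex digits held in a byte vector.
function hex2bytes_vec(s::Vector{UInt8})
    isodd(length(s)) && throw(ArgumentError("input length must be even"))
    out = Vector{UInt8}(undef, length(s) ÷ 2)
    for j in eachindex(out)
        hi = nibble(s[2j - 1])   # high nibble comes first
        lo = nibble(s[2j])
        out[j] = (hi << 4) | lo
    end
    return out
end

b = UInt8[0x61, 0x62, 0x63, 0x64]   # the bytes of "abcd"
hex2bytes_vec(b)                    # same result as b |> String |> hex2bytes
```

Keeping the per-digit lookup in its own small function is a design choice of this sketch, not something the existing hex2bytes is known to do.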

Please share your ideas, or let me know if there is an alternative way to address this.

regards,

Sambit

stevengj · August 3, 2017, 7:52pm

A hex2bytes(::Vector{UInt8}) method seems perfectly reasonable to add, if you feel like submitting a PR.

See also https://github.com/JuliaLang/julia/issues/14418 for more general discussion of these hex methods.

sambitdash · August 6, 2017, 4:37pm

I tried a few tests using the same implementation approach as hex2bytes(String), but as expected the performance was much worse than the String version. One reason I can think of is that byte arithmetic, unless normalized to make good use of SIMD, may not be as efficient. Hence I did not proceed with a PR. I will create the issue now, and if time permits I will attempt a better algorithm.

sambitdash · August 6, 2017, 5:55pm

@stevengj

Here are the variants I tried. Forcing the computation onto word-sized values gives performance on par with the hex2bytes(String) version.

If you think this performance is acceptable, I can submit a PR for hex2bytes(::Vector{UInt8}) with from_hexstringB1() as the solution.

Benchmark numbers below:

julia> @benchmark f_hsS1() # Current hex2bytes(String)
BenchmarkTools.Trial: 
  memory estimate:  5.00 MiB
  allocs estimate:  2
  --------------
  minimum time:     54.510 ms (0.00% GC)
  median time:      54.870 ms (0.00% GC)
  mean time:        55.103 ms (0.12% GC)
  maximum time:     59.209 ms (0.00% GC)
  --------------
  samples:          91
  evals/sample:     1

julia> @benchmark f_hsB1() # With word boundary computation
BenchmarkTools.Trial: 
  memory estimate:  5.00 MiB
  allocs estimate:  2
  --------------
  minimum time:     46.227 ms (0.00% GC)
  median time:      46.959 ms (0.00% GC)
  mean time:        47.045 ms (0.14% GC)
  maximum time:     51.345 ms (0.00% GC)
  --------------
  samples:          107
  evals/sample:     1

julia> @benchmark f_hsB2() # With byte boundary computation
BenchmarkTools.Trial: 
  memory estimate:  84.99 MiB
  allocs estimate:  5242371
  --------------
  minimum time:     434.784 ms (0.72% GC)
  median time:      435.930 ms (0.73% GC)
  mean time:        443.935 ms (1.91% GC)
  maximum time:     513.806 ms (13.27% GC)
  --------------
  samples:          12
  evals/sample:     1
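One possible reading of the f_hsB2 numbers, offered as an interpretation and not something confirmed in this thread: the roughly five million allocations (about one per output byte) look more like boxing from a type-unstable variable than a cost of byte-width arithmetic as such. With UInt8 constants, the digit branch yields a UInt8 while the letter branches add the Int literal 10 and yield an Int. With UInt constants, every branch promotes to UInt. The small check below shows the promotion rules involved; the variable names are only for illustration.

```julia
c = UInt8('b')

# Byte-sized constants: the digit branch stays UInt8 ...
digit_branch = c - UInt8('0')
# ... but adding the Int literal 10 promotes the letter branch to Int.
letter_branch = c - UInt8('a') + 10

println(typeof(digit_branch))    # UInt8
println(typeof(letter_branch))   # Int (Int64 on a 64-bit build)

# Word-sized constants: both branches promote to UInt.
word_digit  = c - UInt('0')
word_letter = c - UInt('a') + 10

println(typeof(word_digit))      # UInt
println(typeof(word_letter))     # UInt
```

If that reading is right, writing the literal as 0x0a (or UInt8(10)) in the byte-sized variant would keep every branch at UInt8; this has not been benchmarked here.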

sambitdash · August 6, 2017, 5:56pm

Source code attached:

# This file is intended to be part of Julia. License is MIT: https://julialang.org/license

using BenchmarkTools
using Base.Test

to_hexstring(arr::Array{UInt8,1}) = join([hex(i, 2) for i in arr])

# Baseline for the benchmark: the existing hex2bytes(::AbstractString) method

from_hexstringS1(s::AbstractString) = hex2bytes(s)

# This uses the same logic as hex2bytes

function from_hexstringB1(s::Vector{UInt8})
    const DIGIT_ZERO     = UInt('0')
    const DIGIT_NINE     = UInt('9')
    const LATIN_UPPER_A  = UInt('A')
    const LATIN_UPPER_F  = UInt('F')
    const LATIN_A        = UInt('a')
    const LATIN_F        = UInt('f')

    len = length(s)
    if isodd(len)
        error("Input string length should be even")
    end
    arr = zeros(UInt8, div(len,2))
    i = j = 1
    # The next line matters: together with the UInt constants above, it keeps the
    # arithmetic word-sized instead of byte-sized. The byte-sized variant
    # (from_hexstringB2) can be almost 10 times slower.
    n = c = UInt(0)
    while i < len
        n = 0
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? n << 4 + c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? n << 4 + c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? n << 4 + c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        arr[j] = n
        j += 1
    end
    return arr
end

function from_hexstringB2(s::Vector{UInt8})
    const DIGIT_ZERO     = UInt8('0')
    const DIGIT_NINE     = UInt8('9')
    const LATIN_UPPER_A  = UInt8('A')
    const LATIN_UPPER_F  = UInt8('F')
    const LATIN_A        = UInt8('a')
    const LATIN_F        = UInt8('f')

    len = length(s)
    if isodd(len)
        error("Input string length should be even")
    end
    arr = zeros(UInt8, div(len,2))
    i = j = 1
    # The word-sized initialization is deliberately commented out in this variant,
    # so the arithmetic uses the UInt8 constants above. This is the version that
    # benchmarks almost 10 times slower.
    # n = c = UInt(0)
    while i < len
        n = 0
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? n << 4 + c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? n << 4 + c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? n << 4 + c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        arr[j] = n
        j += 1
    end
    return arr
end

const mb_10 = (10 << 20)

arr=UInt8[rand(['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f'])
            for x = 1:mb_10]
str = String(arr)
b_str = from_hexstringS1(str)

f_hsB1() = from_hexstringB1(arr)
f_hsB2() = from_hexstringB2(arr)
f_hsS1() = from_hexstringS1(str)

@test str == to_hexstring(from_hexstringB1(arr))
@test str == to_hexstring(from_hexstringB2(arr))
@test str == to_hexstring(from_hexstringS1(str))

# Executing to make sure the code gets JIT compiled
f_hsB1()
f_hsB2()
f_hsS1()
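The script above targets Julia 0.6 (Base.Test and hex(i, 2) are from that release). As a rough sketch for newer releases, assuming Julia 1.x, the same round-trip check can be written with the Test standard library and Base's bytes2hex in place of the to_hexstring helper. The sample bytes are made up for illustration.

```julia
using Test

bytes  = UInt8[0x00, 0x0f, 0xab, 0xff]   # illustrative sample data
hexstr = bytes2hex(bytes)                # two lowercase hex digits per byte

@test hexstr == "000fabff"
@test hex2bytes(hexstr) == bytes         # decoding restores the original bytes
```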

sambitdash · August 7, 2017, 5:56am

Issue #23161 created in GitHub.

https://github.com/JuliaLang/julia/issues/23161
