Need for a Method hex2bytes(Vector{UInt8})


#1

Hi All,

Many a times hex encoded bytes are received as part of IO or network operations. And that will need to be converted to bytes for further processing. Current method expects a conversion to a form of AbstractString before proceeding further.

julia> b=[UInt8('a'),UInt8('b'),UInt8('c'),UInt8('d')]
4-element Array{UInt8,1}:
 0x61
 0x62
 0x63
 0x64

julia> b |> String |> hex2bytes
2-element Array{UInt8,1}:
 0xab
 0xcd

julia> b |> hex2bytes
ERROR: MethodError: no method matching hex2bytes(::Array{UInt8,1})
Closest candidates are:
  hex2bytes(::AbstractString) at strings/util.jl:445
Stacktrace:
 [1] |>(::Array{UInt8,1}, ::Base.#hex2bytes) at ./operators.jl:904

I feel the 3rd option may be helpful as some of the data IO operations can be substantial if the files are large and String conversion is an unnecessary operation.

Please share your ideas or if there is an alternate to address this.

regards,

Sambit


#2

A hex2bytes(::Vector{UInt8}) method seems perfectly reasonable to add, if you feel like submitting a PR.

See also https://github.com/JuliaLang/julia/issues/14418 for more general discussion of these hex methods.


#3

I tried a few tests with the same implementation approach as the hex2bytes(String) but as expected the performance was much worse than the String version. One reason I can think of is the byte arithmetic unless normalized to use the best of SIMD approaches may not be as efficient. Hence, did not proceed further on a PR. I will create the issue now. But if time permits I will attempt with a better algorithm.


#4

@stevengj

Here are various code I tried and forcing the computation to word boundary is almost giving the performance as expected in the hex2bytes(String) version.

If you think this is accepted performance, I can submit a PR for hex2bytes(::Vector{UInt8}) with from_hexstringB1() as the solution.

Benchmark numbers below:

julia> @benchmark f_hsS1() # Current hex2bytes(String)
BenchmarkTools.Trial: 
  memory estimate:  5.00 MiB
  allocs estimate:  2
  --------------
  minimum time:     54.510 ms (0.00% GC)
  median time:      54.870 ms (0.00% GC)
  mean time:        55.103 ms (0.12% GC)
  maximum time:     59.209 ms (0.00% GC)
  --------------
  samples:          91
  evals/sample:     1

julia> @benchmark f_hsB1() # With word boundary computation
BenchmarkTools.Trial: 
  memory estimate:  5.00 MiB
  allocs estimate:  2
  --------------
  minimum time:     46.227 ms (0.00% GC)
  median time:      46.959 ms (0.00% GC)
  mean time:        47.045 ms (0.14% GC)
  maximum time:     51.345 ms (0.00% GC)
  --------------
  samples:          107
  evals/sample:     1

julia> @benchmark f_hsB2() # With byte boundary computation
BenchmarkTools.Trial: 
  memory estimate:  84.99 MiB
  allocs estimate:  5242371
  --------------
  minimum time:     434.784 ms (0.72% GC)
  median time:      435.930 ms (0.73% GC)
  mean time:        443.935 ms (1.91% GC)
  maximum time:     513.806 ms (13.27% GC)
  --------------
  samples:          12
  evals/sample:     1

#5

Source code attached:

# This file is intended to be part of Julia. License is MIT: https://julialang.org/license

using BenchmarkTools
using Base.Test

to_hexstring(arr::Array{UInt8,1}) = join([hex(i, 2) for i in arr])

# This is the default benchmark for the calculation

from_hexstringS1(s::AbstractString) = hex2bytes(s)

# This uses the same logic as hex2bytes

function from_hexstringB1(s::Vector{UInt8})
    const DIGIT_ZERO     = UInt('0')
    const DIGIT_NINE     = UInt('9')
    const LATIN_UPPER_A  = UInt('A')
    const LATIN_UPPER_F  = UInt('F')
    const LATIN_A        = UInt('a')
    const LATIN_F        = UInt('f')

    len = length(s)
    if isodd(len)
        error("Input string length should be even")
    end
    arr = zeros(UInt8, div(len,2))
    i = j = 1
    # This line is important as this ensures computation happens in word boundary and not
    # byte boundary. Byte boundary computation can be almost 10 times slower.
    n = c = UInt(0)
    while i < len
        n = 0
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? n << 4 + c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? n << 4 + c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? n << 4 + c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        arr[j] = n
        j += 1
    end
    return arr
end

function from_hexstringB2(s::Vector{UInt8})
    const DIGIT_ZERO     = UInt8('0')
    const DIGIT_NINE     = UInt8('9')
    const LATIN_UPPER_A  = UInt8('A')
    const LATIN_UPPER_F  = UInt8('F')
    const LATIN_A        = UInt8('a')
    const LATIN_F        = UInt8('f')

    len = length(s)
    if isodd(len)
        error("Input string length should be even")
    end
    arr = zeros(UInt8, div(len,2))
    i = j = 1
    # This line is important as this ensures computation happens in word boundary and not
    # byte boundary. Boundary computation can be almost 10 times slower
    # n = c = UInt(0)
    while i < len
        n = 0
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        c = s[i]
        n = DIGIT_ZERO <= c <= DIGIT_NINE ? n << 4 + c - DIGIT_ZERO :
            LATIN_A <= c <= LATIN_F ? n << 4 + c - LATIN_A + 10 :
            LATIN_UPPER_A <= c <= LATIN_UPPER_F ? n << 4 + c - LATIN_UPPER_A + 10 : error("Input string isn't a hexadecimal string")
        i += 1
        arr[j] = n
        j += 1
    end
    return arr
end

const mb_10 = (10 << 20)

arr=UInt8[rand(['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f'])
            for x = 1:mb_10]
str = String(arr)
b_str = from_hexstringS1(str)

f_hsB1() = from_hexstringB1(arr)
f_hsB2() = from_hexstringB2(arr)
f_hsS1() = from_hexstringS1(str)

@test str == to_hexstring(from_hexstringB1(arr))
@test str == to_hexstring(from_hexstringB2(arr))
@test str == to_hexstring(from_hexstringS1(str))

# Executing to make sure the code gets JIT compiled
f_hsB1()
f_hsB2()
f_hsS1()


#6

Issue #23161 created in GitHub.

https://github.com/JuliaLang/julia/issues/23161