Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hex2bytes for Vector{UInt8} #23267

Merged
merged 18 commits into from
Aug 22, 2017
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions base/exports.jl
Original file line number Diff line number Diff line change
Expand Up @@ -762,6 +762,7 @@ export
graphemes,
hex,
hex2bytes,
hex2bytes!,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need to be added to the stdlib doc index somewhere to show up in the rendered docs https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#adding-a-new-docstring-to-base

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

ind2chr,
info,
is_assigned_char,
Expand Down
78 changes: 77 additions & 1 deletion base/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -463,12 +463,88 @@ function hex2bytes(s::AbstractString)
return a
end

"""
hex2bytes(s::AbstractVector{UInt8})

Convert the hexadecimal bytes array to its binary representation. Returns
`Vector{UInt8}`, i.e. a vector of bytes.
Copy link
Member

@stevengj stevengj Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty unclear on what a "hexadecimal bytes vector" and "its binary representation" is. I would say something like:

Given an array `s` of ASCII codes for a sequence of hexadecimal digits, returns a `Vector{UInt8}` of
bytes  corresponding to the binary representation: each successive pair of hexadecimal digits in `s`
gives the value of one byte in the return vector.   (The length of `s` must be even, and the returned
array has half of the length of `s`.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change incorporated.

"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use and example here:


# Examples
```jldoctest
julia> hex2bytes(...)
end
```

Copy link
Member

@stevengj stevengj Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hex2bytes now has an example, moved from hex2bytes!.

Typically, when we have both an in-place and out-of-place version of a function, we only give examples in the simpler out-of-place version.

@inline function hex2bytes(s::AbstractVector{UInt8})
d = zeros(UInt8, div(endof(s), 2))
return hex2bytes!(d, s)
end

"""
hex2bytes!(d::AbstractVector{UInt8}, s::AbstractVector{UInt8})

Convert the hexadecimal bytes vector to its binary representation. The results are
populated into a destination vector. The function returns the destination vector.

# Examples
```jldoctest
julia> s = UInt8["01abEF"...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b"01abEF" is the simplest way to do this.

The slightly more verbose way would be Vector{UInt8}("01abEF"). The manual should not recommend using splatting.

6-element Array{UInt8,1}:
0x30
0x31
0x61
0x62
0x45
0x46

julia> d =zeros(UInt8, 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in last check-in.

3-element Array{UInt8,1}:
0x00
0x00
0x00

julia> hex2bytes!(d, s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns d now, not 3, right? The docstring above should be changed in regard to this as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also

The docstring above should be changed in regard to this as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

3-element Array{UInt8,1}:
0x01
0xab
0xef
```
"""
function hex2bytes!(d::AbstractVector{UInt8}, s::AbstractVector{UInt8})
i, j = start(s), 0
# This line is important as this ensures computation happens in word boundary and not
# byte boundary. Boundary computation can be almost 10 times slower
n::UInt = 0
c1::UInt = 0
c2::UInt = 0
Copy link
Member

@stevengj stevengj Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need explicit type declarations here? Why do you need to initialize c1 and c2 and n at all? The compiler will figure out that next returns a UInt8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to force computation to word boundary.

while !done(s, i)
n = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why initialize n to zero?

c1, i = next(s, i)
done(s, i) && throw(ArgumentError(
"string length must be even: length($(repr(s))) == $(length(s))"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this check be done on entry? Should probably not print the string, enough with the length.

Copy link
Contributor Author

@sambitdash sambitdash Aug 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check on beginning means you have a potential case of multiple passes of the buffer as there is no clarity in a AbstractVector if the size is pre-computed. And depend on the implementation details. Is there a significant benefit in precheck? Secondly the functionality cannot be achieved without a full scan due to the nature of the alogorithm used. Hence, my personal preference will be react only on failure and not check specifically for input.

Since, these are low level functions my normal approach will be to normalize or sanitize the data such that array length is made even by adding an extra zero. And bound the data btw "0x0-0xf" by applying proper filters than exception on failure. But that's a separate discussion.

c2, i = next(s, i)
n = number_from_hex(c1)
n <<= 4
n += number_from_hex(c2)
d[j+=1] = (n & 0xFF)
Copy link
Member

@stevengj stevengj Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be clearer to ditch the n variable and just do:

d[j += 1] = number_from_hex(c1) << 4 + number_from_hex(c2)

I don't see why the & 0xFF mask is needed, since number_from_hex should presumably always return a 4-bit quantity (and should be changed to return a UInt8 as I suggested below).

Copy link
Contributor Author

@sambitdash sambitdash Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole purpose of introducing UInt8 method was not to deteriorate performance significantly over the hex2bytes(s::AbstractString) method. And in the end it beats its performance by 10-15% at least. I guess before doing any changes to the code in terms of style and coding practices performance benchmarking be carried out. At this point I will not be able to invest further effort on this. But some of these typically compilers automatically identify and optimize but did not seem to be byte boundary computation was one of them.

The ideal code for this method should be written by taking 4 bytes at a time (sanitize the input rather than error check the input) and optimize the code for SIMD as this is supposed to be a very low level function for IO operations.

Hence, ignoring this comment unless there is a supporting benchmark data.

Copy link
Contributor Author

@sambitdash sambitdash Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevengj ideally your code and my code should generate the same code with most modern compilers. With the code I have written n can easily be spotted as a register on the CPU. It's an old practice may be irrelevant in today's compilers. I have not checked the generated code so used a conservative style of coding.

(n & 0xff) was a guidance to compiler not to create an InexactError exception block. Not sure if Julia actually honors that.

end
resize!(d, j)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work with arbitrary AbstractVector, ref #23267 (comment)

Copy link
Contributor Author

@sambitdash sambitdash Aug 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dependency on length removed. The array should not be printed as it an be large.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

julia> x = Vector{UInt8}(10); resize!(view(x, 1:10), 8)
ERROR: MethodError: no method matching resize!(::SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}, ::Int64)
Closest candidates are:
  resize!(::Array{T,1} where T, ::Integer) at array.jl:1020
  resize!(::BitArray{1}, ::Integer) at bitarray.jl:836

return d
end

@inline function number_from_hex(c::UInt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this UInt and not UInt8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it seems like this function should return a UInt8 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-word boundary computation can be 10 times slower or that's what was observed.

DIGIT_ZERO = UInt('0')
DIGIT_NINE = UInt('9')
LATIN_UPPER_A = UInt('A')
LATIN_UPPER_F = UInt('F')
LATIN_A = UInt('a')
LATIN_F = UInt('f')
Copy link
Member

@stevengj stevengj Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think defining these constants adds much to the clarity (on the contrary, it makes the code longer and hence harder to read); it would be more compact and at least as readable if you used UInt8('0') etcetera directly in the code.

Copy link
Contributor Author

@sambitdash sambitdash Aug 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again an issue of style and personal choice. Some people find balancing parenthesis an unnecessary overhead in code. Secondly, it gives the flexibility of changing the datatypes for someone at one place than carrying out the changes in code blocks. I could change all datatypes to UInt8 very easily if I have code written this way.


return (DIGIT_ZERO <= c <= DIGIT_NINE) ? c - DIGIT_ZERO :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 nested tertiaries is a bit hard to reason about. Perhaps something like

DIGIT_ZERO    <= c <= DIGIT_NINE    && return c - DIGIT_ZERO
LATIN_UPPER_A <= c <= LATIN_UPPER_F && return c - LATIN_UPPER_A + 10
LATIN_A       <= c <= LATIN_F       && return c - LATIN_A + 10
throw(ArgumentError("not a hexadecimal number: '$(Char(c))'"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a matter of personal preference and choice. A few lines above in hex2bytes(AbstractVector) has 3 nested tertiaries. Hence ignoring this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a matter of personal preference and choice. A few lines above in hex2bytes(AbstractVector) has 3 nested tertiaries. Hence ignoring this comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can fix it up at some other point, doesn't need to block this PR.

(LATIN_UPPER_A <= c <= LATIN_UPPER_F) ? c - LATIN_UPPER_A + 10 :
(LATIN_A <= c <= LATIN_F) ? c - LATIN_A + 10 :
throw(ArgumentError("Not a hexadecimal number"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps "$c is not a hexadecimal number"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

end

"""
bytes2hex(bin_arr::Array{UInt8, 1}) -> String

Convert an array of bytes to its hexadecimal representation.
All characters are in lower-case.

# Examples
```jldoctest
julia> a = hex(12345)
Expand Down
16 changes: 16 additions & 0 deletions test/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -247,3 +247,19 @@ bin_val = hex2bytes("07bf")

#non-hex characters
@test_throws ArgumentError hex2bytes("0123456789abcdefABCDEFGH")

@testset "Issue 23161" begin
arr = UInt8["0123456789abcdefABCDEF"...]
arr1 = Vector{UInt8}(11)
@test hex2bytes!(arr1, arr) == arr1
@test "0123456789abcdefabcdef" == bytes2hex(arr1)
@test hex2bytes("0123456789abcdefABCDEF") == hex2bytes(arr)
@test hex2bytes!(arr1, UInt8[""...]) == arr1
@test hex2bytes(UInt8[""...]) == UInt8[]

# odd size
@test_throws ArgumentError hex2bytes(UInt8["0123456789abcdefABCDEF0"...])

#non-hex characters
@test_throws ArgumentError hex2bytes(UInt8["0123456789abcdefABCDEFGH"...])
end