Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hex string as the representation of byte array for queries #4041

Merged
merged 3 commits into from
Apr 1, 2019

Conversation

xiangfu0
Copy link
Contributor

Use hex string to represent byte[] value in query.
Refer to #4040

@xiangfu0 xiangfu0 force-pushed the hex_string_for_byte_array branch from 0a7f4bf to 807e15c Compare March 30, 2019 01:42
byte[] bytes;
// Convert hex string to byte[].
if (rawValue instanceof String) {
bytes = DatatypeConverter.parseHexBinary((String) rawValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other parts of the code use org.apache.commons.codec.binary.Hex for encoding/decoding byte-array to String. Either one is fine, but all code should use the same encoding/decoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, updated with org.apache.commons.codec.binary.Hex api.

@codecov-io
Copy link

codecov-io commented Mar 30, 2019

Codecov Report

Merging #4041 into master will increase coverage by 0.06%.
The diff coverage is 61.11%.

Impacted file tree graph

@@             Coverage Diff             @@
##             master   #4041      +/-   ##
===========================================
+ Coverage     67.24%   67.3%   +0.06%     
  Complexity       20      20              
===========================================
  Files          1033    1033              
  Lines         51338   51366      +28     
  Branches       7181    7191      +10     
===========================================
+ Hits          34524   34574      +50     
+ Misses        14432   14401      -31     
- Partials       2382    2391       +9
Impacted Files Coverage Δ Complexity Δ
...impl/dictionary/BytesOffHeapMutableDictionary.java 76.36% <58.82%> (-4.59%) 0 <0> (ø)
.../impl/dictionary/BytesOnHeapMutableDictionary.java 75% <63.15%> (-4.32%) 0 <0> (ø)
...a/manager/realtime/RealtimeSegmentDataManager.java 50% <0%> (-50%) 0% <0%> (ø)
...er/validation/BrokerResourceValidationManager.java 50% <0%> (-31.25%) 0% <0%> (ø)
...cheduler/resources/PolicyBasedResourceManager.java 75% <0%> (-18.75%) 0% <0%> (ø)
...ore/query/scheduler/resources/ResourceManager.java 85.71% <0%> (-10.72%) 0% <0%> (ø)
...pache/pinot/core/util/SortedRangeIntersection.java 83.82% <0%> (-7.36%) 0% <0%> (ø)
...impl/dictionary/DoubleOnHeapMutableDictionary.java 68.88% <0%> (-6.67%) 0% <0%> (ø)
...mpl/dictionary/DoubleOffHeapMutableDictionary.java 72.72% <0%> (-5.46%) 0% <0%> (ø)
...e/operator/dociditerators/MVScanDocIdIterator.java 60.6% <0%> (-3.04%) 0% <0%> (ø)
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c71c24...6f34c39. Read the comment docs.

@xiangfu0 xiangfu0 force-pushed the hex_string_for_byte_array branch from 807e15c to 79305e8 Compare March 30, 2019 05:40
@kishoreg
Copy link
Member

can we add a test case?

@xiangfu0
Copy link
Contributor Author

can we add a test case?

Modified existing dictionary test to let BYTES dictionary index values contain half byte[] and half hex String.

Copy link
Contributor

@mayankshriv mayankshriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added minor comments, please address them before committing.

@@ -47,13 +51,24 @@
public BytesOffHeapMutableDictionary(int estimatedCardinality, int maxOverflowHashSize,
PinotDataBufferMemoryManager memoryManager, String allocationContext, int avgLength) {
super(estimatedCardinality, maxOverflowHashSize, memoryManager, allocationContext);
_byteStore = new MutableOffHeapByteArrayStore(memoryManager, allocationContext, estimatedCardinality, avgLength);
_byteStore = new MutableOffHeapByteArrayStore(memoryManager, allocationContext,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: This seems like a codestyle diff? Please ensure to use the recommend code style. https://pinot.readthedocs.io/en/latest/dev_env.html#intellij

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

byte[] bytes = (byte[]) rawValue;
byte[] bytes = null;
// Convert hex string to byte[].
if (rawValue instanceof String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the common case be of String of byte[]? Common case should be the first 'if' condition (same in other api's as well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common case is still byte[], just wanna shorten code to still keep assert for other data types, so put String type first. I will change the if logic below to move the common case up.

if (rawValue instanceof byte[]) {
...
} if (rawValue instanceof String) {
...
} else {
assert rawValue instanceof byte[];
}

@mcvsubbu
Copy link
Contributor

mcvsubbu commented Apr 1, 2019

Can you add some motivation behind the change? I am guessing (but not sure) that we want to ingest string data that are known to be hexbytes. Is that right? Is this a part of a bigger design document? If so, can you point me to one (or, add one under https://cwiki.apache.org/confluence/display/PINOT/Design+Documents)
thanks

@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Apr 1, 2019

Can you add some motivation behind the change? I am guessing (but not sure) that we want to ingest string data that are known to be hexbytes. Is that right? Is this a part of a bigger design document? If so, can you point me to one (or, add one under https://cwiki.apache.org/confluence/display/PINOT/Design+Documents)
thanks

My major motivation is related to this issue: #4040.
Basically I want to ingest data like uid as bytes format and write a lookup query on top of it. Since hex string is used to display bytes type in query results, I extend this concept and make hex string/bytes both accepted for this column.

@mcvsubbu
Copy link
Contributor

mcvsubbu commented Apr 1, 2019

Can you add some motivation behind the change? I am guessing (but not sure) that we want to ingest string data that are known to be hexbytes. Is that right? Is this a part of a bigger design document? If so, can you point me to one (or, add one under https://cwiki.apache.org/confluence/display/PINOT/Design+Documents)
thanks

My major motivation is related to this issue: #4040.
Basically I want to ingest data like uid as bytes format and write a lookup query on top of it. Since hex string is used to display bytes type in query results, I extend this concept and make hex string/bytes both accepted for this column.

Is it not possible to use a string datatype for this? Assuming that the string is byte-array encoded does not seem right to me. Perhaps I am missing something here

@xiangfu0 xiangfu0 force-pushed the hex_string_for_byte_array branch from eac1351 to 6f34c39 Compare April 1, 2019 17:33
@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Apr 1, 2019

Can you add some motivation behind the change? I am guessing (but not sure) that we want to ingest string data that are known to be hexbytes. Is that right? Is this a part of a bigger design document? If so, can you point me to one (or, add one under https://cwiki.apache.org/confluence/display/PINOT/Design+Documents)
thanks

My major motivation is related to this issue: #4040.
Basically I want to ingest data like uid as bytes format and write a lookup query on top of it. Since hex string is used to display bytes type in query results, I extend this concept and make hex string/bytes both accepted for this column.

Is it not possible to use a string datatype for this? Assuming that the string is byte-array encoded does not seem right to me. Perhaps I am missing something here

This is to match the data in select query. Basically in select uid from myTable limit 10 query, I got a random row with uid value shown as "c8b3bce0b378fc5ce8067fc271a34892" which is a hex string. And if I want to fetch this row with predicate, I should use the same hex string in my query, say
select * from myTable where uid="c8b3bce0b378fc5ce8067fc271a34892"

@xiangfu0 xiangfu0 merged commit 202009f into master Apr 1, 2019
@xiangfu0 xiangfu0 deleted the hex_string_for_byte_array branch April 1, 2019 21:56
xiangfu0 added a commit to winedepot/pinot that referenced this pull request Apr 1, 2019
…e#4041)

* Use hex string as the representation of byte array for queries

* Adding test for indexing hexstring

* Address comments
xiangfu0 added a commit to winedepot/pinot that referenced this pull request Apr 1, 2019
…e#4041) (#4)

* Use hex string as the representation of byte array for queries

* Adding test for indexing hexstring

* Address comments
Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fx19880617 can you please add a query example in https://pinot.readthedocs.io/en/latest/pql_examples.html

@apache apache deleted a comment from xiangfu1 Jun 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants