Fix the RecordReader to pick incoming time column #3895
Codecov Report
@@ Coverage Diff @@
## master #3895 +/- ##
============================================
- Coverage 67.23% 67.21% -0.03%
Complexity 4 4
============================================
Files 1032 1032
Lines 50897 50884 -13
Branches 7109 7108 -1
============================================
- Hits 34220 34200 -20
- Misses 14339 14344 +5
- Partials 2338 2340 +2
Out of curiosity, why do we allow different incoming and outgoing time column names? What would be the use case that requires this feature? I think that it's better to enforce those two column names to be the same.
if (isPinotFieldSingleValue != isAvroFieldSingleValue) {
  String errorMessage = "Pinot field: " + fieldName + " is " + (isPinotFieldSingleValue ? "Single" : "Multi")
      + "-valued in Pinot schema but not in Avro schema";
  if (fieldSpec.getFieldType() == FieldSpec.FieldType.TIME) {
Is there a way to avoid repeating this validation logic in each record reader? It makes it hard for people to add a new type of record reader. For instance, an outside contributor is working on adding a Parquet reader in #3852.
One approach would be to always try to read the incoming and outgoing time column values and do the validation based on GenericRow.
Done. Most of the common logic is now maintained in RecordReaderUtils.
@@ -43,6 +43,7 @@ public static RecordReader getRecordReader(SegmentGeneratorConfig segmentGenerat
        return new CSVRecordReader(dataFile, schema, (CSVRecordReaderConfig) segmentGeneratorConfig.getReaderConfig());
      case JSON:
        return new JSONRecordReader(dataFile, schema);
      // TODO: PinotSegmentRecordReader abd ThriftRecordReader do not support default value or different incoming/outgoing time column
abd -> and
@snleee For example, if the incoming time column is named millisSinceEpoch with unit MILLISECONDS, it is confusing to convert this column to DAYS without changing the name to daysSinceEpoch.
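The naming concern can be illustrated with a small sketch. It relies only on `java.util.concurrent.TimeUnit` from the JDK; the class name, method name, and sample value are assumptions made up for illustration and do not come from the Pinot codebase.

```java
import java.util.concurrent.TimeUnit;

public class TimeUnitNamingSketch {
  /** Converts an incoming millisSinceEpoch value into a daysSinceEpoch value. */
  public static long toDaysSinceEpoch(long millisSinceEpoch) {
    return TimeUnit.MILLISECONDS.toDays(millisSinceEpoch);
  }

  public static void main(String[] args) {
    long millisSinceEpoch = 1_555_200_000_000L; // hypothetical sample value
    // After conversion the value no longer holds milliseconds, so keeping the
    // name "millisSinceEpoch" for the outgoing column would be misleading.
    long daysSinceEpoch = toDaysSinceEpoch(millisSinceEpoch);
    System.out.println(daysSinceEpoch); // prints 18000
  }
}
```

With distinct incoming and outgoing names, the column name keeps describing the unit of the values it stores.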
Fix AvroRecordReader, CSVRecordReader, JSONRecordReader, ThriftRecordReader to be able to read incoming time column

Move common util methods into RecordReaderUtils to achieve:
- If incoming and outgoing time column name are the same, use incoming field spec (data type)
- If incoming and outgoing time column name are different, try reading both of them
- If no value provided, do not fill default value for time column (don't allow default time column)

Other fixes:
- Fix the issue where AvroRecordReader does not convert the value into data type from schema
- Support reading empty string for all record readers (to be the same as AvroRecordReader behavior)
- Make ThriftRecordReader behave the same as other record readers

Add tests for all scenarios
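The three time column rules in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual RecordReaderUtils code: `TimeSpec`, `timeColumnsToRead`, and `extractTime` are hypothetical names, and a plain `Map` stands in for both the input record and GenericRow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimeColumnSketch {
  /** Hypothetical stand-in for a time field spec: only the two column names. */
  public static final class TimeSpec {
    final String incomingName;
    final String outgoingName;

    public TimeSpec(String incomingName, String outgoingName) {
      this.incomingName = incomingName;
      this.outgoingName = outgoingName;
    }
  }

  /** Same name: read one column. Different names: try reading both. */
  public static List<String> timeColumnsToRead(TimeSpec spec) {
    List<String> columns = new ArrayList<>();
    columns.add(spec.incomingName);
    if (!spec.incomingName.equals(spec.outgoingName)) {
      columns.add(spec.outgoingName);
    }
    return columns;
  }

  /** Copies the time values that are present; a missing value is never defaulted. */
  public static Map<String, Object> extractTime(Map<String, Object> record, TimeSpec spec) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (String column : timeColumnsToRead(spec)) {
      Object value = record.get(column);
      if (value != null) {
        row.put(column, value);
      }
    }
    return row;
  }
}
```

Because this logic depends only on the column names and the record contents, it can be shared by every reader instead of being repeated per input format.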
It seems we need to handle this for all the concrete classes of RecordReader. Can we leave some comments in the API doc of RecordReader, so that when we add more concrete classes, we follow the same convention?
@jackjlli Added javadoc to
    .addSingleValueDimension("unknown_dimension", FieldSpec.DataType.STRING)
    .addMetric("met_impressionCount", FieldSpec.DataType.LONG).addMetric("unknown_metric", FieldSpec.DataType.DOUBLE)
    .build();
private final Schema SCHEMA_SAME_INCOMING_OUTGOING = new Schema.SchemaBuilder()
I'm not quite sure about the logic here. Shouldn't SCHEMA_SAME_INCOMING_OUTGOING have the same TimeUnit for both? Here one is in Days and the other is in Seconds. Could you add some comments here?
Added comments
private final Schema SCHEMA_DIFFERENT_INCOMING_OUTGOING = new Schema.SchemaBuilder()
    .addTime("time_day", TimeUnit.SECONDS, FieldSpec.DataType.LONG, "column2", TimeUnit.DAYS, FieldSpec.DataType.INT)
    .build();
private final Schema SCHEMA_NO_INCOMING = new Schema.SchemaBuilder()
Same here. If the schema has no incoming, why do we need "incoming", TimeUnit.SECONDS, FieldSpec.DataType.LONG?
LGTM otherwise. Thank you for making the change for having a shared check for incoming, outgoing time field spec. This looks much cleaner.
value = RecordReaderUtils.convertToDataTypeArray((ArrayList) jsonValue, fieldSpec);
Object value = record.get(fieldName);
// Allow default value for non-time columns
if (value != null || fieldSpec.getFieldType() != FieldSpec.FieldType.TIME) {
Instead of ignoring null time column values, maybe it's better to throw an exception? What do you think?
We cannot throw an exception here, because it is valid for only one of the incoming or outgoing columns to be missing from the record. The time converter can catch this and throw an exception if no time column is set.
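A rough sketch of that division of responsibility: the reader leaves missing time values unset, and a later conversion step fails only when neither column has a value. `TimeConverterSketch` and `toOutgoingDays` are hypothetical names, not Pinot's actual time converter, and the SECONDS-to-DAYS units are an assumption borrowed from the test schema in this PR.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class TimeConverterSketch {
  /**
   * Resolves the outgoing time value in days.
   * Assumes incoming values are in SECONDS and outgoing values are in DAYS.
   */
  public static long toOutgoingDays(Map<String, Object> row, String incomingName, String outgoingName) {
    // With identical names the stored value is in the incoming format,
    // so it is only treated as "already outgoing" when the names differ.
    boolean sameName = incomingName.equals(outgoingName);
    Object outgoing = sameName ? null : row.get(outgoingName);
    if (outgoing != null) {
      return ((Number) outgoing).longValue();
    }
    Object incoming = row.get(incomingName);
    if (incoming != null) {
      return TimeUnit.SECONDS.toDays(((Number) incoming).longValue());
    }
    // Neither column was set by the reader: this is the only error case.
    throw new IllegalStateException(
        "No time value found for columns: " + incomingName + ", " + outgoingName);
  }
}
```

A record with only the incoming column or only the outgoing column still converts successfully; throwing inside the reader would reject both of those valid cases.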
LGTM. Thanks for working on this.