-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type #103
Conversation
@sandeepreddy602 @chenlica Please review. |
fieldName, (String) fieldValue, Store.YES); | ||
break; | ||
|
||
org.apache.lucene.document.FieldType luceneFieldType = new org.apache.lucene.document.FieldType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comments to the codebase: "By default we enable positional indexing in Lucene so that we can return information about character offsets and token offsets.""
I left a few minor comments. It will be good if @sandeepreddy602 can also review it quickly. Then you can do the merge. |
@chenlica @sandeepreddy602 @prakul Please Review. |
this.fieldName = fieldName; | ||
this.start = start; | ||
this.end = end; | ||
this.key = key; | ||
this.value = value; | ||
this.tokenOffset = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give "-1" to a meaning constant such as "INVALID_TOKEN_OFFSET"?
Added @zuozhi and @rajesh9625 in case they are interested to review as well. |
@@ -38,7 +38,7 @@ public boolean equals(Object obj) { | |||
if (list == null) { | |||
if (other.list != null) | |||
return false; | |||
} else if (!list.equals(other.list)) | |||
} else if (!list.containsAll(other.list)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use equals()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because even though the list has same number of elements, with same values, equals return false, where as containsAll returns True
other.list.containsAll(list) Returns True;
list.containsAll(others.list) Returns True;
other.list.equals(list) Returns False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it because equals()
cares about order, why we don't? If so, then the name "list" is not accurate. Should we rename it to "SetField"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the "list" name not accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not following. Why "other.list.equals(list) Returns False"?
return dTuple; | ||
} | ||
|
||
for(Attribute attr: attributeList){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not following this code. Comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the comments in the code.
I gave some comments. Please pay attention to the coding style. Also add necessary comments. I felt this PR is a little too big to review. Is it possible to split it into two PRs (after you take care of some comments)? |
Even I wanted to divide the PR into two, but the changes into the DataReader led to failure of the test cases of the existing operators. Hence had to change them too. |
private List<BytesRef> queryTokensInBytesRef; | ||
// The schema of the data tuple | ||
private Schema schema; | ||
//The schema o the data tuple along with the span information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix "o". Please be more careful about comments :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
OK not to split this PR into two PRs. |
// This makes the seek faster. | ||
this.queryTokens.sort(String.CASE_INSENSITIVE_ORDER); | ||
|
||
// The terms in the term vector are stored as ByteRef, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to be educated here. What's the advantage of using ByteRef instead of a String for each query token? It will be good to add the comment to the code as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already added the comment. The terms in the term vector are stored as ByteRef. The Seek function in the term Vector only take ByteRef as input, hence converting string to ByteRef.
@chenlica @sandeepreddy602 @zuozhi @rajesh9625 Can we finish this PR today? so that team4 can start on the Phrase Search Implementation and finish it by the end of the week. I will be pretty active on resolving the comments today. |
@akshaybetala I will do one more view this afternoon. Hopefully I have only minor comments for you before the merge. Other folks: please review it if you have time, or think it's ready to merge. |
@@ -3240,7 +3240,7 @@ analytically | |||
analyzable | |||
analyze | |||
analyzed | |||
analyzer | |||
luceneAnalyzer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this luceneAnalyzer
or just analyzer
?
I left a few more minor comments for @akshaybetala . The only remaining topic is the character offsets returned by Lucene. After we take care of this problem, the PR can be merged. |
Conflicts: textdb/textdb-dataflow/src/test/java/edu/uci/ics/textdb/dataflow/regexmatch/RegexMatcherTestHelper.java
@chenlica Can I go ahead and merge the PR? |
Previously a TEXT field doesn't enable positional indexing. In this PR we enable positional indexing in Lucene so that we can return information about character offsets and token offsets.