Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type #103

Merged
merged 11 commits into from
May 18, 2016

Conversation

akshaybetala
Copy link
Contributor

@akshaybetala akshaybetala commented May 17, 2016

Previously a TEXT field doesn't enable positional indexing. In this PR we enable positional indexing in Lucene so that we can return information about character offsets and token offsets.

@akshaybetala
Copy link
Contributor Author

@sandeepreddy602 @chenlica Please review.

fieldName, (String) fieldValue, Store.YES);
break;

org.apache.lucene.document.FieldType luceneFieldType = new org.apache.lucene.document.FieldType();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments to the codebase: "By default we enable positional indexing in Lucene so that we can return information about character offsets and token offsets.""

@chenlica chenlica changed the title Enabling Term Vectors [Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type May 17, 2016
@chenlica
Copy link
Collaborator

I left a few minor comments. It will be good if @sandeepreddy602 can also review it quickly. Then you can do the merge.

@akshaybetala
Copy link
Contributor Author

@chenlica @sandeepreddy602 @prakul Please Review.

this.fieldName = fieldName;
this.start = start;
this.end = end;
this.key = key;
this.value = value;
this.tokenOffset = -1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give "-1" to a meaning constant such as "INVALID_TOKEN_OFFSET"?

@chenlica
Copy link
Collaborator

Added @zuozhi and @rajesh9625 in case they are interested to review as well.

@@ -38,7 +38,7 @@ public boolean equals(Object obj) {
if (list == null) {
if (other.list != null)
return false;
} else if (!list.equals(other.list))
} else if (!list.containsAll(other.list))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we use equals()?

Copy link
Contributor Author

@akshaybetala akshaybetala May 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because even though the list has same number of elements, with same values, equals return false, where as containsAll returns True

other.list.containsAll(list) Returns True;
list.containsAll(others.list) Returns True;
other.list.equals(list) Returns False

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it because equals() cares about order, why we don't? If so, then the name "list" is not accurate. Should we rename it to "SetField"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the "list" name not accurate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not following. Why "other.list.equals(list) Returns False"?

return dTuple;
}

for(Attribute attr: attributeList){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not following this code. Comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comments in the code.

@chenlica
Copy link
Collaborator

I gave some comments. Please pay attention to the coding style. Also add necessary comments.

I felt this PR is a little too big to review. Is it possible to split it into two PRs (after you take care of some comments)?

@akshaybetala
Copy link
Contributor Author

Even I wanted to divide the PR into two, but the changes into the DataReader led to failure of the test cases of the existing operators. Hence had to change them too.

private List<BytesRef> queryTokensInBytesRef;
// The schema of the data tuple
private Schema schema;
//The schema o the data tuple along with the span information.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix "o". Please be more careful about comments :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@chenlica
Copy link
Collaborator

OK not to split this PR into two PRs.

// This makes the seek faster.
this.queryTokens.sort(String.CASE_INSENSITIVE_ORDER);

// The terms in the term vector are stored as ByteRef,
Copy link
Collaborator

@chenlica chenlica May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to be educated here. What's the advantage of using ByteRef instead of a String for each query token? It will be good to add the comment to the code as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already added the comment. The terms in the term vector are stored as ByteRef. The Seek function in the term Vector only take ByteRef as input, hence converting string to ByteRef.

@akshaybetala
Copy link
Contributor Author

@chenlica @sandeepreddy602 @zuozhi @rajesh9625 Can we finish this PR today? so that team4 can start on the Phrase Search Implementation and finish it by the end of the week. I will be pretty active on resolving the comments today.

@chenlica
Copy link
Collaborator

@akshaybetala I will do one more view this afternoon. Hopefully I have only minor comments for you before the merge. Other folks: please review it if you have time, or think it's ready to merge.

@@ -3240,7 +3240,7 @@ analytically
analyzable
analyze
analyzed
analyzer
luceneAnalyzer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this luceneAnalyzer or just analyzer?

@chenlica
Copy link
Collaborator

I left a few more minor comments for @akshaybetala .

The only remaining topic is the character offsets returned by Lucene. After we take care of this problem, the PR can be merged.

Conflicts:
	textdb/textdb-dataflow/src/test/java/edu/uci/ics/textdb/dataflow/regexmatch/RegexMatcherTestHelper.java
@akshaybetala
Copy link
Contributor Author

@chenlica Can I go ahead and merge the PR?

@akshaybetala akshaybetala merged commit db53ad5 into master May 18, 2016
@akshaybetala akshaybetala deleted the team4-phrase-operator branch May 18, 2016 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants