[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type #103

akshaybetala · 2016-05-17T02:27:17Z

Previously, a TEXT field did not enable positional indexing. This PR enables positional indexing in Lucene so that we can return character offsets and token offsets.

akshaybetala · 2016-05-17T02:30:00Z

@sandeepreddy602 @chenlica Please review.

chenlica · 2016-05-17T03:30:11Z

textdb/textdb-common/src/main/java/edu/uci/ics/textdb/common/utils/Utils.java

-	                    fieldName, (String) fieldValue, Store.YES);
-	            break;
-
+                org.apache.lucene.document.FieldType luceneFieldType = new org.apache.lucene.document.FieldType();


Add a comment to the codebase: "By default we enable positional indexing in Lucene so that we can return information about character offsets and token offsets."

chenlica · 2016-05-17T03:33:20Z

I left a few minor comments. It would be good if @sandeepreddy602 could also review it quickly. Then you can do the merge.

akshaybetala · 2016-05-17T22:29:42Z

@chenlica @sandeepreddy602 @prakul Please review.

chenlica · 2016-05-17T23:05:13Z

textdb/textdb-common/src/main/java/edu/uci/ics/textdb/common/field/Span.java

        this.fieldName = fieldName;
        this.start = start;
        this.end = end;
        this.key = key;
        this.value = value;
+        this.tokenOffset = -1;


Give "-1" a meaningful named constant such as "INVALID_TOKEN_OFFSET"?
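
A minimal sketch of what the suggested constant could look like. `SpanSketch` is a hypothetical, stripped-down stand-in for `Span` (only the offsets are kept), and `hasTokenOffset()` is an illustrative helper that is not part of the PR.

```java
// Hypothetical stand-in for Span; only the fields needed for the example are kept.
final class SpanSketch {
    // Sentinel meaning "no token offset is known for this span".
    // The name comes from the review suggestion above.
    static final int INVALID_TOKEN_OFFSET = -1;

    private final int start;
    private final int end;
    private final int tokenOffset;

    // Constructor for callers that only know the character offsets.
    SpanSketch(int start, int end) {
        this(start, end, INVALID_TOKEN_OFFSET);
    }

    // Constructor for callers that also know the token offset.
    SpanSketch(int start, int end, int tokenOffset) {
        this.start = start;
        this.end = end;
        this.tokenOffset = tokenOffset;
    }

    int getStart() {
        return start;
    }

    int getEnd() {
        return end;
    }

    int getTokenOffset() {
        return tokenOffset;
    }

    // Illustrative helper: lets callers avoid comparing against a bare -1.
    boolean hasTokenOffset() {
        return tokenOffset != INVALID_TOKEN_OFFSET;
    }
}
```

With a named constant, the intent of the default is visible at the assignment site and tests can compare against `SpanSketch.INVALID_TOKEN_OFFSET` instead of a literal.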

chenlica · 2016-05-17T23:15:47Z

Added @zuozhi and @rajesh9625 in case they are interested in reviewing as well.

chenlica · 2016-05-17T23:19:14Z

textdb/textdb-common/src/main/java/edu/uci/ics/textdb/common/field/ListField.java

@@ -38,7 +38,7 @@ public boolean equals(Object obj) {
        if (list == null) {
            if (other.list != null)
                return false;
-        } else if (!list.equals(other.list))
+        } else if (!list.containsAll(other.list))


Why don't we use equals()?

Because even though the lists have the same number of elements with the same values, equals() returns false, whereas containsAll() returns true:

other.list.containsAll(list) returns true
list.containsAll(other.list) returns true
other.list.equals(list) returns false

Is it because equals() cares about order, while we don't? If so, then the name "list" is not accurate. Should we rename it to "SetField"?

Why is the "list" name not accurate?

I am still not following. Why "other.list.equals(list) Returns False"?
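
For reference, `java.util.List.equals` is order-sensitive by contract, so two lists with the same elements in a different order are not equal, while `containsAll` ignores order and duplicates. Whether ordering is what actually happened in this PR is not confirmed in the thread; the sketch below only demonstrates the general behavior with plain strings. `ListEqualityDemo` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ListEqualityDemo {
    // Order-sensitive comparison: same size and equal elements at each index.
    static boolean sameSequence(List<String> a, List<String> b) {
        return a.equals(b);
    }

    // Order-insensitive comparison that also ignores duplicates:
    // each list must contain every element of the other.
    static boolean sameElements(List<String> a, List<String> b) {
        return a.containsAll(b) && b.containsAll(a);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("x", "y");
        List<String> reordered = Arrays.asList("y", "x");
        List<String> longer = Arrays.asList("x", "y", "z");

        System.out.println(sameSequence(first, reordered)); // false: order differs
        System.out.println(sameElements(first, reordered)); // true: same elements

        // A one-directional containsAll also accepts a strict superset.
        System.out.println(longer.containsAll(first));      // true
        System.out.println(sameElements(longer, first));    // false
    }
}
```

Note that a single `list.containsAll(other.list)` check, as in the diff, would treat a list as equal to any list whose elements it fully covers, so checking both directions (or comparing sets) is the stricter choice if order really should not matter.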

chenlica · 2016-05-17T23:43:31Z

textdb/textdb-storage/src/main/java/edu/uci/ics/textdb/storage/reader/DataReader.java

+                return  dTuple;
+            }
+
+            for(Attribute attr: attributeList){


Not following this code. Comments?

Added the comments in the code.

chenlica · 2016-05-17T23:44:48Z

I gave some comments. Please pay attention to the coding style. Also add necessary comments.

I feel this PR is a little too big to review. Is it possible to split it into two PRs (after you take care of some comments)?

akshaybetala · 2016-05-18T00:18:55Z

I also wanted to split the PR into two, but the changes to DataReader caused the test cases of the existing operators to fail, so I had to change them too.

chenlica · 2016-05-18T00:46:13Z

textdb/textdb-storage/src/main/java/edu/uci/ics/textdb/storage/reader/DataReader.java

+    private List<BytesRef> queryTokensInBytesRef;
+    // The schema of the data tuple
+    private Schema schema;
+    //The schema o the data tuple along with the span information.


Fix "o". Please be more careful about comments :-)

chenlica · 2016-05-18T00:50:38Z

OK not to split this PR into two PRs.

chenlica · 2016-05-18T00:58:02Z

textdb/textdb-storage/src/main/java/edu/uci/ics/textdb/storage/reader/DataReader.java

+            // This makes the seek faster.
+            this.queryTokens.sort(String.CASE_INSENSITIVE_ORDER);
+
+            // The terms in the term vector are stored as ByteRef,


I think I need to be educated here. What's the advantage of using BytesRef instead of a String for each query token? It would be good to add a comment to the code as well.

I have already added the comment. The terms in the term vector are stored as BytesRef. The seek function on the term vector only takes a BytesRef as input, so each string is converted to a BytesRef.
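
As background for the reply above: a Lucene `BytesRef` built from a string holds that string's UTF-8 bytes, and `BytesRef` values compare by unsigned byte value. The sketch below uses a plain `byte[]` from the JDK as a stand-in so that it runs without Lucene on the classpath. `TermBytesSketch` and its methods are hypothetical names; this is neither Lucene's API nor the code in this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TermBytesSketch {
    // Encode a token as UTF-8 bytes (stand-in for wrapping it in a BytesRef).
    static byte[] toUtf8(String token) {
        return token.getBytes(StandardCharsets.UTF_8);
    }

    // Compare byte arrays as unsigned bytes, left to right.
    // If one array is a prefix of the other, the shorter one sorts first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Convert every token to bytes and sort the result in unsigned byte order.
    static List<byte[]> toSortedTerms(List<String> tokens) {
        List<byte[]> terms = new ArrayList<>();
        for (String token : tokens) {
            terms.add(toUtf8(token));
        }
        terms.sort(TermBytesSketch::compareUnsigned);
        return terms;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("banana", "Apple", "apple");
        for (byte[] term : toSortedTerms(tokens)) {
            // Prints Apple, apple, banana (one per line).
            System.out.println(new String(term, StandardCharsets.UTF_8));
        }
    }
}
```

Doing the string-to-bytes conversion once per query token, rather than once per seek, keeps the per-document work limited to the seek itself.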

akshaybetala · 2016-05-18T20:57:25Z

@chenlica @sandeepreddy602 @zuozhi @rajesh9625 Can we finish this PR today, so that Team 4 can start on the phrase search implementation and finish it by the end of the week? I will be actively resolving comments today.

chenlica · 2016-05-18T21:24:24Z

@akshaybetala I will do one more review this afternoon. Hopefully I will have only minor comments for you before the merge. Other folks: please review it if you have time, or say so if you think it's ready to merge.

chenlica · 2016-05-18T21:30:58Z

textdb/textdb-dataflow/src/main/resources/queryrewriter/wordsEn.txt

@@ -3240,7 +3240,7 @@ analytically
 analyzable
 analyze
 analyzed
-analyzer
+luceneAnalyzer


Do we need this luceneAnalyzer or just analyzer?

chenlica · 2016-05-18T22:15:05Z

I left a few more minor comments for @akshaybetala.

The only remaining topic is the character offsets returned by Lucene. After we take care of this problem, the PR can be merged.


akshaybetala · 2016-05-18T23:51:44Z

@chenlica Can I go ahead and merge the PR?

Enabling Term Vectors

8412121

chenlica reviewed May 17, 2016
chenlica changed the title ~~Enabling Term Vectors~~ [Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type May 17, 2016

akshaybetala added 4 commits May 17, 2016 15:04

Adding support for Position information in Data Reader

9d6ced4

Adding comment

58b4062

Merge branch 'master' into team4-phrase-operator

5ea6e24

Merge from master

219d0fa

chenlica reviewed May 17, 2016
Adding comments and minor refactoring

1b0c1e4

chenlica reviewed May 18, 2016
Adding comments

0e6ec5a

chenlica reviewed May 18, 2016
akshaybetala added 2 commits May 18, 2016 13:42

Adding comments

e60a943

Merge branch 'master' into team4-phrase-operator

0995954

chenlica reviewed May 18, 2016
akshaybetala added 2 commits May 18, 2016 16:34

Merge remote-tracking branch 'origin/master' into team4-phrase-operator

674395b

Conflicts: textdb/textdb-dataflow/src/test/java/edu/uci/ics/textdb/dataflow/regexmatch/RegexMatcherTestHelper.java

Minor changes and comments

6c3ce95

akshaybetala merged commit db53ad5 into master May 18, 2016

akshaybetala deleted the team4-phrase-operator branch May 18, 2016 23:52
