[Issue #31] (Team 4) Enabling positional indexing in Lucene for TEXT type #103
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,22 +3,40 @@ | |
public class Span { | ||
//The name of the field (in the tuple) where this span is present | ||
private String fieldName; | ||
//The start of the span | ||
//The start of the span. It is the position of the first character of span in the document. | ||
private int start; | ||
//The end of the span | ||
//The end of the span.It is the position of the first character of span in the document | ||
Change it to "The end position of the span, which is the offset of the gap after the last character of the span."
Done |
||
private int end; | ||
//The key we are searching for eg: regex | ||
private String key; | ||
//The value matching the key | ||
private String value; | ||
|
||
|
||
public Span(String fieldName, int start, int end, String key, String value) { | ||
// The token position of the span | ||
Starting from 0?
Done |
||
private int tokenOffset; | ||
Since we added one more offset, it would be good to add comments explaining that "start" and "end" are character offsets. Also add an example to explain their meaning, and explain that character offsets refer to "gaps," and that "tokenOffset" starts from 0 (?).
Done. |
||
|
||
/* | ||
Example: | ||
Value = "The quick brown fox jumps over the lazy dog" | ||
Now the Span for brown should be | ||
start = 10 : position of character 'b' | ||
To be more precise, it should be the offset of the gap before the character 'b', starting from 0. @zuozhi and @rajesh9625 @sandeepreddy602 please chime in to confirm.
Since string indexes start from 0, it actually becomes the position of the character itself, not of the previous character. |
||
end = 15 : position of character 'n' | ||
Similarly, it should be the offset of the gap after the character 'n', starting from 0. @zuozhi @rajesh9625 @sandeepreddy602 please chime in to confirm. A deep issue is how to make our "character gap offsets" consistent with "lucene character offsets". For the text
Is it consistent with the character offsets returned by Lucene?
Folks ( @sandeepreddy602 @zuozhi @rajesh9625 ) please chime in on this important discussion about the meaning of "character offset" in a span, and whether Lucene is giving what we wanted. If not, we need to do a local translation from Lucene's offsets to our gap offsets.
@akshaybetala ... We should represent the gaps between the characters to specify a span (as seen in the above example by @chenlica ). It is the same as representing the range from the first character index to the last character index + 1. For example to represent Which will be same as gap representation as demonstrated below
I agree with this translation method suggested by @sandeepreddy602 . It's very local and easy to implement. @akshaybetala : please make this change. Make sure the test cases are consistent with the gap offsets.
This is what we are doing. We get the start with the indexOf() function, then add the length of the term to start to get the end.
@akshaybetala .. That's correct. That follows the gap representation.
Just want to confirm: did you do the "+1" operation for the "end" position?
Hi guys. @sandeepreddy602 @akshaybetala I'm testing Lucene's position and offset, but I have some problems here. Have you successfully got positions from Lucene? Can you please take a look at the code and see what's wrong?
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setStoreTermVectors(true);
ft.setStoreTermVectorPositions(true);
ft.setStoreTermVectorOffsets(true);
ft.setStoreTermVectorPayloads(true);
Document doc = new Document();
doc.add(new Field("title", "some title here title1", ft));
doc.add(new Field("content", "keyword data content number", ft));
writer.addDocument(doc);
// ......
SpanTermQuery spanQuery = new SpanTermQuery(new Term("content", "keyword"));
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(spanQuery, reader.numDocs());
ScoreDoc[] hits = docs.scoreDocs;
for (int i = 0; i < hits.length; ++i) {
int docID = hits[i].doc;
System.out.println("docID: " + docID);
System.out.println("content: " + searcher.doc(docID).getField("content").stringValue());
Terms terms = reader.getTermVector(hits[i].doc, "content");
TermsEnum termsEnum = terms.iterator();
PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS | PostingsEnum.OFFSETS);
while ((docID = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
System.out.println(docID);
int freq = postings.freq();
for (int j = 0; j < freq; j++) {
System.out.println(postings.nextPosition());
System.out.println(postings.startOffset());
System.out.println(postings.endOffset());
System.out.println(postings.getPayload());
}
}
}
And in line PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS | PostingsEnum.OFFSETS); I got
@zuozhi You might want to take a look at the changes I made in the DataReader. That code is working and gives you the position info.
Also, you are setting a few properties while creating the field org.apache.lucene.document.FieldType luceneFieldType = new org.apache.lucene.document.FieldType(); |
||
tokenOffset = 2 position of word 'brown' | ||
*/ | ||
|
||
public static int INVALID_TOKEN_OFFSET = -1; | ||
|
||
public Span(String fieldName, int start, int end, String key, String value){ | ||
this.fieldName = fieldName; | ||
this.start = start; | ||
this.end = end; | ||
this.key = key; | ||
this.value = value; | ||
this.tokenOffset = INVALID_TOKEN_OFFSET; | ||
} | ||
|
||
public Span(String fieldName, int start, int end, String key, String value, int tokenOffset) { | ||
this(fieldName, start, end, key, value); | ||
this.tokenOffset = tokenOffset; | ||
} | ||
|
||
public String getFieldName() { | ||
|
@@ -41,6 +59,8 @@ public int getEnd() { | |
return end; | ||
} | ||
|
||
public int getTokenOffset(){return tokenOffset;} | ||
|
||
@Override | ||
public int hashCode() { | ||
final int prime = 31; | ||
|
@@ -51,6 +71,7 @@ public int hashCode() { | |
result = prime * result + ((key == null) ? 0 : key.hashCode()); | ||
result = prime * result + start; | ||
result = prime * result + ((value == null) ? 0 : value.hashCode()); | ||
result = prime * result + tokenOffset; | ||
return result; | ||
} | ||
|
||
|
@@ -87,7 +108,10 @@ public boolean equals(Object obj) { | |
return false; | ||
} else if (!value.equals(other.value)) | ||
return false; | ||
|
||
|
||
if(tokenOffset!= other.tokenOffset) | ||
return false; | ||
|
||
return true; | ||
} | ||
} |
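The gap-offset convention discussed in the review above can be illustrated with a minimal sketch. It is not the PR's implementation: the class `GapOffsetSketch` and the method `locate` are hypothetical names, and the token offset is computed by counting whitespace-separated tokens, which only stands in for what a Lucene analyzer would report. The sketch also assumes the term is a whole token and looks only at its first occurrence.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the PR: illustrates gap offsets for a span.
final class GapOffsetSketch {

    /**
     * Returns {start, end, tokenOffset} for the first occurrence of term in text,
     * or null if the term does not occur.
     * start = offset of the gap before the first character (0-based),
     * end   = offset of the gap after the last character (start + term length),
     * tokenOffset = 0-based index of the token, assuming whitespace tokenization.
     */
    static int[] locate(String text, String term) {
        int start = text.indexOf(term);
        if (start < 0) {
            return null;
        }
        // Adding the term length yields the gap after the last character,
        // so text.substring(start, end) is exactly the term.
        int end = start + term.length();

        // Count the whitespace-separated tokens that precede the term.
        String prefix = text.substring(0, start).trim();
        int tokenOffset = prefix.isEmpty() ? 0 : prefix.split("\\s+").length;

        return new int[] {start, end, tokenOffset};
    }

    public static void main(String[] args) {
        String value = "The quick brown fox jumps over the lazy dog";
        int[] span = locate(value, "brown");
        System.out.println(Arrays.toString(span));              // [10, 15, 2]
        System.out.println(value.substring(span[0], span[1]));  // brown
    }
}
```

With this convention `end - start` equals the length of the matched text, and `substring(start, end)` needs no extra "+1" because the end already points at the gap after the last character.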
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,7 @@ | |
import org.apache.lucene.document.DateTools.Resolution; | ||
import org.apache.lucene.document.Field.Store; | ||
import org.apache.lucene.index.IndexableField; | ||
import org.apache.lucene.index.IndexOptions; | ||
|
||
import edu.uci.ics.textdb.api.common.Attribute; | ||
import edu.uci.ics.textdb.api.common.FieldType; | ||
|
@@ -49,18 +50,18 @@ public static IField getField(FieldType fieldType, String fieldValue) throws Par | |
case TEXT: | ||
field = new TextField(fieldValue); | ||
break; | ||
|
||
default: | ||
break; | ||
} | ||
return field; | ||
} | ||
|
||
public static IndexableField getLuceneField(FieldType fieldType, | ||
String fieldName, Object fieldValue) { | ||
String fieldName, Object fieldValue) { | ||
IndexableField luceneField = null; | ||
switch(fieldType){ | ||
case STRING: | ||
case STRING: | ||
luceneField = new org.apache.lucene.document.StringField( | ||
fieldName, (String) fieldValue, Store.YES); | ||
break; | ||
|
@@ -78,10 +79,22 @@ public static IndexableField getLuceneField(FieldType fieldType, | |
luceneField = new org.apache.lucene.document.StringField(fieldName, dateString, Store.YES); | ||
break; | ||
case TEXT: | ||
luceneField = new org.apache.lucene.document.TextField( | ||
fieldName, (String) fieldValue, Store.YES); | ||
break; | ||
|
||
//By default we enable positional indexing in Lucene so that we can return | ||
// information about character offsets and token offsets | ||
org.apache.lucene.document.FieldType luceneFieldType = new org.apache.lucene.document.FieldType(); | ||
Add comments to the codebase: "By default we enable positional indexing in Lucene so that we can return information about character offsets and token offsets." |
||
luceneFieldType.setIndexOptions( IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS ); | ||
luceneFieldType.setStored(true); | ||
luceneFieldType.setStoreTermVectors( true ); | ||
luceneFieldType.setStoreTermVectorOffsets( true ); | ||
luceneFieldType.setStoreTermVectorPayloads( true ); | ||
luceneFieldType.setStoreTermVectorPositions( true ); | ||
luceneFieldType.setTokenized( true ); | ||
|
||
luceneField = new org.apache.lucene.document.Field( | ||
fieldName,(String) fieldValue,luceneFieldType); | ||
|
||
break; | ||
|
||
} | ||
return luceneField; | ||
} | ||
|
@@ -96,10 +109,10 @@ public static ITuple getSpanTuple( List<IField> fieldList, List<Span> spanList, | |
IField[] fieldsDuplicate = fieldListDuplicate.toArray(new IField[fieldListDuplicate.size()]); | ||
return new DataTuple(spanSchema, fieldsDuplicate); | ||
} | ||
|
||
/** | ||
* | ||
* @param schema | ||
* | ||
* @param schema | ||
* @about Creating a new schema object, and adding SPAN_LIST_ATTRIBUTE to | ||
* the schema. SPAN_LIST_ATTRIBUTE is of type List | ||
*/ | ||
|
@@ -117,21 +130,25 @@ public static Schema createSpanSchema(Schema schema) { | |
|
||
/** | ||
* Tokenizes the query string using the given analyser | ||
* @param analyzer | ||
* @param luceneAnalyzer | ||
* @param query | ||
* @return ArrayList<String> list of results | ||
*/ | ||
public static ArrayList<String> tokenizeQuery(Analyzer analyzer, String query) { | ||
public static ArrayList<String> tokenizeQuery(Analyzer luceneAnalyzer, String query) { | ||
HashSet<String> resultSet = new HashSet<>(); | ||
ArrayList<String> result = new ArrayList<String>(); | ||
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(query)); | ||
TokenStream tokenStream = luceneAnalyzer.tokenStream(null, new StringReader(query)); | ||
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class); | ||
|
||
try{ | ||
tokenStream.reset(); | ||
while (tokenStream.incrementToken()) { | ||
String term = charTermAttribute.toString(); | ||
resultSet.add(term); | ||
String token = charTermAttribute.toString(); | ||
int tokenIndex = query.toLowerCase().indexOf(token); | ||
Why do we do this extra
Since tokens are converted to lower case, get the exact token from the query string. |
||
// Since tokens are converted to lower case, | ||
// get the exact token from the query string. | ||
String actualQueryToken = query.substring(tokenIndex, tokenIndex+token.length()); | ||
resultSet.add(actualQueryToken); | ||
} | ||
tokenStream.close(); | ||
} catch (Exception e) { | ||
|
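The case-recovery step in `tokenizeQuery` (look up the lower-cased token in the lower-cased query, then cut the same range out of the original query) can be sketched in isolation as follows. This is an illustration under assumptions, not the PR's code: `QueryTokenSketch` and `originalCaseTokens` are hypothetical names, a whitespace split plus lower-casing stands in for the Lucene analyzer, and lower-casing is assumed not to change the string's length. Unlike the diff, the sketch advances a search cursor so that a repeated token maps to its own occurrence rather than always to the first one.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the PR: recovers original-case query tokens.
final class QueryTokenSketch {

    /**
     * Returns the distinct tokens of the query in their original casing,
     * in order of first appearance. Tokenization here is a stand-in
     * (lower-case, split on whitespace) for a real analyzer.
     */
    static List<String> originalCaseTokens(String query) {
        // Assumption: lower-casing keeps every character at the same index.
        String lowered = query.toLowerCase(Locale.ROOT);
        Set<String> seen = new LinkedHashSet<>();
        int searchFrom = 0;

        for (String token : lowered.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty or blank query
            }
            // Search from the end of the previous match so repeated tokens
            // resolve to their own position in the query.
            int tokenIndex = lowered.indexOf(token, searchFrom);
            if (tokenIndex < 0) {
                continue;
            }
            int tokenEnd = tokenIndex + token.length();
            // Same range, original string: this restores the user's casing.
            seen.add(query.substring(tokenIndex, tokenEnd));
            searchFrom = tokenEnd;
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(originalCaseTokens("New York new")); // [New, York, new]
    }
}
```

Whether distinct casings of the same word should be kept as separate tokens is a design choice; the sketch keeps them, since the set holds original-case strings.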
Change it to "The start position of the span, which is the offset of the gap before the first character of the span."
Done