LUCENE-10081: KoreanTokenizer should check the max backtrace gap on whitespaces (#272)

This change ensures that we don't skip consecutive whitespaces without checking the maximum backtrace gap.
jimczi authored Sep 6, 2021
1 parent 34f37d0 commit 4df8d64
Showing 2 changed files with 9 additions and 12 deletions.
3 changes: 3 additions & 0 deletions lucene/CHANGES.txt
@@ -484,6 +484,9 @@ Bug Fixes
 
 * LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached. (Greg Miller, Zachary Chen)
 
+* LUCENE-10081: KoreanTokenizer should check the max backtrace gap on whitespaces.
+  (Jim Ferenczi)
+
 Other
 ---------------------
 (No changes)
@@ -746,21 +746,15 @@ private void parse() throws IOException {
         System.out.println(" " + posData.count + " arcs in");
       }
 
-      // Move to the first character that is not a whitespace.
-      // The whitespaces are added as a prefix for the term that we extract,
-      // this information is then used when computing the cost for the term using
-      // the space penalty factor.
-      // They are removed when the final tokens are generated.
+      // We add single space separator as prefixes of the terms that we extract.
+      // This information is needed to compute the space penalty factor of each term.
+      // These whitespace prefixes are removed when the final tokens are generated, or
+      // added as separated tokens when discardPunctuation is unset.
       if (Character.getType(buffer.get(pos)) == Character.SPACE_SEPARATOR) {
-        int nextChar = buffer.get(++pos);
-        while (nextChar != -1 && Character.getType(nextChar) == Character.SPACE_SEPARATOR) {
-          pos++;
-          nextChar = buffer.get(pos);
+        if (buffer.get(++pos) == -1) {
+          pos = posData.pos;
         }
       }
-      if (buffer.get(pos) == -1) {
-        pos = posData.pos;
-      }
 
       boolean anyMatches = false;
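
For readers who want to see the effect of the change in isolation, here is a minimal, self-contained sketch. It is not the tokenizer's code: `WhitespaceGapSketch`, `largestGap`, `sample`, and the `maxGap` parameter are hypothetical names, and the "backtrace" is reduced to resetting a counter. It only models the idea from the commit message: if a whole run of whitespace is consumed in one pass of the loop, the gap check at the top of the loop cannot fire until the run is over.

```java
public class WhitespaceGapSketch {

    /**
     * Simulates a parse loop that checks a backtrace gap once per pass.
     * Returns the largest distance ever reached between the current
     * position and the last (simulated) backtrace position.
     *
     * All names here are illustrative assumptions, not Lucene APIs.
     */
    static int largestGap(String text, int maxGap, boolean skipWholeRun) {
        int pos = 0;
        int lastBacktrace = 0;
        int largest = 0;
        while (pos < text.length()) {
            // Stand-in for the tokenizer's gap check: runs once per pass.
            if (pos - lastBacktrace >= maxGap) {
                lastBacktrace = pos; // pretend pending tokens were emitted
            }
            boolean isSpace =
                Character.getType(text.charAt(pos)) == Character.SPACE_SEPARATOR;
            pos++; // consume one character (whitespace or not)
            if (isSpace && skipWholeRun) {
                // Pre-fix style: swallow the rest of the whitespace run
                // without returning to the gap check above.
                while (pos < text.length()
                    && Character.getType(text.charAt(pos)) == Character.SPACE_SEPARATOR) {
                    pos++;
                }
            }
            largest = Math.max(largest, pos - lastBacktrace);
        }
        return largest;
    }

    /** Builds "a", then n spaces, then "b". */
    static String sample(int n) {
        StringBuilder sb = new StringBuilder("a");
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.append('b').toString();
    }

    public static void main(String[] args) {
        String text = sample(20);
        System.out.println("one whitespace per pass: " + largestGap(text, 4, false));
        System.out.println("whole run per pass:      " + largestGap(text, 4, true));
    }
}
```

With one whitespace consumed per pass, the simulated gap never exceeds the limit; when the whole run is consumed at once, the gap grows with the length of the run.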
