LUCENE-10312: Make stemming configurable on PersianAnalyzer #906

Closed
@@ -88,30 +88,43 @@ private static class DefaultSetHolder {
}

private final CharArraySet stemExclusionSet;
private final boolean useStemming;

/** Builds an analyzer with the default stop words: {@link #DEFAULT_STOPWORD_FILE}. */
public PersianAnalyzer() {
this(DefaultSetHolder.DEFAULT_STOP_SET);
}

/**
* Builds an analyzer with the given stop words
* Builds an analyzer with the default stop words: {@link #DEFAULT_STOPWORD_FILE}
*
* @param useStemming whether or not to enable stemming
*/
public PersianAnalyzer(boolean useStemming) {
this(DefaultSetHolder.DEFAULT_STOP_SET, useStemming, CharArraySet.EMPTY_SET);
}

/**
* Builds an analyzer with the given stop words and no stemming
*
* @param stopwords a stopword set
*/
public PersianAnalyzer(CharArraySet stopwords) {
this(stopwords, CharArraySet.EMPTY_SET);
this(stopwords, false, CharArraySet.EMPTY_SET);
}

/**
* Builds an analyzer with the given stop word. If a none-empty stem exclusion set is provided
* this analyzer will add a {@link SetKeywordMarkerFilter} before {@link PersianStemFilter}.
* Builds an analyzer with the given stop word. If a non-empty stem exclusion set is provided this
* analyzer will add a {@link SetKeywordMarkerFilter} before {@link PersianStemFilter}.
*
* @param stopwords a stopword set
* @param useStemming whether or not to enable stemming
* @param stemExclusionSet a set of terms not to be stemmed
*/
public PersianAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
public PersianAnalyzer(
CharArraySet stopwords, boolean useStemming, CharArraySet stemExclusionSet) {
Contributor

How about making this three-arg constructor private and keeping the two-arg constructor public PersianAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet), so that we keep API changes to a minimum in the next major release?

Contributor

I'd suppose that whenever stemExclusionSet is set, the useStemming flag is always true, so I think the three-arg constructor can be for internal use only.

super(stopwords);
this.useStemming = useStemming;
this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
}
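
The reviewers' suggestion above could look roughly like the sketch below. It is a self-contained illustration, not the Lucene code: `SketchAnalyzer` is a hypothetical stand-in for `PersianAnalyzer`, `Set<String>` stands in for `CharArraySet`, and the accessor methods exist only for the example. The choice that the public two-argument constructor implies `useStemming = true` is an assumption drawn from the second comment.

```java
import java.util.Set;

/** Hypothetical stand-in for PersianAnalyzer; Set<String> replaces CharArraySet. */
final class SketchAnalyzer {
  private final Set<String> stopwords;
  private final Set<String> stemExclusionSet;
  private final boolean useStemming;

  /** Stop words only: stemming stays off, as in the PR's one-argument constructor. */
  public SketchAnalyzer(Set<String> stopwords) {
    this(stopwords, false, Set.of());
  }

  /** Existing public signature is kept; assumed here to imply stemming. */
  public SketchAnalyzer(Set<String> stopwords, Set<String> stemExclusionSet) {
    this(stopwords, true, stemExclusionSet);
  }

  /** The three-argument form is private, so it adds nothing to the public API. */
  private SketchAnalyzer(
      Set<String> stopwords, boolean useStemming, Set<String> stemExclusionSet) {
    this.stopwords = Set.copyOf(stopwords);
    this.useStemming = useStemming;
    this.stemExclusionSet = Set.copyOf(stemExclusionSet);
  }

  // Accessors added only so the sketch can be inspected.
  Set<String> stopwords() { return stopwords; }
  boolean usesStemming() { return useStemming; }
  Set<String> stemExclusions() { return stemExclusionSet; }
}
```

With this shape, callers pick a behavior through the one- or two-argument constructors and never pass the boolean directly, which is what would keep the public surface unchanged.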

@@ -140,7 +153,10 @@ protected TokenStreamComponents createComponents(String fieldName) {
if (!stemExclusionSet.isEmpty()) {
result = new SetKeywordMarkerFilter(result, stemExclusionSet);
}
return new TokenStreamComponents(source, new PersianStemFilter(result));
if (useStemming) {
result = new PersianStemFilter(result);
}
return new TokenStreamComponents(source, result);
}
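
To make the effect of this hunk concrete, here is a toy model of the conditional chain. Nothing in it is Lucene API: `ChainSketch`, `buildChain`, and `toyStem` are hypothetical names, the "stemmer" just strips one trailing `s` from an English word, and the exclusion check is folded into the stemming step instead of using a separate keyword-marker stage.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy model of the conditional filter chain; not Lucene code. */
final class ChainSketch {
  /** Hypothetical stemmer: strips one trailing "s". PersianStemFilter's rules differ. */
  static String toyStem(String term) {
    return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
  }

  /** Stemming is applied only when enabled; excluded terms pass through unchanged. */
  static UnaryOperator<String> buildChain(boolean useStemming, Set<String> stemExclusionSet) {
    UnaryOperator<String> result = UnaryOperator.identity();
    if (useStemming) {
      result = term -> stemExclusionSet.contains(term) ? term : toyStem(term);
    }
    return result;
  }
}
```

Calling `ChainSketch.buildChain(false, Set.of())` leaves every term untouched, which corresponds to the new default behavior exercised by `testStemming`.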

@Override
@@ -227,4 +227,15 @@ public void testRandomStrings() throws Exception {
checkRandomData(random(), a, 200 * RANDOM_MULTIPLIER);
a.close();
}

public void testStemming() throws Exception {
{
PersianAnalyzer a = new PersianAnalyzer();
checkOneTerm(a, "دوستان", "دوستان");
}
{
PersianAnalyzer a = new PersianAnalyzer(true);
checkOneTerm(a, "دوستان", "دوست");
}
}
}
@@ -32,7 +32,7 @@ public class TestPersianStemFilter extends BaseTokenStreamTestCase {
@Override
public void setUp() throws Exception {
super.setUp();
a = new PersianAnalyzer();
a = new PersianAnalyzer(true);
}

@Override