LUCENE-10312: Add PersianStemmer #540
apply :lucene:analysis:common:spotlessApply
add org.apache.lucene.analysis.fa.PersianStemFilterFactory
fix: Test PersianStemFilterFactory
Hi there. We (the Lucene.NET project) are waiting for approval of this stemmer before we will accept it into our codebase (apache/lucenenet#571). We aren't really sure how analysis components are vetted, so please let us know if there is anything else required for this to be accepted.
I'm sorry for the late response. I just kicked the CI - I'll take a look.
Hi Tomoko (@mocobeta).
Yes, that's correct; this has now passed the tests/checks.
Maybe it's worth adding PersianStemFilter to the Javadoc of PersianAnalyzer#createComponents().
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

/**
 * A {@link TokenFilter} that applies {@link PersianStemmer} to stem Arabic words..
My IDE says "Two consecutive dots"; it looks like a typo.
/**
 * Factory for {@link PersianStemFilter}.
 *
 * <pre class="prettyprint">
 * <fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
 *   <analyzer>
 *     <tokenizer class="solr.StandardTokenizerFactory"/>
 *     <filter class="solr.PersianNormalizationFilterFactory"/>
 *     <filter class="solr.PersianStemFilterFactory"/>
 *   </analyzer>
 * </fieldType></pre>
 *
 * @since 3.1
 * @lucene.spi {@value #NAME}
 */
This Solr schema example is obsolete and no longer needed in Lucene Javadoc; can you please remove the XML? Instead, you can list the parameters like this.
Also, I suppose @since should be 9.2.0 (the next minor release).
PersianStemFilterFactory takes no parameters, so you can just delete <pre>...</pre>.
/**
 * Stemmer for Persian.
 *
It'd be worth mentioning what algorithm is used/implemented in the stemmer if it's possible.
For example, see
- https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/bg/BulgarianStemmer.java
- https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java
- https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/hi/HindiStemmer.java
I found that ArabicStemmer does not mention the algorithms or rules it is based on. As @NightOwl888 told me, this PersianStemmer is derived from it, so I'm fine with the Javadoc as is.
public static final char ALEF = '\u0627';
public static final char HEH = '\u0647';
public static final char TEH = '\u062A';
public static final char REH = '\u0631';
public static final char NOON = '\u0646';
public static final char YEH = '\u064A';
public static final char ZWNJ = '\u200c';

public static final char[][] suffixes = {
  ("" + ALEF + TEH).toCharArray(),
  ("" + ALEF + NOON).toCharArray(),
  ("" + TEH + REH + YEH + NOON).toCharArray(),
  ("" + TEH + REH).toCharArray(),
  ("" + YEH + YEH).toCharArray(),
  ("" + YEH).toCharArray(),
  ("" + HEH + ALEF).toCharArray(),
  ("" + ZWNJ).toCharArray(),
};
Can these constants be private?
Since this is based on the ArabicStemmer where these are public, it would seem odd to make them public in one case and private in the other. Same goes for the stem, stemSuffix and stemPrefix methods.
Thanks for the pointer. I hadn't noticed that.
The original ArabicStemmer was added in 2010, and those public static constants seem unchanged since then. Nowadays it is considered bad practice to expose constants, variables, or methods unnecessarily. Exposing the suffixes char array is especially unsafe: its contents are still mutable even though the field is marked final.
Please keep class members private as far as possible. I will open an issue for ArabicStemmer to make those members private.
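To make the mutability concern above concrete, here is a minimal sketch. The class SuffixExposureDemo and all of its members are hypothetical names that only mirror the shape of the constants in this patch; it is not Lucene code. It shows that final protects the reference but not the array contents, and one possible way to hand out read access safely.

```java
// Hypothetical class for illustration only; not part of Lucene.
final class SuffixExposureDemo {
  static final char ALEF = '\u0627';
  static final char TEH = '\u062A';

  // Exposed the way the patch under review does: the reference is final,
  // but any caller can still overwrite the array elements.
  public static final char[][] EXPOSED_SUFFIXES = {("" + ALEF + TEH).toCharArray()};

  // Private alternative: outside code cannot reach the array at all.
  private static final char[][] HIDDEN_SUFFIXES = {("" + ALEF + TEH).toCharArray()};

  // If read access is ever needed, return a deep copy instead of the array itself.
  static char[][] copyOfHiddenSuffixes() {
    char[][] copy = new char[HIDDEN_SUFFIXES.length][];
    for (int i = 0; i < copy.length; i++) {
      copy[i] = HIDDEN_SUFFIXES[i].clone();
    }
    return copy;
  }

  public static void main(String[] args) {
    // Compiles fine and changes the shared state seen by every other caller.
    EXPOSED_SUFFIXES[0][0] = 'x';
    System.out.println("exposed array was modified: " + (EXPOSED_SUFFIXES[0][0] == 'x'));

    // Mutating a copy leaves the private array untouched.
    char[][] copy = copyOfHiddenSuffixes();
    copy[0][0] = 'x';
    System.out.println("hidden array is intact: " + (copyOfHiddenSuffixes()[0][0] == ALEF));
  }
}
```

Whether a copying accessor is needed at all depends on the callers; for a stemmer used only internally, simply making the field private is probably enough.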
Gotcha. Should that be expanded to include ArabicNormalizer and PersianNormalizer?
Ah yes, I think so. That is a separate issue; we can improve them later.
}

/**
 * Stem suffix(es) off an Persian word.
Again, my IDE suggests "use 'a' instead of 'an'".
 * @param len length of input buffer
 * @return new length of input buffer after stemming
 */
public int stemSuffix(char[] s, int len) {
I think this can also be private?
 * @param suffix suffix to check
 * @return true if the suffix matches and can be stemmed
 */
boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
Same here - change the method visibility to private, please.
for (int i = 0; i < suffixes.length; i++)
  if (endsWithCheckLength(s, len, suffixes[i]))
    len = deleteN(s, len - suffixes[i].length, len, suffixes[i].length);
Can you add { and } to these for and if clauses? Omitting them is error-prone.
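As a rough illustration of the request above, the loop could look like the following with braces added. This is a self-contained sketch, not the merged code: the class name BracedSuffixLoop is hypothetical, only two of the suffixes are included, and the deleteN call from the patch is replaced by simply shortening the logical length, which is an assumption that holds here only because the removed characters are always at the end of the buffer.

```java
// Hypothetical, self-contained sketch; not the code that was merged.
final class BracedSuffixLoop {
  // A two-entry subset of the suffix table, for illustration only.
  private static final char[][] SUFFIXES = {
    {'\u0627', '\u062A'}, // ALEF + TEH
    {'\u0647', '\u0627'}, // HEH + ALEF
  };

  /** Returns the new logical length after removing every matching suffix, in table order. */
  static int stemSuffix(char[] s, int len) {
    for (char[] suffix : SUFFIXES) {
      if (endsWithCheckLength(s, len, suffix)) {
        // Simplified stand-in for deleteN: a trailing suffix is dropped
        // by shrinking the logical length.
        len -= suffix.length;
      }
    }
    return len;
  }

  private static boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
    if (len < suffix.length + 2) { // keep at least 2 characters after stemming
      return false;
    }
    for (int i = 0; i < suffix.length; i++) {
      if (s[len - suffix.length + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }
}
```

With explicit braces, adding a second statement to either body later cannot silently fall outside the loop or the condition.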
boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
  if (len < suffix.length + 2) { // all suffixes require at least 2 characters after stemming
    return false;
  } else {
Let's remove this else - it is not needed, and less nesting is better.
for (int i = 0; i < suffix.length; i++) {
  if (s[len - suffix.length + i] != suffix[i]) {
    return false;
  }
You could use Arrays.equals(...)?
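One way to combine the two suggestions above (dropping the else and using Arrays.equals) is sketched below. It assumes Java 9 or later, where java.util.Arrays has a range overload of equals for char arrays; the class name SuffixCheck is hypothetical, and this is not necessarily what was merged.

```java
import java.util.Arrays;

// Hypothetical holder class; only the method body matters here.
final class SuffixCheck {
  /**
   * Returns true if the first len characters of s end with suffix
   * and at least 2 characters would remain after removing it.
   */
  static boolean endsWithCheckLength(char[] s, int len, char[] suffix) {
    if (len < suffix.length + 2) { // all suffixes require at least 2 characters after stemming
      return false;
    }
    // Range overload (Java 9+): compares s[len - suffix.length, len) with the whole suffix.
    return Arrays.equals(s, len - suffix.length, len, suffix, 0, suffix.length);
  }
}
```

The early return removes one level of nesting, and the range comparison replaces the hand-written character loop without copying either array.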
  assertTokenStreamContents(filter, new String[] {"ساهدهات"});
}

private void check(final String input, final String expected) throws IOException {
We have BaseTokenStreamTestCase#checkOneTerm(Analyzer, input, expected). Is it possible to replace this with the built-in check method?
This is how it was done in the TestArabicStemFilter class, which this is based on.
Ah okay, thanks! I think both of them could be replaced with the built-in check method so that they are consistent with the other analyzers' tests. The built-in method also performs a few more consistency checks on the analyzed tokens than the current check() method. I'll look at both of them another time, so it's fine with me for now.
I left some minor comments.
public static final char REH = '\u0631';
public static final char NOON = '\u0646';
public static final char YEH = '\u064A';
public static final char ZWNJ = '\u200c';
This seems inconsistent - it is not a letter, but a Zero-Width Non-Joining character. It seems that the abbreviation for this constant should be more descriptive than the others. Would you agree?
Good catch - do you have any suggestions for the name? I've actually never seen this character before.
Hmm... I discovered that SoraniNormalizer also uses ZWNJ. I guess the name isn't as big of a deal if we are making it private or package-private. But to me, it would be more intelligible to change them both to spell out ZERO_WIDTH_NON_JOINER than to use an unpronounceable constant for this one case.
The non-acronym version is fine with me; the constant is for private use anyway. I don't think the inconsistency with SoraniNormalizer would cause any problems.
Alright, for the sake of consistency, I yield. We should keep it as ZWNJ.
FYI, the feature freeze for the next release will be 10th May, according to the proposal for the 9.2 release on Lucene's dev mailing list.
Sorry for the delay in responding.
I just made a small change to it (050cbf1). Looks great, thank you @raminmjj and @NightOwl888! I'm merging this to main and will backport it to the 9x branch soon.
Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
@mocobeta - Thanks for merging this. Please let me know the Jira issue number(s) for the related work on |
@NightOwl888 I opened https://issues.apache.org/jira/browse/LUCENE-10561 for them. |
* main: LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
* main: (24 commits) LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862) LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (apache#853) LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (apache#859) LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (apache#837) LUCENE-10188: Give SortedSetDocValues a docValueCount() (apache#663) Allow to link to github PR from changes (apache#854) LUCENE-10551: improve testing of LowercaseAsciiCompression (apache#858) LUCENE-10542: FieldSource exists implementations can avoid value retrieval (apache#847) LUCENE-10539: Return a stream of completions from FSTCompletion. (apache#844) gradle 7.3.3 quick upgrade (apache#856) LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (apache#848) ...
Added changes based on apache/lucene#540 and https://issues.apache.org/jira/browse/LUCENE-10312