-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063
Comments
Robert Muir (@rmuir) (migrated from JIRA) Definitely something sneaky going on... first i found a minor bug in the test. the logic for determining if we should apply seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem) |
Robert Muir (@rmuir) (migrated from JIRA) The problem is that we are getting different results the first time we create the tokenstream components, The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter http://svn.apache.org/viewvc?view=revision&revision=1311358 Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change. |
Uwe Schindler (@uschindler) (migrated from JIRA) Hi Robert, if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry. |
Uwe Schindler (@uschindler) (migrated from JIRA) We have to fix the underlying bug. Subclassing CharFilter here is correct. |
Uwe Schindler (@uschindler) (migrated from JIRA) Hi Robert, I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char[1] every time. The attached patch fixes this to delegate correctly. |
Dawid Weiss (@dweiss) (migrated from JIRA) I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter). A good way to make sure it does get proper implementation is to make read() abstract again in CharStream... |
Uwe Schindler (@uschindler) (migrated from JIRA) Committed trunk revision: 1326512 |
Uwe Schindler (@uschindler) (migrated from JIRA) Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract! |
Uwe Schindler (@uschindler) (migrated from JIRA) I reverted the CharFilter changes in revision: 1326515 We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int) |
Uwe Schindler (@uschindler) (migrated from JIRA) We should fix CharFilters in trunk to not have the horrible delegating. |
Uwe Schindler (@uschindler) (migrated from JIRA) Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug! |
Uwe Schindler (@uschindler) (migrated from JIRA) I investigated why the original commit lead to the bug:
|
Uwe Schindler (@uschindler) (migrated from JIRA) We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow). I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter). I'll prepare a patch with the test method in BaseTokenStreamTestCase. |
Robert Muir (@rmuir) (migrated from JIRA) Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here. If we want to fix CharFilter/CharStream i think we should use #3862. That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class... |
Robert Muir (@rmuir) (migrated from JIRA)
We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that the other methods should be tested too. |
Chris M. Hostetter (@hossman) (migrated from JIRA) bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment |
Robert Muir (@rmuir) (migrated from JIRA) rmuir20120906-bulk-40-change |
Steven Rowe (@sarowe) (migrated from JIRA) Bulk move 4.4 issues to 4.5 and 5.0 |
Uwe Schindler (@uschindler) (migrated from JIRA) Move issue to Lucene 4.9. |
100% reproduces for me:
2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"
Running org.apache.lucene.analysis.core.TestRandomChains
FAILURE 7.22s | TestRandomChains.testRandomChains
The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.
Migrated from LUCENE-3990 by Steven Rowe (@sarowe), updated May 09 2016
Attachments: analysis-common.tests-report.txt, LUCENE-3990.patch, LUCENE-3990-CharFilterFix.patch
Linked issues:
The text was updated successfully, but these errors were encountered: