Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

Open
asfimport opened this issue Apr 15, 2012 · 19 comments

Comments

@asfimport
Copy link

asfimport commented Apr 15, 2012

100% reproduces for me:

2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

Running org.apache.lucene.analysis.core.TestRandomChains
FAILURE 7.22s | TestRandomChains.testRandomChains

Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)

The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.


Migrated from LUCENE-3990 by Steven Rowe (@sarowe), updated May 09 2016
Attachments: analysis-common.tests-report.txt, LUCENE-3990.patch, LUCENE-3990-CharFilterFix.patch
Linked issues:

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

Definitely something sneaky going on...

first i found a minor bug in the test. the logic for determining if we should apply
additional offsets checks doesn't look at the return value, so if we ever try
a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it)
then we lose offsets checks for the eventually-working-chain.

seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

The problem is that we are getting different results the first time we create the tokenstream components,
versus after we reset(Reader) with the same text again.

The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter
instead. because of crazy CharFilter-Reader delegation.

http://svn.apache.org/viewvc?view=revision&revision=1311358

Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change.
Seed now passes (in addition to the test.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Robert,

if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

We have to fix the underlying bug. Subclassing CharFilter here is correct.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Robert,

I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char[1] every time.

The attached patch fixes this to delegate correctly.

@asfimport
Copy link
Author

Dawid Weiss (@dweiss) (migrated from JIRA)

I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter).

A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed trunk revision: 1326512

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

I reverted the CharFilter changes in revision: 1326515

We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

We should fix CharFilters in trunk to not have the horrible delegating.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

I investigated why the original commit lead to the bug:
The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain:

  • With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream.
  • With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char[1], call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter).

I'll prepare a patch with the test method in BaseTokenStreamTestCase.

@asfimport
Copy link
Author

asfimport commented Apr 16, 2012

Robert Muir (@rmuir) (migrated from JIRA)

Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here.

If we want to fix CharFilter/CharStream i think we should use #3862.

That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that
randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)...

the other methods should be tested too.

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

rmuir20120906-bulk-40-change

@asfimport
Copy link
Author

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants