TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

asfimport · 2012-04-15T17:24:03Z

100% reproduces for me:

2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

Running org.apache.lucene.analysis.core.TestRandomChains
FAILURE 7.22s | TestRandomChains.testRandomChains

Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)

The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.

Migrated from LUCENE-3990 by Steven Rowe (@sarowe), updated May 09 2016
Attachments: analysis-common.tests-report.txt, LUCENE-3990.patch, LUCENE-3990-CharFilterFix.patch
Linked issues:

more thorough testing of analysis chains [LUCENE-3919] #4992

asfimport · 2012-04-16T01:31:29Z

Robert Muir (@rmuir) (migrated from JIRA)

Definitely something sneaky going on...

first i found a minor bug in the test. the logic for determining if we should apply
additional offsets checks doesn't look at the return value, so if we ever try
a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it)
then we lose offsets checks for the eventually-working-chain.

seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)

asfimport · 2012-04-16T02:56:04Z

Robert Muir (@rmuir) (migrated from JIRA)

The problem is that we are getting different results the first time we create the tokenstream components,
versus after we reset(Reader) with the same text again.

The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter
instead. because of crazy CharFilter-Reader delegation.

http://svn.apache.org/viewvc?view=revision&revision=1311358

Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change.
Seed now passes (in addition to the test.

asfimport · 2012-04-16T05:54:58Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Robert,

if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.

asfimport · 2012-04-16T05:57:27Z

Uwe Schindler (@uschindler) (migrated from JIRA)

We have to fix the underlying bug. Subclassing CharFilter here is correct.

asfimport · 2012-04-16T06:45:52Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Robert,

I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char[1] every time.

The attached patch fixes this to delegate correctly.

asfimport · 2012-04-16T07:02:02Z

Dawid Weiss (@dweiss) (migrated from JIRA)

I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter).

A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...

asfimport · 2012-04-16T07:03:01Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed trunk revision: 1326512

asfimport · 2012-04-16T07:08:45Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!

asfimport · 2012-04-16T07:20:45Z

Uwe Schindler (@uschindler) (migrated from JIRA)

I reverted the CharFilter changes in revision: 1326515

We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)

asfimport · 2012-04-16T07:21:16Z

Uwe Schindler (@uschindler) (migrated from JIRA)

We should fix CharFilters in trunk to not have the horrible delegating.

asfimport · 2012-04-16T07:24:16Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!

asfimport · 2012-04-16T08:07:55Z

Uwe Schindler (@uschindler) (migrated from JIRA)

I investigated why the original commit lead to the bug:
The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain:

With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream.
With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char[1], call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.

asfimport · 2012-04-16T08:35:13Z

Uwe Schindler (@uschindler) (migrated from JIRA)

We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter).

I'll prepare a patch with the test method in BaseTokenStreamTestCase.

asfimport · 2012-04-16T10:52:54Z

Robert Muir (@rmuir) (migrated from JIRA)

Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here.

If we want to fix CharFilter/CharStream i think we should use #3862.

That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...

asfimport · 2012-04-16T10:54:51Z

Robert Muir (@rmuir) (migrated from JIRA)

We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that
randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)...

the other methods should be tested too.

asfimport · 2012-07-11T23:03:42Z

Chris M. Hostetter (@hossman) (migrated from JIRA)

bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

asfimport · 2012-08-07T03:41:24Z

Robert Muir (@rmuir) (migrated from JIRA)

rmuir20120906-bulk-40-change

asfimport · 2013-07-23T18:44:48Z

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport · 2014-04-16T12:54:48Z

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport assigned uschindler Aug 23, 2022

asfimport mentioned this issue Aug 24, 2022

more thorough testing of analysis chains [LUCENE-3919] #4992

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

asfimport commented Apr 15, 2012 •

edited

Loading

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012 •

edited

Loading

asfimport commented Apr 16, 2012

asfimport commented Jul 11, 2012

asfimport commented Aug 7, 2012

asfimport commented Jul 23, 2013

asfimport commented Apr 16, 2014

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API [LUCENE-3990] #5063

Comments

asfimport commented Apr 15, 2012 • edited Loading

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012

asfimport commented Apr 16, 2012 • edited Loading

asfimport commented Apr 16, 2012

asfimport commented Jul 11, 2012

asfimport commented Aug 7, 2012

asfimport commented Jul 23, 2013

asfimport commented Apr 16, 2014

asfimport commented Apr 15, 2012 •

edited

Loading

asfimport commented Apr 16, 2012 •

edited

Loading