Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create IndexWriterConfiguration and store all of IW configuration there [LUCENE-2294] #3370

Closed
asfimport opened this issue Mar 4, 2010 · 57 comments

Comments

@asfimport
Copy link

asfimport commented Mar 4, 2010

I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:

  • All of ctors parameters, except for Directory.
  • The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...

I plan to deprecate all the ctors and getters/setters and replace them by:

  • One ctor as described above
  • getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
  • About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
    • The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.

I'll start to work on a patch.


Migrated from LUCENE-2294 by Shai Erera (@shaie), resolved Mar 13 2010
Attachments: check.py, LUCENE-2294.patch (versions: 6)
Linked issues:

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailinglist: http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Thanks Uwe for the pointer.

I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is public and therefore needs to be deprecated. Otherwise it will look strange that in order to set MFL on IWC you need to reference IW. So deprecate and duplicate? To IW it doesn't matter because it just takes the limit (int) from MFL ...

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

In my opinion the whole class is unneeded, so only deprecate in IW but not add it to IWConfig. For me a constant in IWConfig would be enough that defines UNLIMITED and everything else is just an integer. +1 for defaulting to "static final UNLIMITED=Integer.MAX_VALUE". I am not sure why this limitation is there at all. In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens. If somebody gets an OOM, he can adjust the value and knows that mayabe some tokens get lost.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Yeah I don't like it either (makes my code unnecessarily long). And I always use UNLIMITED, and the LIMITED=10,000 is really just a guess, and so if anyone wants to limit it, he needs to do new MaxFieldLength(otherLimit) which is unnecessarily long as well ...

I like it - I'll deprecate on IW and introduce UNLIMITED on IWC.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if it is false, don't attempt to create, but just open an existing one. However if the directory is empty, it throws an exception. I think an enum would someone to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... but I don't want to complicate things unnecessarily ... what do you think?

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

IndexingChain is one of the things that can be set on IW, however I don't see any implementations of it besides the default, and the class itself is package-private, so no app could actually set it on IW (unless it puts its code under o.a.l.index). Therefore I'm thinking of not introducing it on IWC, or turn it to a public class?
Is it really something we expect any application out there to set, or can we simply make DocsWriter impl one for itself internally, and don't declare this class as abstract etc.?

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards. I don't see any reason why would anyone change the RAM limit, Similarity etc while IW is running. What's the advantage vs. say close the current IW and open a new one with the different settings? I know the latter is more expensive, and I write it deliberately - I think those settings are really ctor-only settings. Otherwise you might get inconsistent documents (like changing the Similarity or max field length).

This will also simplify IWC, because now I need to distinguish between settings that cannot be altered afterwards, like changing IndexDeletionPolicy, create, IndexCommit, Analyzer ... if IWC will be a ctor only object, I can have only the default ctor (to init to default settings) and provide the setters otherwise.

Any objections?

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 – this is great!

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength).

This is because it's a dangerous setting (you silently lose content
while indexing), a trap. So we want to force the user to make the
choice, up front, so they realize the implications.

But, if we change the default to UNLIMITED (which we should do under
Version), then I agree you should not have to specify it.

In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens

I like that approach – we could make a TokenFilter to do this? Then
we don't need MFL at all in IWC (and deprecate in IW).

I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND

+1

I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards.

+1 in general, though we should go setting by setting to confirm this is OK. I
don't know of "real" use cases where apps eg want to change RAM buffer
or mergeFactor... but maybe there are some interesting usages out
there.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

But, if we change the default to UNLIMITED

Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ...

we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?

@asfimport
Copy link
Author

Mark Miller (@markrmiller) (migrated from JIRA)

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats required to set on the config class - without the same cstr business ...

edit

Though I suppose the chaining does makes this more swallowable...

new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

I wouldn't worry about what's required - Directory will be left out, MFL is useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on IWC's ctor, but I personally think we can default to a simple one (such as Whitespace) encouraging the people to set their own. I find it very annoying today when I want to test something about IW and I need to pass all these things to IW ...

The way I see it, those who want to rely on Lucene's latest and greatest can just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well maybe except for the Analyzer, but I really don't think it matters that much. And like you wrote, someone can chain the setters. So win-win? If you don't care about anything, just wants to open a writer, index something and that's it, you don't need to specify anything .. otherwise you just chain calls?

One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ..

Ahh sorry right. In the "olden days", 10000 was the default.

we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?).

Hmm yeah quite a hassle to fix all analyzers. Hmmm.

I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?

Maybe one solution is to wrap any other analyzer? Ie, create a StopAfterNTokensAnalyzer, taking another analyzer that it delegates to, and then sticking on this StopAfterNTokensFilter to each token stream.

But yeah maybe break this out as a separate issue...

Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

Actually it ought to be 0 terms wasted, with the filter @ the end – with this StopAfterNTokensFilter, it'll immediately return false w/o asking for the 10001th token.

One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.

+1

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

There are some settings on IW that go directly to the MergePolicy used (well ... only if it's LogMergePolicy). At first I wanted to move them, together w/ MP, to IWC, but this isn't possible, because MP requires IW to be passed to its ctor, and when IWC is created, there is no IW ... So there are couple of options I can think of:

  • Add all to IWC, and expose on MP a final setIndexWriter, package-private, as well as an empty ctor. IW will later set MP's IW. It will also allow applications to pass their own MP after IW has been created (like they can do today). IWC will delegate all set/get to MP, like IW does today. The only downside is that the IW member of MP won't be final anymore ...

  • Keep the get/set on IW, like it is today. It will split the configuration of IW to two places ... I prefer to keep it all in one place.

  • Remove all the set/get from IW and let applications interface directly with MP. It will remove the bizarre situation where someone could pass a non LogMergePolicy to IW, however when he'll try to set anything on it from IW, it will fail ... If we remove the get/set from IW and let folks interface w/ their MP directly, we won't have this problem:

    • They'll need to declare the MP of type LogMP, or otherwise the API won't be there.
    • if they use their own MP, with special API, they'll interface w/ MP's configuration from one place ...

Between the three, I like (3) the best as it will get rid of the inconsistency in IW today (expecting only LogMP, but requiring MP). But if those set/get are important to keep together with the other configuration, I prefer (1). What do you think?

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

I voted for killing these delegating methods some time ago. It ended in nothing, so I vote again, #3 :)

@asfimport
Copy link
Author

Yonik Seeley (@yonik) (migrated from JIRA)

Yay, we'll be able to remove SolrIndexConfig and use this :-)

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Ok, then I'll proceed w/ #3.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

So this will affect these methods in IW, right:

get/setMergeFactor(...)
get/setUseCompoundFile(...)
get/setMaxMergeDocs(...)

Ie, we'd deprecate them. Instead you'd have to interact /w the MergePolicy. IW will still default its merge policy, so with this change, instead of:

writer.setMergeFactor(7);

You'd have to do:

((LogMergePolicy) writer.getMergePolicy()).setMergeFactor(7));

(And take the risk of runtime cast yourself/explicitly, which IW was already doing under the hood anyway).

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Exactly !

Only I hope that if people will interact w/ their MP to set stuff on it, they would interact with it to query its settings. Calling IW.getMergePolicy will become redundant, or at least will be avoidable. Maybe only when you rely on the default MP, you'll perform the last code example you gave above. Actually, I've already done 98% of the change. I need to finish some stuff on IW and then I'll post the patch.

On the MFL/Analyzer - I like the approach of wrapping an Analyzer with a MFLAnalyzer. The beauty is that if I don't want to limit anything, I don't need to do anything and the code won't need to keep track of how many tokens were indexed for that field ... I still think it should be done in a separate issue, as it involves introducing some code in the lower levels to wrap any Analyzer with it, for back-compat. If we do it, let's do it before 3.1 is out, so we can remove it from IWC right away w/o deprecating anything ...

I'll open an issue for it.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Patch includes:

  • IndexWriterConfig + Test
  • Changes to IndexWriter to use IWC
  • Changes to DocumentsWriter - removed IndexingChain from its ctor and always use default.

IndexingChain is package-private and so was the ctor on IndexWriter which allowed to set it. If that's important (e.g. used by SOLR maybe?), then I can add a package-private setting to IndexWriterConfig which will allow setting this. But since I haven't found any calls in Lucene (tests nor java), and no implementations of this class besides the default that is in DW, I thought this can be removed.

I would like to have this patched reviewed, before I go about changing all the code to use the new IWC (tests + java). If we agree on the details and how it looks and used, I'll do it - should be straightforward.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Ha, I see TODO 4.0 comments now. Sheesh ;)

Patch looks great Shai!

A few comments:

  • Hmm... I think we should still allow package private specification
    of the indexing chain? Not long ago Michael B. explicitly opened
    this up (while keeping it package private). I think for
    way-expert users we should keep it extensible?

  • For IWC.setAnalyzer(null), instead of silently swapping in
    WhitespaceAnalyzer, can we throw IllegalArgE?

  • Should we disallow all setters on IWC after it's been consumed by
    an IW? (Though one may want to re-use to make another IW).
    Or... make it very clear that IW fully consumes the IWC on
    creation and never again looks at it...? (OK, I do see this in
    IW.getConfig and IWC). Should IW just clone the IWC and reference
    it internally, instead of copying to private fields? Or... IW
    marks the IWC readOnly (and setters on it then fail), and you must
    clone it if you will re-use?

  • In IWC we call it "scheduler" but seems like we should keep the
    name "mergeScheduler"? (And in toString, too).

  • Why can't MergePolicy also live in IWC? Seems odd to move
    MergeScheduler to it but not MP. (And deprecate set/getMP in IW)

  • Nit-picky, but, IW.messageState will have a space before each of
    its entries but then IWC's entries have no space – looks not so
    nice on enabling infoStream?

  • Need small jdoc for the new IW ctor

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Hmm... I think we should still allow package private specification of the indexing chain?

Ok, I'll add it back.

For IWC.setAnalyzer(null)

I preferred not to (it's not just setAnalyzer, but also MS, Similarity and IndexDeletionPolicy). I specifically documented on each what happens if one passes null. I think it's better service - you're not expected to pass null, and so if you do, instead of throwing an exception, we revert to default. I thought at some point to add a restoreDefaults() method, but then realized that doing new IWC() is not that expensive ...
I think - if someone uses IWC but receives any of those settings from the outside, instead of asking him to always check for null, we tell him "pass null, we revert to default" ...

In IWC we call it "scheduler"

Woops, fixed that !

Why can't MergePolicy also live in IWC?

I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM).

IW.messageState will have a space before each of its entries

IWC.toString() includes '\n' between settings. I thought it'd be more readable that way because otherwise it'd be a long line. The output for me looks like that:

IW 0 [main]: dir=org.apache.lucene.store.RAMDirectory@2ca22ca2 mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@6ca06ca index= version=3.1-dev config=matchVersion=LUCENE_31
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE_OR_APPEND
maxFieldLength=2147483647
similarity=org.apache.lucene.search.DefaultSimilarity
termIndexInterval=128
mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler
default WRITE_LOCK_TIMEOUT=1000
writeLockTimeout=1000
maxBufferedDeleteTerms=-1
ramBufferSizeMB=16.0
maxBufferedDocs=-1

Perhaps I'll print "config=\n" so that all config parameters start on the new line, and not all but the first. Is that acceptable, or you still prefer all to be on one line, space separated?

Need small jdoc for the new IW ctor

I'll add that. That one slipped :).

Should we disallow all setters on IWC after it's been consumed by an IW?

I've thought about all the options that you raise, and decided to keep the situation as-is, to let others also comment on that. So I'm glad you commented :).

  • I've documented on IW.getConfig() that setting anything on the returned object has no effect on IW instance, and if one needs to do it, one should re-instantiate IW.
  • I also thought to turn off setters (setting IWC to read-only by IW), but then someone won't be able to reuse an IWC
  • Removing the fields from IW is not good, because then someone could really call getConfig().setMaxBufferedDocs and that will affect IW, unlike what the comment says. If we want to really keep everything in IWC, we need to clone on ctor and getConfig() time. Seems a waste to me.

I think I'll just clone the incoming IWC on IW and leave the rest as-is?

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Patch with applied comments. I'll start moving the code to use the new ctor, class etc.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

OK let's keep the semantics of "if you pass null that means restore to default".

Should we rename UNLIMITED_FIELD_LENGTH -> DEFAULT_MAX_FIELD_LENGTH? (Though w/ the new issue this should be going away from IWC anyway).

IWC.toString() includes '\n' between settings

Ahh, I misread it (I thought IW was doing \n followed by single space). I agree multiple lines is much better. How about making IW also do multiple lines for its non-IWC settings that are messaged? And then removing the config= from the output? Ie all I see is a bunch of settings. I don't think [in this debug printout] that we need to message which setting was on IW and which was from the IWC.

I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM).

Ahh, sorry, right. Annoying we moved IW as required into MP :( Hrmph.

I think I'll just clone the incoming IWC on IW and leave the rest as-is?

OK, I think this is fine, for starters. The API semantics are clearly defined, so, if at a later date we harden the enforcement (say throw IllegalStateEx if you try to change anything after IWC has been consumed by IW, that's fair game).

Maybe clearly state in the jdocs that presently no settings are version dependent, so it's just a placeholder? (Basically, elevate the code comment that already says this, into the jdocs). Whenever a class in Lucene takes a Version, we should strongly document how Version alters settings, even if it's a no-op placeholder. (And I agree we should have it here as placeholder).

Patch looks good Shai, thanks!

@asfimport
Copy link
Author

asfimport commented Mar 6, 2010

Shai Erera (@shaie) (migrated from JIRA)

UNLIMITED_FIELD_LENGTH

I deliberately named it like that. There is currently IW.DEFAULT_MAX_FIELD_LENGTH which is set to 10,000. I didn't want to keep the same name so as to not confuse. The default, UNLIMITED, is documented on setMaxFieldLength.

But like you said, it may go entirely away in #3371.

Annoying we moved IW as required into MP

I think I'm to blame for this :).

How about making IW also do multiple lines for its non-IWC setting

Done.

Thanks for the comments. Continuing with converting the code. I'm flying to the US tonight, so I may not finish it before I leave. And then I'll be gone for ~30 hours before I can ping again :).

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

While I'm converting the tests to use the new IWC, and getting rid of deprecated methods, I think it'll be useful to add a MergePolicyConfig. But since most of the setters used today assume a LogMergePolicy, and since it's also unrelated to IW, I prefer to leave it for a separate issue. Doing that, w/ the methods chaining approach, will simplify usage even more.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Patch with all tests and code converted to not use the deprecated API, and move to use IWC. All except CreateIndexTask in benchmark, which looks a bit more complicated to change, I left it out for now.

Another thing - I deprecated *QueryParserWrapperTest and did not convert them to use the new IWC API, as those tests will be removed when *QueryParserWrapper will be removed.

All tests pass.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good – thanks Shai!

I'll commit in a day or two.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Phew! Thanks Shai...

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Robert raised a concern on java-dev: maybe we should not have a default analyzer....? Ie force the user to make an explicit choice.

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

Here's my elaboration a bit:

I like the concept of good, versioned defaults, but i think this is just a decision point that
a user should have to make, like it is now.

A practical concern that I have: Whitespace really sucks, we will get lots of questions
on java-user because "lucene doesnt search upper/lowercase by default". This is because
beginner users will only do what work is necessary for their application to compile.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I'll revert the commit while we discuss...

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

IW defaults to all other bunch of settings, so why is Analyzer different?

Just my opinion, but what makes it different is that breaking text into tokens
and indexing them is what search is all about.

Deletions is an optional feature, and if someone got rid of it alltogether, I wouldn't
even comment on the issue, I could care less.

@asfimport
Copy link
Author

Mark Miller (@markrmiller) (migrated from JIRA)

Question - does SOLR requires everyone to specify an Analyzer, or does it come w/ a default one?

Hmm... SOLR doesn't really use Lucene analyzers.

It comes with a default Schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most
people build off the example, you could say that it does have defaults for example FieldTypes and defaults of what field names map to those. But it also only accepts certain example fields with the example Schema - you really
have to go in and customize it to your needs - its setup to basically show off what options are available and work with some demo stuff.

Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it.

example of Solr analyzer declaration:

    <!-- A general unstemmed text field - good if one does not know the language of the field -->
    <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Really, eventually, I'd like a stronger separation of analysis and
indexing, so that IndexWriter never even gets an analyzer.

Ie, the fields in the doc that need to be indexed should only present
an attr source, and IW pulls from that.

IW is then fully agnostic to how that attr source was created – one
need not even use analyzers/tokenizer/tokenfilter approach
at all (create something custom).

IW now has a lot of embedded logic to figure out how to get the attr
source – not analyzed fields, analyzed but it's a Reader vs a String
vs pre-analyzed, etc.

But until then I agree it's dangerous to default the analyzer – user
should be explicit. It's similar to how we promoted field truncation
(max field length) to be an explicit choice, because it's a dangerous
default (because it silently alters what makes it into your index).

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

what makes it different is that breaking text into tokens and indexing them is what search is all about.

I agree, and Whitespace is easy to explain and understand. Who said that Standard or Simple is better, for example (I know you don't say it)? I just don't understand why we should force everyone to select an Analyzer even when they don't want to, or care.

Remember that all the people out there today already set an Analyzer, so when they port to the new API they'll continue setting it (unless they didn't care but were forced to do it). New users ... well I hope they read jdocs.

It's good that SOLR doesn't force people to specify the Analyzer ... smart decision, for play/example purposes.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Mike - regarding getting rid of Analyzer on IW. Today it's very convenient that I can specify my default Analyzer for all ANALYZED fields, for all documents. I don't understand how a Field can introduce an Analyzer - do I now set an Analyzer on every field I add to a Document? Wouldn't that be extremely inconvenient?

If you want to get rid of Analyzer on IW I'm fine. We can throw away the addDocument(Doc) method and always force a user to pass an Analyzer. Much better than requiring him to pass it on every field he adds to a document ...

@asfimport
Copy link
Author

Mark Miller (@markrmiller) (migrated from JIRA)

I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

@asfimport
Copy link
Author

asfimport commented Mar 10, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

I opened #3385 for decoupling IW from analysis. IW would only need to know about attr source.

@asfimport
Copy link
Author

asfimport commented Mar 10, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

If we did #3385 then IWC wouldn't need an Analyzer, default or otherwise ;)

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream? What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)?

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream?

Well, where all of the fields have pre-defined TokenStream, or, can produce it on demand with no other args (ie that field instance knows its analyzer).

What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)?

Well it'd simplify your work here, for one :)

But it'd also put distance between IW and analysis, which I think is useful for others to use Lucene with their own "means" of producing tokens. Tying IW to less concrete impls also increases our independence which makes cross-version compatibility easier, over time. The more independent the various parts of Lucene are the more we can factor things out...

@asfimport
Copy link
Author

asfimport commented Mar 10, 2010

Shai Erera (@shaie) (migrated from JIRA)

Well it'd simplify your work here, for one

I'm afraid that's too late no? I've already completed that work ... it's that late realization that forces me to re-do it again ...

I don't mind if IW does not know about Analyzers, I'm perfectly fine w/ it (and even like it).

Suggestion: since re-doing all that work, meaning adding Whitespace to the instantiation in all the places that really don't care, is very expensive and time consuming on my part, if #3385 is resolved before 3.1, can we keep IWC as-is, and as part of 2309 remove setAnalyzer from IWC? It will keep the ctor's signature as it is now, and we won't need to change all of these classes again in 2309. Only those that explicitly call setAnalyzer.

Is that acceptable?

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Patch w/ IWC requiring an Analyzer in its ctor. All tests pass

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Shai! I'll look through it...

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Shai – that was fast adding back in the analyzers :)

A few things:

  • Should we also move MergedSegmentWarmer to IWC?

  • This adds more Version.LUCENE_CURRENTs in contrib... but I don't
    see any way around that (and it's good to add the placeholder to
    IWC).

  • There were some tests that previously passed null as analyzers,
    that you changed to WhitespaceAnalyzer, which seems OK.

  • I see you cleaned some stuff up in the process (fixed "extends
    TestCase" -> extends "LuceneTestCase", use TEST_VERSION_CURRENT
    consistently instead using class member, added Version to ctor for
    analyzers that now require them, noting that test is deprecated
    since it only tests deprecated class, fixed places where tests
    passed null as analyzer) – good!

  • There are a number of tests where the analyzers were changed, eg
    Simple > Whitespace or Standard -> Whitespace - I attached
    check.py (run it w/ 1 arg which is path to the patch file). Note
    that not all things it finds are actually changed (ie it has some
    false pos's, but I think not too many). Tests still pass... but
    it makes me a bit nervous. Shouldn't we just keep the same
    analyzer?

@asfimport
Copy link
Author

asfimport commented Mar 11, 2010

Shai Erera (@shaie) (migrated from JIRA)

Should we also move MergedSegmentWarmer to IWC?

I guess it could ... wonder how I missed it ...

This adds more Version.LUCENE_CURRENTs in contrib...

I added it on purpose, so we know to convert these places when we remove LUCENE_CURRENT. We can also do this gradually in #3381.

There were some tests that previously passed null as analyzers

Those were exactly the cases I wanted to move Analyzer out of the mandatory list. Whatever you pass will work for those tests, because they obviously don't rely on analysis ... otherwise they'd had been hit by NPE.
Hmm ... I wonder if I can still pass 'null' for them. I used eclipse search/replace tool to add the Analyzer to IWC calls. The null argument disappeared completely because they used the default, so the search/replace tool replaced that w/ Whitespace. I prefer to keep it that way, to avoid re-changing them again.

There are a number of tests where the analyzers were changed ...

I thought that that's ok, because those tests don't rely on the output of analysis, or otherwise they should have tested/asserted it. Since they don't, it shouldn't matter to them. But I'll revert that change.

So two things I need to do:

  1. revert the changes to those tests
  2. add warmer to

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Patch updated w/:

  1. Move setMergedSegmentWarmer to IWC.
  2. Revert the changes reported by check.py

Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Shai, I'll look...

Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

Yes if they are indeed false positives...

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

Hmm some of these (at least TestAtomicUpdate was changed from Simple -> Whitespace) were in fact real changes.... I'll fix & post a new patch.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached new patch, just fixing a couple tests where analyzer had changed.

I it's ready to commit (take 2)! I'll wait a day or two...

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Thanks Mike. I ran the tool once, fix all that it complained. Then 2nd time it found some more (probably some I missed in the 1st pass), only this time really few more. So I fixed them as well. But I didn't run it 3rd time :) ...

I can't wait for this to be in ... an exhausting issue ;).

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

I can't wait for this to be in ... an exhausting issue

Shai, thanks for taking the time to redo this massive patch. I'm sorry
again I dropped the ball and didn't notice till the commit, forcing you
to redo a lot of work.

+1

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

Thanks a lot Robert for reviewing this. No harm done ... I've had the chance to exercise some of Eclipse tricks in the process or re-doing. Unfortunately it introduced some changes, but luckily we have Mike and his mighty-python-scripting-ability to protect us :).

Now I have a bunch of other issues I need to open, that were waiting for this guy to go in. Stay tuned

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Take 2! Thanks Shai.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants