-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create IndexWriterConfiguration and store all of IW configuration there [LUCENE-2294] #3370
Comments
Uwe Schindler (@uschindler) (migrated from JIRA) +1 for the IndexWriterConfig with chaining method calls We had a discussion about this a while ago on the mailinglist: http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods |
Shai Erera (@shaie) (migrated from JIRA) Thanks Uwe for the pointer. I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is public and therefore needs to be deprecated. Otherwise it will look strange that in order to set MFL on IWC you need to reference IW. So deprecate and duplicate? To IW it doesn't matter because it just takes the limit (int) from MFL ... |
Uwe Schindler (@uschindler) (migrated from JIRA) In my opinion the whole class is unneeded, so only deprecate in IW but not add it to IWConfig. For me a constant in IWConfig would be enough that defines UNLIMITED and everything else is just an integer. +1 for defaulting to "static final UNLIMITED=Integer.MAX_VALUE". I am not sure why this limitation is there at all. In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens. If somebody gets an OOM, he can adjust the value and knows that mayabe some tokens get lost. |
Shai Erera (@shaie) (migrated from JIRA) Yeah I don't like it either (makes my code unnecessarily long). And I always use UNLIMITED, and the LIMITED=10,000 is really just a guess, and so if anyone wants to limit it, he needs to do new MaxFieldLength(otherLimit) which is unnecessarily long as well ... I like it - I'll deprecate on IW and introduce UNLIMITED on IWC. |
Shai Erera (@shaie) (migrated from JIRA) I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if it is false, don't attempt to create, but just open an existing one. However if the directory is empty, it throws an exception. I think an enum would someone to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... but I don't want to complicate things unnecessarily ... what do you think? |
Shai Erera (@shaie) (migrated from JIRA) IndexingChain is one of the things that can be set on IW, however I don't see any implementations of it besides the default, and the class itself is package-private, so no app could actually set it on IW (unless it puts its code under o.a.l.index). Therefore I'm thinking of not introducing it on IWC, or turn it to a public class? |
Shai Erera (@shaie) (migrated from JIRA) I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards. I don't see any reason why would anyone change the RAM limit, Similarity etc while IW is running. What's the advantage vs. say close the current IW and open a new one with the different settings? I know the latter is more expensive, and I write it deliberately - I think those settings are really ctor-only settings. Otherwise you might get inconsistent documents (like changing the Similarity or max field length). This will also simplify IWC, because now I need to distinguish between settings that cannot be altered afterwards, like changing IndexDeletionPolicy, create, IndexCommit, Analyzer ... if IWC will be a ctor only object, I can have only the default ctor (to init to default settings) and provide the setters otherwise. Any objections? |
Michael McCandless (@mikemccand) (migrated from JIRA) +1 – this is great!
This is because it's a dangerous setting (you silently lose content But, if we change the default to UNLIMITED (which we should do under
I like that approach – we could make a TokenFilter to do this? Then
+1
+1 in general, though we should go setting by setting to confirm this is OK. I |
Shai Erera (@shaie) (migrated from JIRA)
Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ...
I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ... I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter? |
Mark Miller (@markrmiller) (migrated from JIRA) I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats required to set on the config class - without the same cstr business ... edit Though I suppose the chaining does makes this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ... |
Shai Erera (@shaie) (migrated from JIRA) I wouldn't worry about what's required - Directory will be left out, MFL is useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on IWC's ctor, but I personally think we can default to a simple one (such as Whitespace) encouraging the people to set their own. I find it very annoying today when I want to test something about IW and I need to pass all these things to IW ... The way I see it, those who want to rely on Lucene's latest and greatest can just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well maybe except for the Analyzer, but I really don't think it matters that much. And like you wrote, someone can chain the setters. So win-win? If you don't care about anything, just wants to open a writer, index something and that's it, you don't need to specify anything .. otherwise you just chain calls? One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Ahh sorry right. In the "olden days", 10000 was the default.
Hmm yeah quite a hassle to fix all analyzers. Hmmm.
Maybe one solution is to wrap any other analyzer? Ie, create a StopAfterNTokensAnalyzer, taking another analyzer that it delegates to, and then sticking on this StopAfterNTokensFilter to each token stream. But yeah maybe break this out as a separate issue...
Actually it ought to be 0 terms wasted, with the filter
+1 |
Shai Erera (@shaie) (migrated from JIRA) There are some settings on IW that go directly to the MergePolicy used (well ... only if it's LogMergePolicy). At first I wanted to move them, together w/ MP, to IWC, but this isn't possible, because MP requires IW to be passed to its ctor, and when IWC is created, there is no IW ... So there are couple of options I can think of:
Between the three, I like (3) the best as it will get rid of the inconsistency in IW today (expecting only LogMP, but requiring MP). But if those set/get are important to keep together with the other configuration, I prefer (1). What do you think? |
Earwin Burrfoot (migrated from JIRA) I voted for killing these delegating methods some time ago. It ended in nothing, so I vote again, |
Yonik Seeley (@yonik) (migrated from JIRA) Yay, we'll be able to remove SolrIndexConfig and use this :-) |
Shai Erera (@shaie) (migrated from JIRA) Ok, then I'll proceed w/ #3. |
Michael McCandless (@mikemccand) (migrated from JIRA) So this will affect these methods in IW, right: get/setMergeFactor(...)
get/setUseCompoundFile(...)
get/setMaxMergeDocs(...) Ie, we'd deprecate them. Instead you'd have to interact /w the MergePolicy. IW will still default its merge policy, so with this change, instead of: writer.setMergeFactor(7); You'd have to do: ((LogMergePolicy) writer.getMergePolicy()).setMergeFactor(7)); (And take the risk of runtime cast yourself/explicitly, which IW was already doing under the hood anyway). |
Shai Erera (@shaie) (migrated from JIRA) Exactly ! Only I hope that if people will interact w/ their MP to set stuff on it, they would interact with it to query its settings. Calling IW.getMergePolicy will become redundant, or at least will be avoidable. Maybe only when you rely on the default MP, you'll perform the last code example you gave above. Actually, I've already done 98% of the change. I need to finish some stuff on IW and then I'll post the patch. On the MFL/Analyzer - I like the approach of wrapping an Analyzer with a MFLAnalyzer. The beauty is that if I don't want to limit anything, I don't need to do anything and the code won't need to keep track of how many tokens were indexed for that field ... I still think it should be done in a separate issue, as it involves introducing some code in the lower levels to wrap any Analyzer with it, for back-compat. If we do it, let's do it before 3.1 is out, so we can remove it from IWC right away w/o deprecating anything ... I'll open an issue for it. |
Shai Erera (@shaie) (migrated from JIRA) Patch includes:
IndexingChain is package-private and so was the ctor on IndexWriter which allowed to set it. If that's important (e.g. used by SOLR maybe?), then I can add a package-private setting to IndexWriterConfig which will allow setting this. But since I haven't found any calls in Lucene (tests nor java), and no implementations of this class besides the default that is in DW, I thought this can be removed. I would like to have this patched reviewed, before I go about changing all the code to use the new IWC (tests + java). If we agree on the details and how it looks and used, I'll do it - should be straightforward. |
Michael McCandless (@mikemccand) (migrated from JIRA) Ha, I see TODO 4.0 comments now. Sheesh ;) Patch looks great Shai! A few comments:
|
Shai Erera (@shaie) (migrated from JIRA)
Ok, I'll add it back.
I preferred not to (it's not just setAnalyzer, but also MS, Similarity and IndexDeletionPolicy). I specifically documented on each what happens if one passes null. I think it's better service - you're not expected to pass null, and so if you do, instead of throwing an exception, we revert to default. I thought at some point to add a restoreDefaults() method, but then realized that doing new IWC() is not that expensive ...
Woops, fixed that !
I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM).
IWC.toString() includes '\n' between settings. I thought it'd be more readable that way because otherwise it'd be a long line. The output for me looks like that: IW 0 [main]: dir=org.apache.lucene.store.RAMDirectory@2ca22ca2 mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@6ca06ca index= version=3.1-dev config=matchVersion=LUCENE_31
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE_OR_APPEND
maxFieldLength=2147483647
similarity=org.apache.lucene.search.DefaultSimilarity
termIndexInterval=128
mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler
default WRITE_LOCK_TIMEOUT=1000
writeLockTimeout=1000
maxBufferedDeleteTerms=-1
ramBufferSizeMB=16.0
maxBufferedDocs=-1 Perhaps I'll print "config=\n" so that all config parameters start on the new line, and not all but the first. Is that acceptable, or you still prefer all to be on one line, space separated?
I'll add that. That one slipped :).
I've thought about all the options that you raise, and decided to keep the situation as-is, to let others also comment on that. So I'm glad you commented :).
I think I'll just clone the incoming IWC on IW and leave the rest as-is? |
Shai Erera (@shaie) (migrated from JIRA) Patch with applied comments. I'll start moving the code to use the new ctor, class etc. |
Michael McCandless (@mikemccand) (migrated from JIRA) OK let's keep the semantics of "if you pass null that means restore to default". Should we rename UNLIMITED_FIELD_LENGTH -> DEFAULT_MAX_FIELD_LENGTH? (Though w/ the new issue this should be going away from IWC anyway).
Ahh, I misread it (I thought IW was doing \n followed by single space). I agree multiple lines is much better. How about making IW also do multiple lines for its non-IWC settings that are messaged? And then removing the config= from the output? Ie all I see is a bunch of settings. I don't think [in this debug printout] that we need to message which setting was on IW and which was from the IWC.
Ahh, sorry, right. Annoying we moved IW as required into MP :( Hrmph.
OK, I think this is fine, for starters. The API semantics are clearly defined, so, if at a later date we harden the enforcement (say throw IllegalStateEx if you try to change anything after IWC has been consumed by IW, that's fair game). Maybe clearly state in the jdocs that presently no settings are version dependent, so it's just a placeholder? (Basically, elevate the code comment that already says this, into the jdocs). Whenever a class in Lucene takes a Version, we should strongly document how Version alters settings, even if it's a no-op placeholder. (And I agree we should have it here as placeholder). Patch looks good Shai, thanks! |
Shai Erera (@shaie) (migrated from JIRA)
I deliberately named it like that. There is currently IW.DEFAULT_MAX_FIELD_LENGTH which is set to 10,000. I didn't want to keep the same name so as to not confuse. The default, UNLIMITED, is documented on setMaxFieldLength. But like you said, it may go entirely away in #3371.
I think I'm to blame for this :).
Done. Thanks for the comments. Continuing with converting the code. I'm flying to the US tonight, so I may not finish it before I leave. And then I'll be gone for ~30 hours before I can ping again :). |
Shai Erera (@shaie) (migrated from JIRA) While I'm converting the tests to use the new IWC, and getting rid of deprecated methods, I think it'll be useful to add a MergePolicyConfig. But since most of the setters used today assume a LogMergePolicy, and since it's also unrelated to IW, I prefer to leave it for a separate issue. Doing that, w/ the methods chaining approach, will simplify usage even more. |
Shai Erera (@shaie) (migrated from JIRA) Patch with all tests and code converted to not use the deprecated API, and move to use IWC. All except CreateIndexTask in benchmark, which looks a bit more complicated to change, I left it out for now. Another thing - I deprecated *QueryParserWrapperTest and did not convert them to use the new IWC API, as those tests will be removed when *QueryParserWrapper will be removed. All tests pass. |
Michael McCandless (@mikemccand) (migrated from JIRA) Patch looks good – thanks Shai! I'll commit in a day or two. |
Michael McCandless (@mikemccand) (migrated from JIRA) Phew! Thanks Shai... |
Michael McCandless (@mikemccand) (migrated from JIRA) Robert raised a concern on java-dev: maybe we should not have a default analyzer....? Ie force the user to make an explicit choice. |
Robert Muir (@rmuir) (migrated from JIRA) Here's my elaboration a bit: I like the concept of good, versioned defaults, but i think this is just a decision point that A practical concern that I have: Whitespace really sucks, we will get lots of questions |
Michael McCandless (@mikemccand) (migrated from JIRA) I'll revert the commit while we discuss... |
Robert Muir (@rmuir) (migrated from JIRA)
Just my opinion, but what makes it different is that breaking text into tokens Deletions is an optional feature, and if someone got rid of it alltogether, I wouldn't |
Mark Miller (@markrmiller) (migrated from JIRA)
Hmm... SOLR doesn't really use Lucene analyzers. It comes with a default Schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it. example of Solr analyzer declaration: <!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType> |
Michael McCandless (@mikemccand) (migrated from JIRA) Really, eventually, I'd like a stronger separation of analysis and Ie, the fields in the doc that need to be indexed should only present IW is then fully agnostic to how that attr source was created – one IW now has a lot of embedded logic to figure out how to get the attr But until then I agree it's dangerous to default the analyzer – user |
Shai Erera (@shaie) (migrated from JIRA)
I agree, and Whitespace is easy to explain and understand. Who said that Standard or Simple is better, for example (I know you don't say it)? I just don't understand why we should force everyone to select an Analyzer even when they don't want to, or care. Remember that all the people out there today already set an Analyzer, so when they port to the new API they'll continue setting it (unless they didn't care but were forced to do it). New users ... well I hope they read jdocs. It's good that SOLR doesn't force people to specify the Analyzer ... smart decision, for play/example purposes. |
Shai Erera (@shaie) (migrated from JIRA) Mike - regarding getting rid of Analyzer on IW. Today it's very convenient that I can specify my default Analyzer for all ANALYZED fields, for all documents. I don't understand how a Field can introduce an Analyzer - do I now set an Analyzer on every field I add to a Document? Wouldn't that be extremely inconvenient? If you want to get rid of Analyzer on IW I'm fine. We can throw away the addDocument(Doc) method and always force a user to pass an Analyzer. Much better than requiring him to pass it on every field he adds to a document ... |
Mark Miller (@markrmiller) (migrated from JIRA) I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines. |
Michael McCandless (@mikemccand) (migrated from JIRA) I opened #3385 for decoupling IW from analysis. IW would only need to know about attr source. |
Michael McCandless (@mikemccand) (migrated from JIRA) If we did #3385 then IWC wouldn't need an Analyzer, default or otherwise ;) |
Shai Erera (@shaie) (migrated from JIRA)
Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream? What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)? |
Michael McCandless (@mikemccand) (migrated from JIRA)
Well, where all of the fields have pre-defined TokenStream, or, can produce it on demand with no other args (ie that field instance knows its analyzer).
Well it'd simplify your work here, for one :) But it'd also put distance between IW and analysis, which I think is useful for others to use Lucene with their own "means" of producing tokens. Tying IW to less concrete impls also increases our independence which makes cross-version compatibility easier, over time. The more independent the various parts of Lucene are the more we can factor things out... |
Shai Erera (@shaie) (migrated from JIRA)
I'm afraid that's too late no? I've already completed that work ... it's that late realization that forces me to re-do it again ... I don't mind if IW does not know about Analyzers, I'm perfectly fine w/ it (and even like it). Suggestion: since re-doing all that work, meaning adding Whitespace to the instantiation in all the places that really don't care, is very expensive and time consuming on my part, if #3385 is resolved before 3.1, can we keep IWC as-is, and as part of 2309 remove setAnalyzer from IWC? It will keep the ctor's signature as it is now, and we won't need to change all of these classes again in 2309. Only those that explicitly call setAnalyzer. Is that acceptable? |
Shai Erera (@shaie) (migrated from JIRA) Patch w/ IWC requiring an Analyzer in its ctor. All tests pass |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks Shai! I'll look through it... |
Michael McCandless (@mikemccand) (migrated from JIRA) Patch looks good Shai – that was fast adding back in the analyzers :) A few things:
|
Shai Erera (@shaie) (migrated from JIRA)
I guess it could ... wonder how I missed it ...
I added it on purpose, so we know to convert these places when we remove LUCENE_CURRENT. We can also do this gradually in #3381.
Those were exactly the cases I wanted to move Analyzer out of the mandatory list. Whatever you pass will work for those tests, because they obviously don't rely on analysis ... otherwise they'd had been hit by NPE.
I thought that that's ok, because those tests don't rely on the output of analysis, or otherwise they should have tested/asserted it. Since they don't, it shouldn't matter to them. But I'll revert that change. So two things I need to do:
|
Shai Erera (@shaie) (migrated from JIRA) Patch updated w/:
Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them? |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks Shai, I'll look...
Yes if they are indeed false positives... |
Michael McCandless (@mikemccand) (migrated from JIRA)
Hmm some of these (at least TestAtomicUpdate was changed from Simple -> Whitespace) were in fact real changes.... I'll fix & post a new patch. |
Michael McCandless (@mikemccand) (migrated from JIRA) Attached new patch, just fixing a couple tests where analyzer had changed. I it's ready to commit (take 2)! I'll wait a day or two... |
Shai Erera (@shaie) (migrated from JIRA) Thanks Mike. I ran the tool once, fix all that it complained. Then 2nd time it found some more (probably some I missed in the 1st pass), only this time really few more. So I fixed them as well. But I didn't run it 3rd time :) ... I can't wait for this to be in ... an exhausting issue ;). |
Robert Muir (@rmuir) (migrated from JIRA)
Shai, thanks for taking the time to redo this massive patch. I'm sorry +1 |
Shai Erera (@shaie) (migrated from JIRA) Thanks a lot Robert for reviewing this. No harm done ... I've had the chance to exercise some of Eclipse tricks in the process or re-doing. Unfortunately it introduced some changes, but luckily we have Mike and his mighty-python-scripting-ability to protect us :). Now I have a bunch of other issues I need to open, that were waiting for this guy to go in. Stay tuned |
Michael McCandless (@mikemccand) (migrated from JIRA) Take 2! Thanks Shai. |
I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:
I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter.
I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...
I plan to deprecate all the ctors and getters/setters and replace them by:
BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.
I'll start to work on a patch.
Migrated from LUCENE-2294 by Shai Erera (@shaie), resolved Mar 13 2010
Attachments: check.py, LUCENE-2294.patch (versions: 6)
Linked issues:
The text was updated successfully, but these errors were encountered: