-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separately specify a field's type [LUCENE-2308] #3384
Comments
Chris Male (migrated from JIRA) Hi Mike, +1 to this idea. Do you envisage FieldType instances being immutable or would you be able to change the Analyzer on a FieldType? If they are mutable, would you see FieldType instances being shared across multiple Fields? Or would each Field have its own FieldType instance? |
Michael McCandless (@mikemccand) (migrated from JIRA) I think immutable & shareable across Field instances for sure and presumably also across different fields? And maybe we should have some hierarchy, eg analyzed or not. I think it's important that we contain this to the baby steps (eg not ambitiously make a huge type hierarchy) – it really is just pulling out the "type-like" configuration from Field, leaving just the actual value of the field on Field. |
Chris Male (migrated from JIRA) Yeah I agree with the immutability and shareability. I'll give this a shot, with taking the babiest of baby steps. |
Robert Muir (@rmuir) (migrated from JIRA)
personal pet peeve, i wonder if we could consider improving on 'omit' here, |
Chris Male (migrated from JIRA) So you are thinking more along the lines indexNorms(true|false)? |
Robert Muir (@rmuir) (migrated from JIRA)
or whatever you come up with, that doesn't create double-negatives! |
Chris Male (migrated from JIRA) I agree entirely. This is definitely the moment to remove any ambiguity or confusion in this API. I'll make sure to incorporate this idea. |
Marvin Humphrey (migrated from JIRA) I think we might consider matchOnly() instead of omitNorms(). If a field is Haven't got a good synonym for "omitTFAP", but I'd sure like one. |
Shai Erera (@shaie) (migrated from JIRA) How about enable(TYPE/FEATURE) and corresponding disable? So Type/Feature will have NORMS, TF, POSITIONS and calls would look like: |
Robert Muir (@rmuir) (migrated from JIRA) Just also to mention (probably too much for this one issue)! I think it would be nice of OmitTF was separately selectable |
Marvin Humphrey (migrated from JIRA) If you disable term freq, you also have to disable positions. The "freq" I think it's asking an awful lot of our users to require that they understand |
Robert Muir (@rmuir) (migrated from JIRA)
Marvin: as stated, we would have to actually implement this.
I don't know what I did to piss you off, but I just thought it would be nice |
Marvin Humphrey (migrated from JIRA) I'm simply suggesting that the proposed API is too hard to understand. Most users know whether their fields can be "match-only" but have no idea what I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags). |
Chris Male (migrated from JIRA) What I covered with Mike earlier was whether FieldType methods would be immutable or not. If they are, which seems a good idea, then everything will be enabled/disabled in the construction of the FieldType so we would only need to support property getter methods. |
Michael McCandless (@mikemccand) (migrated from JIRA) Hmm one challenge with making FieldType immutable is.... we don't want It would be nice if we could do something similar to IndexWriterConfig I'm torn on naming: yes, search-oriented names like "matchOnly" is I'm not sure we can/should find better index-time names. What is |
Marvin Humphrey (migrated from JIRA) > Also creating a FieldType with args like Agreed Another option would be a "flags" integer and bitwise constants: FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED); > It would be nice if we could do something similar to IndexWriterConfig I bet that'll be more popular than flags, but I thought it was worth |
Earwin Burrfoot (migrated from JIRA) I'm strongly against names like 'matchOnly'. They are perfectly fine in some 'schema' layer over Lucene, but here, in lowlevel guts, I'd prefer names that clearly state what the hell do they do, without forcing me to consult javadocs/code. |
Yonik Seeley (@yonik) (migrated from JIRA) For the non-expert user, it's just a label and won't have much meaning regardless of what it's called, and they will need to consult the docs. Of course, if one starts to dig deeper, "norms" actually does have a physical meaning in the index, so preferring a label with "norms" in it seems completely reasonable. There's also history to consider - when you change the name of something, you cut the link to the past in search engines, and in the memories of many developers. As it relates to Solr - I don't care so much since it makes sense for the Solr schema to isolate these changes and stick with "omitNorms" regardless. |
Chris Male (migrated from JIRA)
Yeah we could use something like a FieldTypeBuilder which could provide a fluid interface for specifying each property, which then get built into an immutable FieldType at the end. |
Yonik Seeley (@yonik) (migrated from JIRA) I'm not sure if strict immutability is necessary - there's everything in between too. Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field "foo" from one document to the next? |
Chris Male (migrated from JIRA)
I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields. Perhaps this really is just an edge case though.
Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay? |
Yonik Seeley (@yonik) (migrated from JIRA)
I will, if I can (provided the FieldType does not contain the field name). That shouldn't have anything to do with immutability though.
new Field instances should be fine - it's not really my use case anyway. But we're designing for the 1000's of use cases that are out there and we should be careful about adding new constraints. |
Chris Male (migrated from JIRA)
Yeah the field name will stay inside the Field. To me the reuse issue relates immutability in that a change to a property in one FieldType after construction means the change effects all the Fields that use that type. But as you say, if we document that its best to set everything at instantiation and that whatever happens after that is undefined, then I imagine it'll be fine.
Yeah I appreciate that this API will be used in lots of different ways. Baby steps as Mike said :) But to answer your question, yes the flexibility will remain. |
Yonik Seeley (@yonik) (migrated from JIRA) Of course... given that Fieldable is an interface, one could create an implementation that just delegated all the calls like omitNorms to a shared instance, except for the value part. Add a getAnalyzer() method to Fieldable, and it's the same thing in the end? |
David Smiley (@dsmiley) (migrated from JIRA) I'm surprised to barely even see a mention to Solr here which already, of course obviously, already has a FieldType. Might it be ported? |
Simon Willnauer (@s1monw) (migrated from JIRA) Brief Summary for GSoC Students: FieldType aims on the one hand to separate field properties from the |
Simon Willnauer (@s1monw) (migrated from JIRA)
Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.) |
Robert Muir (@rmuir) (migrated from JIRA)
Then why do we still have merged codebases? because as a lucene developer, i spend a lot of time trying to do my part to fix various things in Solr... if its a one-way-street then we need to un-merge. |
Chris Male (migrated from JIRA)
I think there is a lot of overlap but Solr's FieldTypes also integrate with its schema through SchemaField so maybe its an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration/Solr specific functions? |
Yonik Seeley (@yonik) (migrated from JIRA)
Yeah, that seems reasonable. |
Chris Male (migrated from JIRA) I would like it to be but Field ensures any FieldType passed in is frozen by calling freeze() which is a CoreFieldType notion. This is sort of messy and is a concern I have with the freezable state idea. If we removed this and let Field assume state was immutable/frozen/whatever then we could use the interface yes. |
Chris Male (migrated from JIRA) Anyone else have any thoughts? Any objections to committing this patch as a first step? |
Yonik Seeley (@yonik) (migrated from JIRA) Instead of introducing a dependency on CoreFieldType in many places (only to have to change it back later when some sort of consensus is finally reached), it would seem much cleaner to either
The other changes in the patch look fine. |
Chris Male (migrated from JIRA) Patch updated following Yonik's advice. I'd removed the freeze() calls from Field so that it can now accept a FieldType instance. If freezing is important, its up to the created of the CoreFieldType. |
Michael McCandless (@mikemccand) (migrated from JIRA) I guess it's OK to remove the auto-freeze from Field... it's sort of sneaky to do that. But, this means we've opened up the trap (where users change a FT after using it on a field, thinking/assuming somehow that the Field took a copy). Chris can you fix the jdocs on Field ctors to make it clear that the Field instances holds a ref to the provided FT and so any changes later made to that FT will affect the Field instance? |
Michael McCandless (@mikemccand) (migrated from JIRA) Should we also move numeric(), numericDataType() and maybe I also like Marvin's/Robert's suggestion of using int flags for all We lost the jdocs on each of the boolean methods (indexed(), stored(), Maybe name oal.index's FT to IndexableFieldType? And then drop Core from Also fix the jdocs for CoreFT.freeze – it still claims Field will |
Chris Male (migrated from JIRA) I will make the appropriate javadoc changes right now.
Yup.
I like them too. Lets do that.
Good idea. Its going to reduce this patch size considerably. |
Chris Male (migrated from JIRA) New patch based on the feedback from Mike.
We're all green so I'm looking to commit this shortly and spin off the remaining changes. |
Uwe Schindler (@uschindler) (migrated from JIRA) I am not green but gave up due to vacation. I am still against freeze, but my complaints are ignored. |
Chris Male (migrated from JIRA) Far from it Uwe, your complaints are being actively taken into consideration and we have every intention to open a new issue to move away from freeze (see Mike's comments). I'm just wanting to take one step at a time. |
Uwe Schindler (@uschindler) (migrated from JIRA)
This sentence is too funny :-) I don't agree and I am not happy with the whole stuff. As Simon seems to be silent, so there is nothing I can do anymore with my limited time. I still favour the builder approach, and this API looks like the old one coming back... |
Chris Male (migrated from JIRA) Final patch before committing. This includes a change to MIGRATE.txt. I notice we don't have a CHANGES.txt entry anywhere, so I'll add that upon committing |
Chris Male (migrated from JIRA) Committed revision 1167668. |
Yonik Seeley (@yonik) (migrated from JIRA) We're coming up on 4.0, and it doesn't seem like there ever was a consensus here wrt immutability. |
Simon Willnauer (@s1monw) (migrated from JIRA) can we close this issue? seems like except of yoniks last comment everything else has been resolved? |
Chris M. Hostetter (@hossman) (migrated from JIRA) bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment |
Robert Muir (@rmuir) (migrated from JIRA) rmuir20120906-bulk-40-change |
Commit Tag Bot (migrated from JIRA) [branch_4x commit] Michael McCandless LUCENE-2308: add MIGRATE.txt entry about Document.setBoost |
Uwe Schindler (@uschindler) (migrated from JIRA) Closed after release. |
This came up from dicussions on IRC. I'm summarizing here...
Today when you make a Field to add to a document you can set things
index or not, stored or not, analyzed or not, details like omitTfAP,
omitNorms, index term vectors (separately controlling
offsets/positions), etc.
I think we should factor these out into a new class (FieldType?).
Then you could re-use this FieldType instance across multiple fields.
The Field instance would still hold the actual value.
We could then do per-field analyzers by adding a setAnalyzer on the
FieldType, instead of the separate PerFieldAnalzyerWrapper (likewise
for per-field codecs (with flex), where we now have
PerFieldCodecWrapper).
This would NOT be a schema! It's just refactoring what we already
specify today. EG it's not serialized into the index.
This has been discussed before, and I know Michael Busch opened a more
ambitious (I think?) issue. I think this is a good first baby step. We could
consider a hierarchy of FIeldType (NumericFieldType, etc.) but maybe hold
off on that for starters...
Migrated from LUCENE-2308 by Michael McCandless (@mikemccand), 2 votes, resolved Mar 18 2013
Attachments: LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch (versions: 5), LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-FT-interface.patch (versions: 4), LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch
Linked issues:
Sub-tasks:
The text was updated successfully, but these errors were encountered: