Separately specify a field's type [LUCENE-2308] #3384

Closed
asfimport opened this issue Mar 10, 2010 · 200 comments

Comments

@asfimport

asfimport commented Mar 10, 2010

This came up from discussions on IRC. I'm summarizing here...

Today when you make a Field to add to a document you can set things
like indexed or not, stored or not, analyzed or not, details like omitTfAP,
omitNorms, index term vectors (separately controlling
offsets/positions), etc.

I think we should factor these out into a new class (FieldType?).
Then you could re-use this FieldType instance across multiple fields.

The Field instance would still hold the actual value.

We could then do per-field analyzers by adding a setAnalyzer on the
FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
for per-field codecs (with flex), where we now have
PerFieldCodecWrapper).

This would NOT be a schema! It's just refactoring what we already
specify today. EG it's not serialized into the index.

This has been discussed before, and I know Michael Busch opened a more
ambitious (I think?) issue. I think this is a good first baby step. We could
consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
off on that for starters...
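
For readers skimming the thread, here is a minimal Java sketch of the split proposed above. `SketchFieldType` and `SketchField` are hypothetical names chosen for illustration; they are not the classes from any of the attached patches, and the four properties are just a sample of the ones listed.

```java
// Hypothetical sketch: "type-like" configuration lives in one object,
// the per-document value lives in the field.
final class SketchFieldType {
    final boolean indexed;
    final boolean stored;
    final boolean analyzed;
    final boolean omitNorms;

    SketchFieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
        this.indexed = indexed;
        this.stored = stored;
        this.analyzed = analyzed;
        this.omitNorms = omitNorms;
    }
}

final class SketchField {
    final String name;
    final String value;          // the actual value stays on the field
    final SketchFieldType type;  // shared configuration, held by reference

    SketchField(String name, String value, SketchFieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}

class SharedTypeDemo {
    public static void main(String[] args) {
        // One type instance, re-used across two fields.
        SketchFieldType text = new SketchFieldType(true, false, true, false);
        SketchField title = new SketchField("title", "first value", text);
        SketchField body = new SketchField("body", "second value", text);
        System.out.println(title.type == body.type);
    }
}
```

Nothing here is serialized anywhere, which matches the "not a schema" point: the type object is only a holder for settings that are already passed to Field today.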


Migrated from LUCENE-2308 by Michael McCandless (@mikemccand), 2 votes, resolved Mar 18 2013
Attachments: LUCENE-2308.branchdiffs, LUCENE-2308.branchdiffs.moved, LUCENE-2308.patch (versions: 5), LUCENE-2308-10.patch, LUCENE-2308-11.patch, LUCENE-2308-12.patch, LUCENE-2308-13.patch, LUCENE-2308-14.patch, LUCENE-2308-15.patch, LUCENE-2308-16.patch, LUCENE-2308-17.patch, LUCENE-2308-18.patch, LUCENE-2308-19.patch, LUCENE-2308-2.patch, LUCENE-2308-20.patch, LUCENE-2308-21.patch, LUCENE-2308-3.patch, LUCENE-2308-4.patch, LUCENE-2308-5.patch, LUCENE-2308-6.patch, LUCENE-2308-7.patch, LUCENE-2308-8.patch, LUCENE-2308-9.patch, LUCENE-2308-branch.patch, LUCENE-2308-final.patch, LUCENE-2308-FT-interface.patch (versions: 4), LUCENE-2308-ltc.patch, LUCENE-2308-merge-1.patch, LUCENE-2308-merge-2.patch, LUCENE-2308-merge-3.patch

@asfimport

Chris Male (migrated from JIRA)

Hi Mike,

+1 to this idea.

Do you envisage FieldType instances being immutable or would you be able to change the Analyzer on a FieldType? If they are mutable, would you see FieldType instances being shared across multiple Fields? Or would each Field have its own FieldType instance?

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

I think immutable & shareable across Field instances for sure and presumably also across different fields?

And maybe we should have some hierarchy, eg analyzed or not.

I think it's important that we contain this to the baby steps (eg not ambitiously make a huge type hierarchy) – it really is just pulling out the "type-like" configuration from Field, leaving just the actual value of the field on Field.

@asfimport

Chris Male (migrated from JIRA)

Yeah I agree with the immutability and shareability.

I'll give this a shot, taking the babiest of baby steps.

@asfimport

Robert Muir (@rmuir) (migrated from JIRA)

> details like omitTfAP, omitNorms

personal pet peeve, i wonder if we could consider improving on 'omit' here,
I think things like omit(false), disable(false) are a little awkward.

@asfimport

Chris Male (migrated from JIRA)

So you are thinking more along the lines of indexNorms(true|false)?

@asfimport

Robert Muir (@rmuir) (migrated from JIRA)

> So you are thinking more along the lines of indexNorms(true|false)?

or whatever you come up with, that doesn't create double-negatives!
but yeah, i think something like that is a little easier... no big deal
just figured I would bring it up if this stuff was getting refactored anyway

@asfimport

Chris Male (migrated from JIRA)

I agree entirely. This is definitely the moment to remove any ambiguity or confusion in this API. I'll make sure to incorporate this idea.

@asfimport

Marvin Humphrey (migrated from JIRA)

I think we might consider matchOnly() instead of omitNorms(). If a field is
"match only", we don't need boost bytes a.k.a. "norms" because they are only
used as a scoring multiplier.

Haven't got a good synonym for "omitTFAP", but I'd sure like one.

@asfimport

Shai Erera (@shaie) (migrated from JIRA)

How about enable(TYPE/FEATURE) and corresponding disable? So Type/Feature will have NORMS, TF, POSITIONS and calls would look like:
f.enable(Type.NORMS), f.disable(Type.TF)?
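
A rough Java sketch of what such an enable/disable API could look like. Only the constants NORMS, TF and POSITIONS come from the comment above; the `Feature` enum name, the `FeatureField` class, and the "everything enabled by default" choice are assumptions made for illustration.

```java
import java.util.EnumSet;

// Hypothetical enum; the suggestion above calls it "Type/Feature".
enum Feature { NORMS, TF, POSITIONS }

// Hypothetical holder for the per-field feature set.
class FeatureField {
    // Assumption for this sketch: all features start out enabled.
    private final EnumSet<Feature> enabled = EnumSet.allOf(Feature.class);

    FeatureField enable(Feature f) {
        enabled.add(f);
        return this;
    }

    FeatureField disable(Feature f) {
        enabled.remove(f);
        return this;
    }

    boolean isEnabled(Feature f) {
        return enabled.contains(f);
    }
}
```

A call site would then read `f.enable(Feature.NORMS)` or `f.disable(Feature.TF)`, which avoids the double negative of `setOmitNorms(false)`. Note that this sketch does not enforce any dependency between TF and POSITIONS.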

@asfimport

Robert Muir (@rmuir) (migrated from JIRA)

Just also to mention (probably too much for this one issue)!

I think it would be nice if OmitTF was separately selectable
from OmitPositions, as Shai implied. We would have to
actually implement this though I think!

@asfimport

Marvin Humphrey (migrated from JIRA)

If you disable term freq, you also have to disable positions. The "freq"
tells you how many positions there are.

I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers
have difficulty mastering all the subtleties.

@asfimport

asfimport commented Mar 12, 2010

Robert Muir (@rmuir) (migrated from JIRA)

> If you disable term freq, you also have to disable positions. The "freq"
> tells you how many positions there are.

Marvin: as stated, we would have to actually implement this.
There's an issue open for it too: #3123.
I was just discussing this with someone the other day.

> I think it's asking an awful lot of our users to require that they understand
> all the implications of posting format modifications when committers
> have difficulty mastering all the subtleties.

I don't know what I did to piss you off, but I just thought it would be nice
for completeness, to mention that this feature is still open and it's
something we should think about.

@asfimport

Marvin Humphrey (migrated from JIRA)

I'm simply suggesting that the proposed API is too hard to understand.

Most users know whether their fields can be "match-only" but have no idea what
TFAP is. And even advanced users will have difficulty understanding all the
implications for matching and scoring when they selectively disable portions
of the posting format.

I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags).
Something that ordinary users can grok would be used more often and more
effectively.

@asfimport

Chris Male (migrated from JIRA)

What I covered with Mike earlier was whether FieldType instances would be immutable or not.

If they are, which seems a good idea, then everything will be enabled/disabled in the construction of the FieldType so we would only need to support property getter methods.

@asfimport

asfimport commented Mar 12, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm one challenge with making FieldType immutable is.... we don't want
a zillion ctors over time. Also creating a FieldType with args like
new FieldType(true, false, false) isn't really readable.

It would be nice if we could do something similar to IndexWriterConfig
(#3370), where you use incremental ctor/setters to set up the
configuration but then once it's used ("bound" to a Field), it's
immutable.

I'm torn on naming: yes, search-oriented names like "matchOnly" are
tempting, but then we really should tease apart termFreq and positions
(they are stuck together now with omitTFAP). And the two are not
fully independent as Marvin noted – so maybe we use a cryptic enum
(DOCS, DOCS_TERM_FREQ, DOCS_TERM_FREQ_POSITIONS)? If we can only find
better names...

I'm not sure we can/should find better index-time names. What is
stored in the index is relatively independent from how/whether
searches make use of it. EG if you store termFreq (but not positions)
you can still do match only searching, or, you can do full scoring of
the query. You can't use positional queries.
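
To illustrate the enum idea, here is a small Java sketch. The three constant names are the ones floated above; the `PostingsDetail` type name and the two helper methods are hypothetical. Because each level includes the previous one, the "positions require term freq" dependency is expressed by the ordering itself.

```java
// Hypothetical enum: each constant stores everything the previous one does, plus more.
enum PostingsDetail {
    DOCS,
    DOCS_TERM_FREQ,
    DOCS_TERM_FREQ_POSITIONS;

    // Term frequencies are present from DOCS_TERM_FREQ upward.
    boolean hasTermFreq() {
        return compareTo(DOCS_TERM_FREQ) >= 0;
    }

    // Positions are present only at the highest level.
    boolean hasPositions() {
        return compareTo(DOCS_TERM_FREQ_POSITIONS) >= 0;
    }
}
```

With this shape there is no way to ask for positions without term freq, which is the invalid combination two independent booleans would allow.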

@asfimport

asfimport commented Mar 12, 2010

Marvin Humphrey (migrated from JIRA)

> Also creating a FieldType with args like
> new FieldType(true, false, false) isn't really readable.

Agreed. Another option would be a "flags" integer and bitwise constants:

FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);

> It would be nice if we could do something similar to IndexWriterConfig
> (#3370), where you use incremental ctor/setters to set up the
> configuration but then once it's used ("bound" to a Field), it's
> immutable.

I bet that'll be more popular than flags, but I thought it was worth
bringing it up anyway. :)
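
To make the flags option concrete, a minimal Java sketch follows. `FlagFieldType`, the `ANALYZED` constant and the `has` method are hypothetical; the analyzer argument from the one-liner above is left out to keep the example self-contained.

```java
// Hypothetical flags-based type; not taken from any attached patch.
final class FlagFieldType {
    static final int INDEXED  = 1;       // bit 0
    static final int STORED   = 1 << 1;  // bit 1
    static final int ANALYZED = 1 << 2;  // bit 2

    private final int flags;

    FlagFieldType(int flags) {
        this.flags = flags;
    }

    // True if the given single-bit flag is set.
    boolean has(int flag) {
        return (flags & flag) != 0;
    }
}
```

`new FlagFieldType(FlagFieldType.INDEXED | FlagFieldType.STORED)` reads better than a run of positional booleans, and the instance is immutable once constructed.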

@asfimport

Earwin Burrfoot (migrated from JIRA)

I'm strongly against names like 'matchOnly'. They are perfectly fine in some 'schema' layer over Lucene, but here, in lowlevel guts, I'd prefer names that clearly state what the hell do they do, without forcing me to consult javadocs/code.

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

For the non-expert user, it's just a label and won't have much meaning regardless of what it's called, and they will need to consult the docs. Of course, if one starts to dig deeper, "norms" actually does have a physical meaning in the index, so preferring a label with "norms" in it seems completely reasonable.

There's also history to consider - when you change the name of something, you cut the link to the past in search engines, and in the memories of many developers.

As it relates to Solr - I don't care so much since it makes sense for the Solr schema to isolate these changes and stick with "omitNorms" regardless.

@asfimport

asfimport commented Mar 12, 2010

Chris Male (migrated from JIRA)

> It would be nice if we could do something similar to IndexWriterConfig
> (#3370), where you use incremental ctor/setters to set up the
> configuration but then once it's used ("bound" to a Field), it's
> immutable.

Yeah we could use something like a FieldTypeBuilder which could provide a fluent interface for specifying each property, which then gets built into an immutable FieldType at the end.
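
A minimal Java sketch of that builder shape, for illustration only. `BuiltFieldType`, its nested `Builder`, the three properties, and the default for `norms` are all assumptions; no such class is confirmed by the thread.

```java
// Hypothetical immutable type produced by a fluent builder.
final class BuiltFieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean norms;

    private BuiltFieldType(Builder b) {
        this.indexed = b.indexed;
        this.stored = b.stored;
        this.norms = b.norms;
    }

    // Getters only: nothing can change after build().
    boolean indexed() { return indexed; }
    boolean stored()  { return stored; }
    boolean norms()   { return norms; }

    static final class Builder {
        private boolean indexed;
        private boolean stored;
        private boolean norms = true; // assumed default for this sketch

        Builder indexed(boolean v) { indexed = v; return this; }
        Builder stored(boolean v)  { stored = v; return this; }
        Builder norms(boolean v)   { norms = v; return this; }

        BuiltFieldType build() {
            return new BuiltFieldType(this);
        }
    }
}
```

Usage would look like `new BuiltFieldType.Builder().indexed(true).stored(true).norms(false).build()`. Each property is named at the call site and new properties can be added without new constructors.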

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

I'm not sure if strict immutability is necessary - there's everything in between too.
One can simply say that all changes should be made before first use, and after that point it's undefined.

Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field "foo" from one document to the next?

@asfimport

Chris Male (migrated from JIRA)

> I'm not sure if strict immutability is necessary - there's everything in between too.
> One can simply say that all changes should be made before first use, and after that point it's undefined.

I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields. Perhaps this really is just an edge case though.

> Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field "foo" from one document to the next?

Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay?

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

> I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields.

I will, if I can (provided the FieldType does not contain the field name). That shouldn't have anything to do with immutability though.

> Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay?

new Field instances should be fine - it's not really my use case anyway. But we're designing for the 1000's of use cases that are out there and we should be careful about adding new constraints.

@asfimport

Chris Male (migrated from JIRA)

> I will, if I can (provided the FieldType does not contain the field name). That shouldn't have anything to do with immutability though.

Yeah the field name will stay inside the Field. To me the reuse issue relates to immutability in that a change to a property in one FieldType after construction means the change affects all the Fields that use that type.

But as you say, if we document that it's best to set everything at instantiation and that whatever happens after that is undefined, then I imagine it'll be fine.

> new Field instances should be fine - it's not really my use case anyway. But we're designing for the 1000's of use cases that are out there and we should be careful about adding new constraints.

Yeah I appreciate that this API will be used in lots of different ways. Baby steps as Mike said :) But to answer your question, yes the flexibility will remain.

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

Of course... given that Fieldable is an interface, one could create an implementation that just delegated all the calls like omitNorms to a shared instance, except for the value part. Add a getAnalyzer() method to Fieldable, and it's the same thing in the end?
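
A small Java sketch of that delegation idea. `MiniFieldable` is a hypothetical, trimmed-down stand-in for Fieldable (the real interface has many more methods), and `SharedSettings` and `DelegatingField` are invented names.

```java
// Hypothetical stand-in for Fieldable, reduced to four methods.
interface MiniFieldable {
    String name();
    String stringValue();
    boolean isStored();
    boolean getOmitNorms();
}

// Hypothetical shared settings object, re-used by many fields.
final class SharedSettings {
    final boolean stored;
    final boolean omitNorms;

    SharedSettings(boolean stored, boolean omitNorms) {
        this.stored = stored;
        this.omitNorms = omitNorms;
    }
}

// Holds only the name and value itself; every property call is delegated.
final class DelegatingField implements MiniFieldable {
    private final String name;
    private final String value;
    private final SharedSettings settings;

    DelegatingField(String name, String value, SharedSettings settings) {
        this.name = name;
        this.value = value;
        this.settings = settings;
    }

    public String name()          { return name; }
    public String stringValue()   { return value; }
    public boolean isStored()     { return settings.stored; }
    public boolean getOmitNorms() { return settings.omitNorms; }
}
```

From the indexer's point of view this behaves like any other implementation of the interface, which is the sense in which it ends up being "the same thing".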

@asfimport

David Smiley (@dsmiley) (migrated from JIRA)

I'm surprised to barely see a mention of Solr here, which of course already has a FieldType. Might it be ported?

@asfimport

Simon Willnauer (@s1monw) (migrated from JIRA)

Brief Summary for GSoC Students:

FieldType aims on the one hand to separate field properties from the
actual value and on the other make Field's extensibility easier. Both
seem equally important while far from easy to achieve. Fieldable and
Field are a core API and changes to them need to be well thought out. Further,
this issue can easily cause drastic performance degradation if not
done right. Consider this as a massive change since fields are used
almost all over lucene and solr.

@asfimport

Simon Willnauer (@s1monw) (migrated from JIRA)

> I'm surprised to barely see a mention of Solr here, which of course already has a FieldType. Might it be ported?

Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.)

@asfimport

Robert Muir (@rmuir) (migrated from JIRA)

> Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.)

Then why do we still have merged codebases?
If this is the way things are, then we should un-merge the two projects.

because as a lucene developer, i spend a lot of time trying to do my part to fix various things in Solr... if it's a one-way-street then we need to un-merge.

@asfimport

Chris Male (migrated from JIRA)

> I'm surprised to barely see a mention of Solr here, which of course already has a FieldType. Might it be ported?

I think there is a lot of overlap but Solr's FieldTypes also integrate with its schema through SchemaField so maybe it's an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration/Solr specific functions?

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

> I think there is a lot of overlap but Solr's FieldTypes also integrate with its schema through SchemaField so maybe it's an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration/Solr specific functions?

Yeah, that seems reasonable.

@asfimport

Chris Male (migrated from JIRA)

I would like it to be but Field ensures any FieldType passed in is frozen by calling freeze() which is a CoreFieldType notion. This is sort of messy and is a concern I have with the freezable state idea. If we removed this and let Field assume state was immutable/frozen/whatever then we could use the interface yes.

@asfimport

Chris Male (migrated from JIRA)

Anyone else have any thoughts? Any objections to committing this patch as a first step?

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

Instead of introducing a dependency on CoreFieldType in many places (only to have to change it back later when some sort of consensus is finally reached), it would seem much cleaner to either

  • remove freeze() until we decide on the right approach
  • move freeze() to the FieldType interface temporarily (and remove it later if the approach changes)

The other changes in the patch look fine.

@asfimport

Chris Male (migrated from JIRA)

Patch updated following Yonik's advice. I've removed the freeze() calls from Field so that it can now accept a FieldType instance. If freezing is important, it's up to the creator of the CoreFieldType.

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

I guess it's OK to remove the auto-freeze from Field... it's sort of sneaky to do that.

But, this means we've opened up the trap (where users change a FT after using it on a field, thinking/assuming somehow that the Field took a copy). Chris can you fix the jdocs on Field ctors to make it clear that the Field instance holds a ref to the provided FT and so any changes later made to that FT will affect the Field instance?
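
The trap can be shown with a few lines of Java. `MutableType` and `RefHoldingField` are hypothetical classes written for this illustration; they only mimic the "Field holds a reference, not a copy" behavior described above.

```java
// Hypothetical mutable type with a single property.
class MutableType {
    private boolean stored;

    void setStored(boolean v) { stored = v; }
    boolean stored() { return stored; }
}

// Hypothetical field that keeps a reference to the type it was given.
class RefHoldingField {
    private final String name;
    private final MutableType type; // a reference, not a defensive copy

    RefHoldingField(String name, MutableType type) {
        this.name = name;
        this.type = type;
    }

    String name() { return name; }
    boolean stored() { return type.stored(); }
}

class AliasTrapDemo {
    public static void main(String[] args) {
        MutableType type = new MutableType();
        type.setStored(true);
        RefHoldingField f = new RefHoldingField("title", type);

        type.setStored(false);           // changed after the field was created
        System.out.println(f.stored());  // prints "false": the field sees the change
    }
}
```

A user who assumed the constructor copied the type would expect the field to still be stored here, hence the request to spell this out in the javadocs.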

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

Should we also move numeric(), numericDataType() and maybe
docValuesType() into oal.index.FieldType? (We can do this as a
separate issue though).

I also like Marvin's/Robert's suggestion of using int flags for all
these booleans (also a separate issue!).

We lost the jdocs on each of the boolean methods (indexed(), stored(),
etc.).

Maybe name oal.index's FT to IndexableFieldType? And then drop Core from
oal.document's? Ie, oal.document.FieldType and
oal.index.IndexableFieldType? (Aren't we going to shortly need
oal.index.StorableFieldType?).

Also fix the jdocs for CoreFT.freeze – it still claims Field will
auto-freeze.

@asfimport

Chris Male (migrated from JIRA)

I will make the appropriate javadoc changes right now.

> Should we also move numeric(), numericDataType() and maybe
> docValuesType() into oal.index.FieldType? (We can do this as a
> separate issue though).

Yup.

> I also like Marvin's/Robert's suggestion of using int flags for all
> these booleans (also a separate issue!).

I like them too. Let's do that.

> Maybe name oal.index's FT to IndexableFieldType? And then drop Core from
> oal.document's? Ie, oal.document.FieldType and
> oal.index.IndexableFieldType? (Aren't we going to shortly need
> oal.index.StorableFieldType?).

Good idea. It's going to reduce this patch size considerably.

@asfimport

Chris Male (migrated from JIRA)

New patch based on the feedback from Mike.

  • Field now includes a class level jdoc saying it's recommended that no changes are made to FieldTypes after a Field is created.
  • FieldType is now IndexableFieldType and CoreFieldType has gone back to FieldType.
  • FieldType.freeze() no longer mentions auto-freezing, however it does recommend freeze() is called once properties have been set.

We're all green so I'm looking to commit this shortly and spin off the remaining changes.
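
For anyone unfamiliar with the freeze idea being debated, here is a minimal Java sketch of the pattern: setters work until freeze() is called and throw afterwards. `FreezableType`, its two properties and the private `checkIfFrozen` helper are assumptions made for illustration, not the committed code.

```java
// Hypothetical freezable type: mutable while being configured, read-only after freeze().
class FreezableType {
    private boolean frozen;
    private boolean indexed;
    private boolean stored;

    // After this call every setter throws.
    void freeze() {
        frozen = true;
    }

    private void checkIfFrozen() {
        if (frozen) {
            throw new IllegalStateException("this type is frozen and can no longer be changed");
        }
    }

    void setIndexed(boolean v) { checkIfFrozen(); indexed = v; }
    void setStored(boolean v)  { checkIfFrozen(); stored = v; }

    boolean indexed() { return indexed; }
    boolean stored()  { return stored; }
}
```

Since Field no longer calls freeze() itself, calling it is left to whoever creates the type, which is why the javadoc only recommends it.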

@asfimport

Uwe Schindler (@uschindler) (migrated from JIRA)

I am not green but gave up due to vacation. I am still against freeze, but my complaints are ignored.

@asfimport

Chris Male (migrated from JIRA)

Far from it Uwe, your complaints are being actively taken into consideration and we have every intention to open a new issue to move away from freeze (see Mike's comments). I'm just wanting to take one step at a time.

@asfimport

Uwe Schindler (@uschindler) (migrated from JIRA)

> I am not green but gave up due to vacation. I am still against freeze, but my complaints are ignored.

This sentence is too funny :-) I don't agree and I am not happy with the whole stuff. As Simon seems to be silent, there is nothing more I can do with my limited time. I still favour the builder approach, and this API looks like the old one coming back...

@asfimport

Chris Male (migrated from JIRA)

Final patch before committing. This includes a change to MIGRATE.txt.

I notice we don't have a CHANGES.txt entry anywhere, so I'll add that upon committing.

@asfimport

Chris Male (migrated from JIRA)

Committed revision 1167668.

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

We're coming up on 4.0, and it doesn't seem like there ever was a consensus here wrt immutability.
I'm also still in favor of removing freeze.

@asfimport

Simon Willnauer (@s1monw) (migrated from JIRA)

Can we close this issue? It seems that, except for Yonik's last comment, everything else has been resolved.

@asfimport

Chris M. Hostetter (@hossman) (migrated from JIRA)

bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

@asfimport

Robert Muir (@rmuir) (migrated from JIRA)

rmuir20120906-bulk-40-change

@asfimport

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Michael McCandless
http://svn.apache.org/viewvc?view=revision&revision=1389535

LUCENE-2308: add MIGRATE.txt entry about Document.setBoost

@asfimport

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.
