-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce Fieldable, AbstractField and Field complexity [LUCENE-2310] #3386
Comments
Michael McCandless (@mikemccand) (migrated from JIRA) +1 |
Chris Male (migrated from JIRA) The challenge presented in this work is the pervasiveness of the Fieldable class. Its used in several hundred places through the source, but the majority are in tests, and in Document itself. Therefore part of this work will be also to move as many of the tests over to using Field, and working on the Document API as well. |
Chris Male (migrated from JIRA) Attaching first version of the patch which deprecates AbstractField.
|
Uwe Schindler (@uschindler) (migrated from JIRA) You should also not be able to set the TokenStream in NF. IMO, i would keep AbstractField and only remove Fieldable, as interfaces are not wanted in Lucene. -1 for this patch in its current form. |
Chris Male (migrated from JIRA)
Yes good point.
Actually I would like to remove both actually. There doesn't seem much reason to keep AbstractField, especially since its already dependent on Field.XYZ and seems only to only store all the various properties, most of which will be moved away to FieldType anyway. Would a compromise be to also add an UOE to setting the TokenStream in NumericField? It does still have the concept of a TokenStream, so it is a Field, but a specialisation which handles the TokenStream itself. |
Tim Smith (migrated from JIRA) Personally, i like keeping Fieldable, (or having AbstractField just with abstract methods and no actual implementation) for feeding documents, i use custom Fieldable implementations to reduce amount of setters called, as Fields of different types have different constant settings |
Earwin Burrfoot (migrated from JIRA) These settings will go to FieldType? |
Chris Male (migrated from JIRA) Hi Tim, Yeah I see what you are saying, but as Earwin says, the 'settings' will be pushed into the FieldType, so they'll be removed from Fieldable as well. |
Chris Male (migrated from JIRA) I should note, to prevent confusion, that my patch is just the beginning of this work, designed to illustrate the direction I'm heading. |
Chris Male (migrated from JIRA) Addressed the issues raised by Uwe about the TokenStream in NumericField. NumericField now throws a UOE on setTokenStream. Since it also extends Field which has its own TokenStream field, NumericField now uses the field from TokenStream rather than its own. The more and more this is discussed the clearer it is that Field should be the base class of the Field hierarchy, and not AbstractField or Fieldable. The issue of having all the setters and configurations will be addressed in LUENE-2308 when we move them all to FieldType. Field will become a simple tuple consisting of at least a value and type, and possibly a TokenStream. NumericField and LazyField are customisations of Field controlling certain aspects of the tuple. For NumericField that is the TokenStream and setting the value. For LazyField that is the value. |
Uwe Schindler (@uschindler) (migrated from JIRA) There is one problem in backwards: AbstractField field = new Field(...) This will no longer work. |
Chris Male (migrated from JIRA) Addressed Uwe's issue again. Only solution is to change Field to extend AbstractField again, even though AbstractField is dead code. Also fixed:
|
Chris Male (migrated from JIRA) Spent more time pondering how to deprecate Fieldable and the major issue is Document#getFields(), which returns a modifiable List<Fieldable>. Because it is modifiable, consumers can add to it directly rather than through Document. If it were unmodifiable, then it would be possible to control adding Fieldables in Document, which would then allow us to wrap Fieldable instances in a Field subclass, meaning Document would only have a List<Field>. Given this limitation, I'm currently thinking about not trying to deprecate Fieldable in 3.x, but instead adding the unmodifiable list method and deprecating #getFields(). I would also add some functionality for removing Fieldables, which seems to be all thats lack in Document. Then in 4.x I would deprecate Fieldable. Slow process, but I think by deprecating AbstractField now we have already made a step forward to improving this hierarchy in preparation for the FieldType classes. Remaining in this work is a code cleanup of all 3 classes, so that code is understandable when we add in FieldType. |
Shai Erera (@shaie) (migrated from JIRA) How about if getFields() will return an Iterable<Field>? It'll need to be deprecated and a new method created, but it can be simple enough solution to avoid one adding fields directly to Document w/o passing through doc.add first. If you do end up removing getFields(), then I'd like to have a clear() on Document. I once proposed it and was told to call getFields().clear() instead, which allows me to reuse my Document instance. If I lose that option, I'd appreciate if a direct clear() on Document will exist instead. |
Chris Male (migrated from JIRA) Hi Shai,
I'm not really in favour of Iterable in this case as it inhibits people calling .size(). Knowing how many fields a Document has might be useful. But I agree it would certainly prevent people from adding new Fields and is a simple solution. Do you think getting the number of fields has a use case at all?
Yes. I will certainly add clear(). Do you think its useful to support removing specific fields as well i.e. remove(String fieldName)? |
Shai Erera (@shaie) (migrated from JIRA) I can't think of why one would want to know how many fields a Document has. But even then, we can add a numFields() method to Document. The Iterable thingy will just prevent a lot of unnecessayy work I think. About remove(String fieldName), I don't see a case for that as well. A Document is a point in time object - you parse a document, extract metadata, converts the entire flow to a Lucene object (in this case Document) and passes it along to IW. You then move on to the next document. So I wouldn't know why would someone want to remove a field from a Document ... but perhaps I just haven't run into the use case yet. Iterable btw will at least alert the user if he did that, because a ConcurrentModificationException will be thrown (if we're using a List-based iterator) ... |
Chris Male (migrated from JIRA) Haha, i whole heartedly agree with all your points. One option instead of adding another method to return Iterable<Fieldable>, is for Document to implement Iterable<Fieldable> and to return a unmodifiable Iterator. This would then fit nicely with future ideas I had about providing iterators based on FieldType criteria, allowing the IW to then retrieve an iterator of only those fields which are to be indexed for example. |
Shai Erera (@shaie) (migrated from JIRA) i like the idea of Document to implement Iterable, but how does that solve the case where someone wants to query how many fields a document has? Will you still have getFields(), only now it will return an unmodifiable collection? I guess the unmod collection can be returned even today, right? BTW, what happens if getFields() return an unmod collection, but someone calls doc.add(Field)? I think the unmod collection prevents you from adding to that collection wrapper, but not for that collection to be changed from under the hood? If that's true, then that could cause some trouble ... so getFields() will really return a snapshot of Document, which means we need to clone Fields ... Gets too complicated no? Maybe just do: (1) Doc implements Iterable and (2) Doc exposes numFIelds(), add(Field)? About remove(field), I thought of a possible scenario though I still don't think it's interesting enough - suppose that you pass your Document through a processing pipeline/chain, each handler adds fields as metadata to the Document. For example, annotators. It might be that a field A exists, only for a handler down the chain to understand A's meaning and then replace it w/ A1 and A2. For that you'll want to be able to move a field ... I guess we could add a remove method to Document, and if it'll be called while the fields are iterated on, a CME will be thrown, which is perfectly fine with me. |
Chris Male (migrated from JIRA) Hi Shai,
It doesn't, but then I'd add a numFields() method maybe. It seems like something with a small use case and so having it has a method on the side seems ideal.
Yes and no. getFields will remain but with a modifiable list. I will then deprecate the method and recommend people use the Iterable. This gives everybody a chance to migrate during the 3.x versions.
Yup lets do that. Unfortunately getFields will remain until 4.0.
With the idea of having remove(...) I am trying to foresee what people might be doing via getFields() and thus are not going to be able to do when its gone. We will have the ability to add and iterate, so having the functionality to remove seems to complete it. I completely agree that if something happens and a CME is thrown, then that problem should be left to the user. I think this clarifies this direction. Document will be changed as follows:
Cheers Shai! |
Shai Erera (@shaie) (migrated from JIRA) I'm sorry for the confusion - I got used to all the deprecation discussions so much that it's embedded in my replies :) - when I wrote "instead getFields" I meant that it will be deprecated, and we'll carry it w/ us until 4.0 is out. So overall we agree on the changes that need to be made. BTW, when you deprecate a method, you usually change it to call the new API or change it to use the new data structures or whatever. So we need to think how to impl getFields such that if one calls remove, numFields or use the iterator on an interleving manner, his code doesn't break ... I don't think it should be hard but it might be a good idea to even write such (deprecated) unit test |
Chris Male (migrated from JIRA)
I'm not sure we have to change getFields. We can just deprecate it, and point people to the new methods. I think it'd be more effort than its worth to create a List impl that calls the new methods. Was that what you were implying? I do agree its worth writing a test to ensure all old functionality can be done via the new methods somehow. |
Shai Erera (@shaie) (migrated from JIRA) That was usually the approach. You provide new methods, deprecate old ones, however both work and not in a XOR mode. Both should work and we need to ensure that if people call both they still function properly. Unless this has changed, in which case it should be clearly documented. But I don't think it is a big problem to support both? If Document still keeps its fields in a List then all should remain the same. We could have a 4.0 note to switch to a Map based DS to better support remove, but that's questionable because we'll need to maintain ordering on the fields (the order in which they inserted) though personally I don't think it should matter much to the user, however that's the current implementation. |
Chris Male (migrated from JIRA) I recommend we keep it as a List since that facilitates having different iterators by FieldType criteria more. A Map would support get and remove better, but I think we want to move people to using Iterators and the remove method is there for a case we don't know of yet. I'll create a patch with these ideas shortly. Cheers! |
Shai Erera (@shaie) (migrated from JIRA) I agree. Then keeping both deprecated and new API should be supported easily. |
Chris Male (migrated from JIRA) New version of the AbstractField deprecation patch which cleans up Field so its more readable, deferring documentation to Fieldable where possible in preparation for deprecating most the methods in Fieldable. |
Chris Male (migrated from JIRA) Attaching patch for changes to Document implementing above ideas discussed above
Having not paid attention earlier, there is already a removeFields method in Document, so it didn't need to be added. There are some greater problems with Document related to this issue. It provides methods for both Fieldables and Fields, getField(String) for example, could throw a ClassCastException if used in combination with AbstractField. Because the overall goal of this work is to reduce these 3 classes to just Field in 4.x, I think we can leave the methods as most will be deprecated. However some documentation improvements might prevent any CCEs. I'll spin off another issue later to do that once this stuff has been committed. |
Chris Male (migrated from JIRA) Attaching patch that migrates core (including tests) away from Document#getFields(). |
Uwe Schindler (@uschindler) (migrated from JIRA) I would make the iterator() unmodifiable (without remove()), so map the list before with Collections.umodifiableList(). Any comments? |
Chris Male (migrated from JIRA) Hi Uwe, Yes you are right. That was one of the original motivations for this work! I will make the appropriate change now. |
Chris Male (migrated from JIRA) Attached new version of Document change which makes the Iterator unmodifiable. Noted this fact in the javadoc for the iterator() method. |
Chris Male (migrated from JIRA) The patch I just attached makes a test in InstantiatedIndex fail because it tries to remove a field through the iterator. This illustrates a concern I have with making the iterator unmodifiable, there is no efficient way to remove a field while iterating, without also running into a CCE. I will think over this issue a little bit before deciding whether to continue with the unmodifiable iterator or not. |
Michael McCandless (@mikemccand) (migrated from JIRA) I wonder if, like we're thinking for analysis, we can substantially Really all indexer needs to do is:
The iterators would be unmodifiable. Indexer could care less how these are implemented. Lucene would still provide a fully featured concrete Document/Field Indexer would not see that class. It would only see the above minimal |
Chris Male (migrated from JIRA) +1 I think this is a great idea. Having a clear separation from the 'core space' where the indexer sits, and the 'user space' gives us the freedom to address the Document/Fieldable class stack problems. We could introduce an IndexableDocument abstraction (which Document would implement) which provides the stripped down API that the indexer needs for Documents, which is just a way to get the fields to index and store. Given that the indexer need only know the name, type and value/tokenstream for a Field, we can visualize this as just a tuple. Therefore the IndexableDocument, rather than returning Fields, could return instances of a new Tuple class, which would be immutable. Field would then implement Tuple. Document and Field would then be exactly as suggested, 'user space' app friendly mutable classes. On the search side, the 'core space' wouldn't know of the idea of a Document, but rather a set of tuples. We would then allow different consumers of those tuples to be provided. One such consumer would use the tuples to build Documents and Fields again. |
Earwin Burrfoot (migrated from JIRA) Yahoo! +1 for introducing barebones interface for the indexer. |
Shai Erera (@shaie) (migrated from JIRA) +1 for this simplification. Can we just name it Indexable, and omit Document from it? That way, it's both shorter and less chances for users to directly link it w/ Document. One thing I didn't understand though, is what will happen to ir/is.doc() method? Will those be deprecated in favor of some other class which receives an IR as parameter and knows how to re-construct Indexable(Document)? |
Chris Male (migrated from JIRA)
+1
I imagine some sort of callback which the IR/IS call with the information they take from the index. One default implementation we could provide would re-construct Indexables, and one could re-construct Documents for the userspace. |
Earwin Burrfoot (migrated from JIRA) Can't we keep interfaces for indexing and retrieval separate? |
Shai Erera (@shaie) (migrated from JIRA) Right Earwin - agreed. I'd like to summarize a brief discussion we had on IRC around that: BTW, I'm not even sure getIndedxedFields can be efficiently supported today. Just listing it here for completeness. |
Chris Male (migrated from JIRA) Earwin, Yeah definitely we want to keep things separate. I am infavour of Indexable being an indexing only construct, staying in the index part of the core space. However Document will be a user space construct that spans both indexing and searching since its API is designed to be flexible and easy to use and manipulate. Writing implements of Indexable will be an expert thing I imagine, so if you index bytes taken from a socket, then I imagine you'll choose an implementation on the search side that can handle it. If your stream from a socket has some custom stuff, then you could write a custom consumer on the search side that builds whatever search construct you want. |
Uwe Schindler (@uschindler) (migrated from JIRA) I am also +1 on the indexer interface. I just repeat myself: We still need TokenStream, an AttributeSource alone is too less. But that is away from that issue: Indexable provides an iterator of fields that consist of name and TokenStream and some options (possibly like omitNorms). If you just dont want to have close() in TokenStream, let's remove it. end() is needed for offsets, the indexer need to take care. incrementToken() is the iterator approach. What else is there? Reset may be invisible to indexer (I would refactor that and would make a subclass of TokenStream that supports reset, ResetableTokenStream - like Tokenizer supports reset(Reader), which is also a subclass). The abstract TokenStream then is only consisting of incrementToken() and end() + the AttributeSource access methods. Attributes needed by indexer are only TermToBytesRefAttribute, PositionIncrementAtt, OffsetAttribute and PayloadAttribute. |
Chris Male (migrated from JIRA) Hi Uwe, I envisaged the Indexable providing Tuples of field name, field type, and TokenStream as it is today. So I think we shouldn't run into any problems? |
Uwe Schindler (@uschindler) (migrated from JIRA) Yeah! |
Chris Male (migrated from JIRA) Reactivating this issue. New patch against 3x (since its all about deprecations). AbstractField and Fieldable are deprecated. Document is changed a little in preparation for using Field. Code cleanup in Field. I'm going to leave the suggested larger changes to Document and Field for a later issue. I really want to get these classes out of trunk so I can then visualize the code better. |
Simon Willnauer (@s1monw) (migrated from JIRA) Hey Chris, good that you reactivate this issue! I was looking into similar stuff while working on docvalues since it really needs to add stuff to Field / Fieldable. With a cleanup and eventually FieldType this would be way less painless I guess. I have a couple of questions and comments to the current patch.
thanks for bringing this up again! |
Chris Male (migrated from JIRA) Thanks for taking a look at this Simon.
Because for me this issue is about reducing the complexity of these classes and Field is a mess. Making it more readable reduces the complexity. If needs be I will do this in two patches, but I don't feel this issue is resolved till the code in Field is readable.
I don't really understand what you're suggesting here. In 3x where the deprecations will be occurring Field has to continue to extend AbstractField. Yes in 4.0 we can drop that extension but addressing the deprecations is not in the scope of 3x.
Yeah good call. I think implementing Iterable<Field> is best, but it will also require adding a count() method to Document since often people retrieve the List to get the number of fields.
Once FieldType is in, all the various metadata properties (isIndexed, isStored etc) will be moved to FieldType, leaving Field as what you suggest as FieldValue. Field will contain its type, boost, name, value. If we have Analyzers on FieldTypes, then we will be able to remove the TokenStream from Field.
Yeah but not in 3x unfortunately. As it stands people can retrieve the List of Fieldables via getFields() and add whatever implementation of Fieldable they like. Consequently we need to continue to support Fieldable in IW for example. Once this code has been committed I will create a new patch for trunk which moves all of Solr and Lucene over to the Field. I could do this in many places already of course, but that core classes like IW would have to remain as they are. I will wait for your thoughts on the reformating and then make a new patch. |
Simon Willnauer (@s1monw) (migrated from JIRA)
What I mean here is that if I would simply remove the extends AbstractField from Field would it still compile or are there any dependencies from AbstractField? IMO AbstractField should just be empty now right? |
Chris Male (migrated from JIRA) Yes Field would still compile if you removed the extends. However if we empty AbstractField then any client code that also extends AbstractField would break. Thats why I deprecate the whole class but leave its code in. We could empty it and change it to extend Field, I think that would still work. |
Simon Willnauer (@s1monw) (migrated from JIRA)
So, what is the reason for doing this in 3.x at all, can't we simply drop stuff in 4.0 and let 3.x alone? Simon |
Chris Male (migrated from JIRA)
Very good question. Certainly we are simplifying the codebase and I feel that Field is what most users use (not AbstractField). But I know some expert users do use AbstractField. But maybe they can handle the hard change? |
In order to move field type like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality that storing fields, most of which are being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possible Fieldable), moving much of the functionality into Field and FieldType.
Migrated from LUCENE-2310 by Chris Male, resolved Mar 20 2012
Parent: #3384
Attachments: LUCENE-2310.patch, LUCENE-2310-Deprecate-AbstractField.patch (versions: 3), LUCENE-2310-Deprecate-AbstractField-CleanField.patch, LUCENE-2310-Deprecate-DocumentGetFields.patch (versions: 2), LUCENE-2310-Deprecate-DocumentGetFields-core.patch
The text was updated successfully, but these errors were encountered: