Implement standard Serialization across Lucene versions [LUCENE-1473] #2547
Comments
Jason Rutherglen (migrated from JIRA) LUCENE-1473.patch Term implements Externalizable. Added serialVersionUID handling in the read/writeExternal methods. The long values need to be written as variable-length longs to reduce the size of the resulting serialized bytes. If it looks ok, I will implement Externalizable in other classes.
Michael McCandless (@mikemccand) (migrated from JIRA) Do we really need to write the serialVersionUID? That's adding 8 bytes to the storage of each term. The term storage is not particularly efficient when storing many terms in the same field, because e.g. the String field is not written as an intern'd string. Also I see many tests failing with this, e.g. TestBoolean2 – I think we'll have to add: public Term() {} so deserialization can work? Which is then sort of annoying because it means it's possible to create a Term with null field & text (though, you can do that anyway by passing in "null" for each).
Jason Rutherglen (migrated from JIRA) The serialVersionUID needs to be written if the class is going to evolve. It's written now, and currently in default serialization the field names are also written. We'll need empty constructors.
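For illustration, a minimal sketch of the kind of Externalizable Term being discussed above, with the public no-arg constructor Mike mentions and an explicit one-byte version marker; the names and exact stream layout are assumptions for the example, not the contents of the attached patch:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical sketch of a Term-like class; not the actual org.apache.lucene.index.Term.
public class Term implements Externalizable {
  // explicit version marker so the stream format can evolve deliberately
  static final byte SERIAL_VERSION = 1;

  private String field;
  private String text;

  // public no-arg constructor is required for Externalizable deserialization
  public Term() {}

  public Term(String field, String text) {
    this.field = field;
    this.text = text;
  }

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeByte(SERIAL_VERSION);
    out.writeUTF(field);
    out.writeUTF(text);
  }

  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    byte version = in.readByte();
    if (version > SERIAL_VERSION) {
      throw new IOException("Term was serialized by a newer version: " + version);
    }
    field = in.readUTF();
    text = in.readUTF();
  }
}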
Mark Miller (@markrmiller) (migrated from JIRA) I share Michael's concerns. What's the motivation for core Lucene classes supporting serialization, and is it strong enough to warrant these changes? It comes with a cost even without the mentioned annoyances, right? (which are only for the current class; there may be more?)
Jason Rutherglen (migrated from JIRA) Lucene supports serialization explicitly by implementing Serializable.
Michael McCandless (@mikemccand) (migrated from JIRA)
Can we use a byte, not long, for serialVersionUID? And maybe change its name to SERIAL_VERSION? (I think serialVersionUID is completely unused once you implement Externalizable?) This brings up another question: what's our back compat policy here? For how many releases after you've serialized a Term can you still read it back? This is getting complicated... I'm wondering if we shouldn't even go here (i.e., make any promise that something serialized in release X will be deserializable on release Y). I also think serialization is better done "at the top" where you can do a better job encoding things. E.g. this is the purpose of the TermsDict (to serialize many Terms, in sorted order). Jason, what's the big picture use case here; what are you serializing?
Mark Miller (@markrmiller) (migrated from JIRA)
Right, but we don't really support it (like many/most I would guess). There is a pain in the butt cost of support. Since this patch seems to push that pain around, I'm just wondering if the motivation for it is worth the cost (not knowing the motivation).
Jason Rutherglen (migrated from JIRA) Currently serialization between 2.3 and 2.4 is broken; backwards compatibility is broken. Saying "we don't really support it" means Serializable will not be implemented in any classes. This is really simple Java stuff that I'm surprised is raising concern here.
Doug Cutting (@cutting) (migrated from JIRA) The attached patch optimizes java serialization. Also, if we want java serialization to work cross-version, it gives us a leg to stand on. It doesn't change anything for most Lucene users, since Lucene doesn't use java serialization except for RMI. So, today, if you're using RemoteSearchable, and the local and remote versions have different versions of Term.java, things will probably fail, since, by default, the serialVersionUID is a hash of the method signatures and fields. If we want RemoteSearchable to work cross-version, then we need to explicitly manage the serialVersionUID and readExternal/writeExternal implementations. But is that really a priority? As with all optimizations, the performance improvement should be measured and demonstrated to be significant. Also, an invasive change like this is hard to justify when so little of Lucene depends on java serialization.
John Wang (migrated from JIRA) The fact that an object implements Serializable implies this object can be serialized. It is known good Java programming practice to include a suid in the class (as a static variable) when the object declares itself to be Serializable. If it is not meant to be serialized, why did it implement Serializable? Furthermore, what is the reason to avoid it being serialized? I find the "cost of support" reason kind of ridiculous; that reasoning can be applied to any bug fix, because at the end of the day this is a bug. I don't understand the issue of "extra bytes" in the term dictionary if the Term instance is not actually serialized to the index (at least I really hope that is not done). The serialVersionUID (suid) is a long because it is a Java thing. Here is a link to some information on the subject: Use case: deploying Lucene in a distributed environment, we have a broker/server architecture (standard stuff), and we want to roll out search servers with Lucene 2.4 instance by instance. The problem is that the broker is sending a Query object to the searcher via Java serialization at the server level, and the broker is running 2.3. And because of specifically this problem, 2.3 brokers cannot talk to 2.4 search servers even when the Query object was not changed. To me, this is a very valid use-case. The problem was two different people did the release with different compilers. At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of open-source nature of this project and its unwillingness to work with the developer community. This is a rather trivial issue, and it is taking 7 back-and-forths to reiterate some standard Java behavior that has been around for years. Lucene is a great project and has enjoyed great success, and I think it is in everyone's interest to make sure Lucene grows in a healthy environment.
Jason Rutherglen (migrated from JIRA) The Spring framework http://www.springframework.org/ is a good example of a widely used open source Java project that implements and uses Serialization in most of its classes. If Serialization will not be fixed in Lucene then perhaps it's best to implement a serializable wrapper in the Spring project for Lucene.
Michael McCandless (@mikemccand) (migrated from JIRA)
Whoa! I'm sorry if my questions are giving this impression. I don't intend that. But I do have real questions, still, because I don't think this is as simple as it sounds.
OK that is a great use case – thanks. That helps focus the many questions here.
But that alone gives a too-fragile back-compat solution. Consider this actual example: RangeQuery, in 2.9, now separately...
> The problem was two different people did the release with different compilers.
> If it is not meant to be serialized, why did it implement Serializable. Because there are two different things it can "mean" when a class implements Serializable. The first meaning (let's call it "live serialization") is: "within the same version of Lucene, you can serialize an instance in one JVM and deserialize it in another". The second meaning (let's call it "long-term persistence") is: "you can serialize an instance now and deserialize it later with a different version of Lucene". Lucene, today, only guarantees "live serialization", and that's the default you get from Java serialization. But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order.
These issues are the same issues we face with the index file format, where we do make back-compatibility promises.
robert engels (migrated from JIRA) I don't see why you can't just break compatibility between versions when dealing with Serialization. Just have it continue to mean live (or close to live) persistence. Even the JDK does this (e.g. Swing serialization makes no guarantees). Just do the same - bound to change between releases... Also, different compilers will generate different SUIDs... usually due to synthetic methods. It's kind of a problem...
Chris M. Hostetter (@hossman) (migrated from JIRA) For the record: I have limited understanding of Java serialization issues...
The developer community consists of hundreds (possibly thousands) of people, who participate at various levels. At the time the above quoted comment was posted, 4 members of the community had expressed an opinion on this issue: 1 clearly in favor, and 3 questioning the advantages and disadvantages as they affect the whole community, both in terms of the performance impacts for existing use cases, and the long term support issues that might come from a change like this. How anyone could consider these comments and questions "unwillingness to work with the developer community" blows my mind ... I do not see an overwhelming push by the community at large for a feature, I do not see a "Lucene powerhouse" arguing against the will of the masses ... I see two people arguing in favor of a change, and three questioning whether this change is a good idea (I'm not sure if I understand Robert's post fully; I believe he's suggesting we maintain the status quo such that serialization is supported but no claims are made about back-compatible serialization). I would define that as healthy discussion. This, to me, seems to be the crux of the issue...
...this jibes with even my limited experience with Java serialization, and I share Michael's concerns. The current behavior does not appear to be a bug; it appears to be the correct effect of a serializable class changing between two versions without any explicit policy on serialization compatibility. The changes being discussed seem to be a request for a new "feature": a back-compat commitment on the serialization of one (or more) classes. However small the patch may be, it's still a significant change that should not be made lightly ... I certainly think all 4 of Michael's bullet points should be clearly addressed before committing anything, and I agree with Doug's earlier observation regarding performance: we need to test that a new serialization strategy won't adversely affect existing users who rely on (what Michael referred to as) "live serialization". This is my opinion. I voice it not as a member of any sort of mythical "Lucene powerhouse" but as a member of the Lucene community who is concerned about making sure that decisions are made towards "everyone's interest to make sure Lucene grows in a healthy environment" – not just the interests of two vocal people.
Jason Rutherglen (migrated from JIRA) > the performance improvement should be measured and demonstrated to be significant The initial concern was the incompatibility of serialized objects between Lucene versions. The performance improvements created by using Externalizable are secondary, and so providing tests would be a waste of time if the "community" believes it is too much effort to add 1 line of code to a handful of classes. Implementing Externalizable is a way to reduce the size of the serialized objects, manage the serialized object versions, and provide performance improvements. Externalizable provides the most benefits and is very similar to the system Hadoop uses with Writable. Externalizable works seamlessly with native object serialization and Serializable-implementing classes, meaning it works with a number of existing Java classes in addition to Externalizable classes. Using distributed serialized objects for search in Lucene is a natural Java-based way to run a Lucene system. In many cases it is ideal because Java provides something C++ does not, dynamic in-process class loading. In a large grid-based search system that requires 100% uptime this feature can be particularly useful. Adding a serialVersionUID to the classes is one option, adding Externalizable is another option. If the decision is to not support Serialization in Lucene then I recommend removing Serializable from all classes in Lucene 3.0 so that users do not mistakenly expect the search library to behave the way other Java libraries such as ICU4J, JDK class libraries, Spring, etc. do.
Doug Cutting (@cutting) (migrated from JIRA) > But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order. That's the crux, alright. Does Lucene want to start adding cross-version guarantees about the durability of its objects when serialized by Java serialization? This is a hard problem. Systems like Thrift and ProtocolBuffers offer support for this, but Java Serialization itself doesn't really provide much assistance. One can roll one's own serialization compatibility story manually, as proposed by this patch, but that adds a burden to the project. We'd need, for example, test cases that keep serialized instances from past versions, so that we can be sure that patches do not break this. The use case provided may not use RMI, but it is similar: it involves transmitting Lucene objects over the wire between different versions of Lucene. Since Java APIs, like Lucene, do not generally provide cross-version compatibility, it would be safer to architect such a system so that it controls the serialization of transmitted instances itself and can thus guarantee their compatibility as the system is updated. Thus it would develop its own representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?
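As a sketch of the decoupling Doug describes, an application-owned query description that is mapped onto Lucene's Query only at the edge could look like the following; the class and field names here are assumptions for illustration, not an actual Lucene or Solr API:

import java.io.Serializable;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Application-owned, version-stable description of a query.
// Only this class travels over the wire; Lucene's Query never does.
public class TermQuerySpec implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String field;
  private final String text;

  public TermQuerySpec(String field, String text) {
    this.field = field;
    this.text = text;
  }

  // Mapping to Lucene happens on the receiving side, against whatever
  // Lucene version that node happens to run.
  public Query toLuceneQuery() {
    return new TermQuery(new Term(field, text));
  }
}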
Doug Cutting (@cutting) (migrated from JIRA)
The "implements Serializeable" was added to support RemoteSearchable. If we believe this creates a bug, then perhaps we should remove this and implement RemoteSearchable in another way. As it stands, Lucene does not support Java Serialization across Lucene versions. That seems to me more like a limitation than a bug, no? Every line of code added to Lucene is a support burden, so we must carefully weigh the costs and benefits of each line. This issue proposes to add many lines, and to add a substantial new back-compatibility requirement. Back-compatibility is something that Lucene takes seriously. We make promises about both API back-compatibility and file-format back-compatibility. These already significantly constrain development. Adding a new back-compatibility requirement should not be done lightly, but only after broad consensus is reached through patient discussion. |
robert engels (migrated from JIRA) In regards to Doug's comment about an alternate form... doesn't SOLR already have an XML based query format? If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible).
Mark Miller (@markrmiller) (migrated from JIRA)
There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: IsSerializable. Many complained right away, as they had other classes that perhaps they were making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or IsSerializable. Personally, I think it's fine to do as we are. I'm not against supporting more though. If we choose not to go further (and from what I can tell that decision has not been made yet, for or against), we should add to the javadocs what we support, as I don't think it's a bug myself. The Serializable interface indicates that the class and its subclasses will be Serializable; my reading of the javadoc does not indicate what cross-version compatibility must be supported. I believe that is up to the implementor.
Jason Rutherglen (migrated from JIRA) "This is a hard problem." I disagree. It's completely manageable. Doesn't Hadoop handle versioning inside of Writable classes? ScoreDocComparator javadoc: "sortValue(ScoreDoc i) Returns the value used to sort the given document. The object returned must implement the java.io.Serializable interface. This is used by multisearchers to determine how to collate results from their searchers." This kind of statement in the code leads one to believe that Lucene supports Serialization. Maybe it should be removed from the Javadocs. "Thrift and ProtocolBuffers" don't support dynamic class loading. If one were to create their own Query class with custom code, serializing is the only way to represent the Query object and have Java load the additional implementation code. One easy-to-see use case: if Analyzer were made Serializable, then indexing over the network and trying different analyzing techniques could be accomplished with ease in a grid computing environment. "representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?" Mike wrote "if we add field X to a class implementing Serializable, ..." There need to be "if statements" in readExternal to handle backwards compatibility. Given the number of classes, and the number of fields, this isn't very much work. Neither are the test cases. I worked on RMI and Jini at Sun and elsewhere. I am happy to put forth the effort to maintain and develop this functionality. It is advantageous to place this functionality directly into the classes because in my experience many of the Lucene classes do not make all of the field data public, and things like dedicated serialization such as the XML query code are voluminous. Also the half support of serialization right now seems to indicate there really isn't support for it. Hoss wrote: "sort of mythical 'Lucene powerhouse'". Many developers seem to be working outside of Lucene and choosing not to open source in order to avoid going through the current hassles of getting code committed to the project.
Jason Rutherglen (migrated from JIRA) "In regards to Doug's comment about an alternate form... doesn't SOLR already have an XML based query format? If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible)." SOLR does not have an XML based query format. XML is not ideal for distributed search because it is slow and verbose. There are many ways to serialize things; the issue is not in choosing one, but in supporting what most Java libraries do today, which is native to the Java platform.
Jason Rutherglen (migrated from JIRA) Mark: "There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: IsSerializable. Many complained right away, as they had other classes that perhaps they were making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or IsSerializable. Personally, I think it's fine to do as we are. I'm not against supporting more though." Externalizable and Serializable work interchangeably, a nice feature of Java. For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added. For ones that implement Externalizable, there is slightly more work, but not something someone with a year of Java experience can't maintain.
Doug Cutting (@cutting) (migrated from JIRA) > Doesn't Hadoop handle versioning inside of Writable classes? Currently, yes, but this is probably insufficient for a Hadoop 1.0 release. Hadoop is a distributed system, and would like to provide RPC back-compatibility across minor versions after we go 1.0. This is an explicit decision that's in the process of discussion within the Hadoop project. RPCs within Hadoop daemons will probably require identical versions – all daemons in a cluster must be upgraded in lockstep. We'll thus probably limit back-compatibility to client RPCs, so that a given client can talk to multiple clusters that are not running identical versions of Hadoop. Lucene has made no such explicit policy decision. Lucene is not an inherently distributed system. Hadoop has not yet decided what mechanism to use to support back-compatible RPC, but Writable versioning is not sufficient, since it does not handle RPC protocols, and it's lower-level than we'd prefer. We'll probably go with something more like Thrift or ProtocolBuffers. Hadoop does not use Java serialization and makes no promises about that. > The developer contributions seem to be quite low right now, especially compared to neighbor projects such as Hadoop. As one who monitors both projects, I don't see a marked difference. In both there are sometimes patches that unfortunately languish, because, while they're important to the contributor, they fail to sufficiently engage a committer. For example, HADOOP-3422 took over 6 months to get committed, probably because not many committers use Ganglia. There is a difference in quantity: hadoop-dev has over 4x the traffic of lucene-dev. But, normalized for that, the number of patches from non-committers feels comparable. If anything I'd guess Lucene commits more patches from non-committers than does Hadoop.
Doug Cutting (@cutting) (migrated from JIRA) The documentation should probably be fixed to state that Lucene's use of Serializable currently assumes that all parties are using the exact same version of Lucene. That's the default for Serializable, but it probably bears stating explicitly. Then we should decide, going forward, whether this should change, and, if so, for which classes and how. Such a policy should be agreed on before code is written, no?
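A sketch of how such a documentation note might read if added to the affected classes; the wording is illustrative, not an agreed project policy:

/**
 * This class implements {@link java.io.Serializable} only to support
 * "live" serialization, e.g. RMI between JVMs running the exact same
 * Lucene version. The serialized form is NOT guaranteed to be compatible
 * across Lucene releases; deserializing an instance written by a
 * different version may fail or behave unpredictably.
 */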
robert engels (migrated from JIRA) Jason, you are only partially correct. SOLR has XML definitions for updates/commits/deletes. It also supports string based queries. It also supports using XML for queries if you provide a handler, but the string syntax is simpler. As for the serialization performance, you are mistaken. For things like searches, the parsing time is extremely minor compared with the search time, so this additional overhead would be a fraction of the cost. When returning result sets, using XML can make a huge difference, as the overhead is paid on every item. Even still, with modern XML processors, the search time is still going to be the overriding performance factor by a huge margin. Typically "paged" results are also used, so again, the XML parsing compared to the network overhead is going to be minor. Still, if it is only for temporary serialization, binary works best - as long as it is Java to Java. We have a search server that uses Java serialization for the message passing, including the results. It can be done without any changes to Lucene - again the overriding performance factor is the search itself (unless the query returns 100k+ documents and all are being returned - then the Java serialization time can be more than the search itself).
Jason Rutherglen (migrated from JIRA) Robert: That doesn't sound like query serialization. SOLR has a binary protocol due to criticisms about XML being slow. I'm not sure why you and Doug are focusing on performance when that is not really the main issue I brought up. Also I'm confused as to why dynamic classloading is being ignored by you folks as a Java feature that a Java search library could take advantage of to differentiate itself in the search (closed and open source) marketplace.
robert engels (migrated from JIRA) The reason the XML is not needed is because the string format is robust enough, and is simpler... I am not focused on performance. It is just that Java serialization works well for temporary persistence. Other formats are better for long-term persistence. If you are only doing temporary persistence, you don't need backwards compatibility. Also, if the API is exposed to the world (i.e. non-Java), a human readable (or even binary) format that is not based on Java serialization is going to work much better. If you use a non-binary protocol it is far easier to extend it in the future, and retain the ability to easily read older versions. This is why Swing uses XML for serialization, and not binary. You could certainly use specialized class loaders to load old versions of classes in order to maintain the ability to read old, now incompatible classes... it is just a lot of work (maintenance too, need to keep the old code around... etc.) for not a lot of benefit. As for SOLR's binary protocol, fine, but it is probably for a fringe use case, or the submitter didn't do real world tests... The XML parsing is just not that much greater than binary (at least in Java, since it is the object creation they both use that affects it). The search time is going to be far greater. For large updates a binary loader can be more efficient than XML, but if you test it using real-world examples, I doubt you will see a huge difference - at least for the types of application TYPICALLY written using Lucene.
Doug Cutting (@cutting) (migrated from JIRA) > I'm not sure why you and Doug are focusing on performance when that is not really the main issue I brought up. In the description of this issue you claim, "This will make Serialization faster". I only added that, if this is a motivation, then it should be benchmarked and quantified. If it's not a motivation, and cross-version compatibility is the only motivation, then that should be clarified. > dynamic classloading is being ignored by you folks Perhaps most Lucene developers are not using dynamic class loading, just as most Lucene developers seem not to be relying on Java serialization. Perhaps you can convince them to start using it, but if not, you may need to find a way to get Lucene to support it that minimally impacts other users of Lucene. Adding methods to Term and Query that must be changed whenever these classes change adds a cost. If folks don't see much benefit, then that cost outweighs the benefit. Perhaps you can better enlighten us to the benefits rather than assert willful ignorance? Jini promised great things with dynamic class loading, but the list of folks that use Jini is not long (http://www.jini.org/wiki/Who_uses_Jini%3F).
Jason Rutherglen (migrated from JIRA) The discussion has evolved out of scope. Cross-version compatibility is the main goal. We have multiple versions of Lucene using the Spring RPC protocol. The standard way to solve this is to add a serialVersionUID like HashMap, String, and the other Java classes do. Performance is a topic of concern for Lucene users, and existing RMI/Serialization users would transparently benefit from Externalizable being used. I would like to implement this as a separate project; however, performing reflection on each of the objects is not an efficient approach. Writing wrapper code for each and every variation of Query is a waste of time when it implements Serializable and the communication is between Java systems. It seems best to remove Serialization from Lucene so that users are not confused, and create a better solution. > Perhaps most Lucene developers are not using dynamic class loading, Dynamic classloading is popular and accepted in J2EE servers. If done properly it is a very convenient way of deploying Java based systems. Jini did not make this convenient enough. Jini did not have a specific problem it was trying to solve and so was too complex and open ended. Spring and J2EE make use of Java serialization for distributed objects. People may not be using Lucene for this today, but this is due largely to lack of support for things like standard Serialization. With Lucene it is possible to make dynamic search classloading convenient in a search grid environment. When J2EE was designed, no one was using Java on the server side. A framework was composed, a handful of companies implemented the specification, and then it found its way into projects. If you are looking for users to ask for something that does not exist like this, it will not happen. The interface Lucene exposes is relatively static and known. All of the top level classes, Query, Analyzer, Document, Similarity, do not change very often. In a search based grid computing environment, the ability to execute arbitrary code against the cloud saves time and effort in deploying new code to servers. Restarting processes in a production environment is always expensive. If one is implementing, for example, mutating genetic algorithms using Java over Lucene, then it would be advantageous to dynamically load the classes that implement this. They would be modifications of several classes such as Query and Similarity. It is peculiar that so much effort goes into backwards compatibility of the index while for Serialization it is ignored. This is and will be very confusing to users, especially ones who use J2EE.
Mark Harwood (@markharwood) (migrated from JIRA) The contrib section of Lucene contains an XML-based query parser which aims to provide full coverage of Lucene queries/filters and provide extensibility to support 3rd party classes. Maintaining XML parsing code is an overhead but ultimately helps decouple requests from the logic that executes requests. In serializing Lucene Query/Filter objects we are dealing with the classes which combine both the representation of the request criteria (what needs to be done) and the implementation (how things are done). We are forever finessing the "how" bit of this equation, e.g. moving from RangeQuery to RangeFilters to TrieRangeFilter. The criteria however remain relatively static ("I just want to search on a range") and so it is dangerous to build clients that refer directly to query implementation classes. Cheers
Michael McCandless (@mikemccand) (migrated from JIRA)
I don't think that's the case. If we choose to only support "live serialization" then we should add "implements Serializable" but spell out clearly in the javadocs that there is no guarantee of cross-version compatibility ("long term persistence") and in fact that often there are incompatibilities. I think "live serialization" is still a useful feature.
Michael McCandless (@mikemccand) (migrated from JIRA)
The serialVersionUID approach would be too simplistic, because we can't simply bump it up whenever we make a change since that then breaks back compatibility. We would have to override write/readObject or write/readExternal, and serialVersionUID would not be used.
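For context, overriding the default stream format with Serializable looks roughly like the following minimal sketch (class and field names are assumptions); it shows the hand-managed format version that write/readObject would have to carry instead of relying on serialVersionUID:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example class; not an actual Lucene class.
public class Example implements Serializable {
  private static final long serialVersionUID = 1L; // kept stable forever

  private int value;

  // Format version written into the stream by hand; bumped when the
  // layout changes, while serialVersionUID stays the same.
  private static final byte FORMAT_VERSION = 2;

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.writeByte(FORMAT_VERSION);
    out.writeInt(value);
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    byte version = in.readByte();
    if (version >= 2) {
      value = in.readInt();
    } else {
      value = -1; // default for streams written by older format versions
    }
  }
}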
John Wang (migrated from JIRA) Mike: Serialization between VM1 and VM2 of class A is ok, just that A will not get the new fields. Which is fine since VM1 does not make use of it.
-John
Doug Cutting (@cutting) (migrated from JIRA) > Serialization between VM1 and VM2 of class A is ok, just that A will not get the new fields. Which is fine since VM1 does not make use of it. But VM1 might require an older field that the new field replaced, and VM1 may then crash in an unpredictable way. Not defining explicit suids is more conservative: you get a well-defined exception when things might not work. Defining suids but doing nothing else about compatibility is playing fast-and-loose: it might work in many cases, but it also might cause strange, hard-to-diagnose problems in others. If we want Lucene to work reliably across versions, then we need to commit to that goal as a project, define the limits of the compatibility, implement Externalizable, add tests, etc. Just adding suids doesn't achieve that, so far as I can see.
robert engels (migrated from JIRA) Even if you changed SUIDs based on version changes, there is the very real possibility that the new code CAN'T be instantiated in any meaningful way from the old data. Then what would you do? Even if you had all of the old classes, and their dependencies, available from dynamic classloading, it still won't work UNLESS every new feature is designed with backwards compatibility with previous versions - a burden that is just too great when required of all Lucene code. Given that, as has been discussed, there are other formats that can be used where isolated backwards persistence is desired (like XML based query descriptions). Even these won't work if the XML description references explicit classes - which is why designing such a format for a near limitless query structure (given user defined query classes) is probably impossible. So strive for a decent solution that covers most cases, and fails gracefully when it can't work. Using standard serialization (with proper transient fields) seems to fit this bill, since in a stable API most core classes should remain fairly constant, and those that are bound to change may take explicit steps in their serialization (if deemed needed).
John Wang (migrated from JIRA) The discussion here is whether it is better to have 100% of the time failing vs. 10% of the time failing. (These are just meaningless numbers to express a point.) If it is documented that Lucene only supports serialization of classes from the same jar, is that really enough? Doesn't it also depend on the compiler, if someone were to build their own jar? Furthermore, in a distributed environment with lots of machines, it is always ideal to upgrade bit by bit. Is taking this functionality away by imposing this restriction a good trade-off versus just implementing Externalizable for a few classes, if Serializable is deemed to be dangerous, which I am not so sure of given the Lucene classes we are talking about?
Doug Cutting (@cutting) (migrated from JIRA) > This discussion has digressed to general Java serialization design, where it originally scoped only to several lucene classes. Which classes? The existing patch applies to one class. Jason said, "If it looks ok, I will implement Externalizable in other classes." but never said which. It would be good to know how wide the impact of the proposed change would be.
John Wang (migrated from JIRA) For our problem, it is Query and all its derived and encapsulated classes. I guess the title of the bug is too generic. As far as my comment about other Lucene classes, one can just go to the Lucene javadoc, click on "Tree", and look for Serializable. If you want me to, I can go and fetch the complete list, but here are some examples:
For the top level API.
Jason Rutherglen (migrated from JIRA) LUCENE-1473.patch serialVersionUID added to the relevant classes manually. Defaulted to 10 because it does not matter, as long as it is different between versions. Thought of writing some code to go through the Lucene JAR, do an instanceof on the classes for Serializable, and then verify that the serialVersionUID is 10. Term implements Externalizable. SerializationUtils was adapted from WritableUtils of Hadoop for writing VLong. TestSerialization use case does term serialization and serializes an arbitrary query to a file and compares them. TODO:
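For illustration, a rough sketch of a SerializationUtils-style variable-length long codec; this uses a standard 7-bit varint and is not necessarily the exact encoding Hadoop's WritableUtils or the attached patch uses:

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Sketch of a vlong codec in the spirit of the SerializationUtils described above.
public final class SerializationUtils {

  private SerializationUtils() {}

  // Writes a non-negative long using 7 bits per byte; the high bit of
  // each byte marks "more bytes follow".
  public static void writeVLong(ObjectOutput out, long value) throws IOException {
    while ((value & ~0x7FL) != 0) {
      out.writeByte((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.writeByte((int) value);
  }

  public static long readVLong(ObjectInput in) throws IOException {
    long value = 0;
    int shift = 0;
    while (true) {
      int b = in.readByte() & 0xFF;
      value |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
      shift += 7;
    }
  }
}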
Michael McCandless (@mikemccand) (migrated from JIRA) SerializationUtils is missing from the patch.
Doug Cutting (@cutting) (migrated from JIRA) > How to write a unit test for multiple versions? We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version. This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort, and should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure.
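A minimal sketch of the "golden file" style test Doug describes, assuming JUnit 3 and an invented file location; serialized query instances written by an older release would be kept in the test tree and must still deserialize to objects equal to their current-version counterparts:

import java.io.File;
import java.io.FileInputStream;
import java.io.ObjectInputStream;

import junit.framework.TestCase;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Reads a query serialized by an older release and checks it against the
// current-version equivalent. The file name and path are assumptions.
public class TestQuerySerializationCompat extends TestCase {

  public void testOldTermQueryStillReadable() throws Exception {
    File golden = new File("src/test/serialized/2.4/termquery.ser");
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(golden));
    try {
      Query oldQuery = (Query) in.readObject();
      Query expected = new TermQuery(new Term("field", "text"));
      // relies on Query implementing equals(), as noted above
      assertEquals(expected, oldQuery);
    } finally {
      in.close();
    }
  }
}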
Jason Rutherglen (migrated from JIRA) LUCENE-1473.patch Added Externalizable to Document, Field, AbstractField (as compared to the previous patch). SerializationUtils is included. TODO:
Jason Rutherglen (migrated from JIRA) Doug wrote: We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version. This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort, and should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure. Makes sense. I guarantee 2.9 and above classes will be backward compatible with the previous classes. I think that for 3.0 we'll start to create new replacement classes that will not conflict with the old classes. I'd really like to redesign the query, similarity, and scoring code to work with flexible indexing and allow new algorithms. This new code will not create changes in the existing query, similarity, and scoring code, which will remain serialization compatible with 2.9. The 2.9 query, similarity, and scoring should leverage the new query, similarity, and scoring code to be backwards compatible.
Doug Cutting (@cutting) (migrated from JIRA) > I guarantee 2.9 and above classes will be backward compatible with the previous classes. It sounds like you are personally guaranteeing that all serializable classes will be forever compatible. That's not what we'd need. We'd need a proposed policy for the project to consider in terms of major and minor releases, specifying forward and/or backward compatibility guarantees. For example, we might say, "within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle". Similarly detailed statements would need to be made for each Externalizable, no?
Jason Rutherglen (migrated from JIRA) LUCENE-1473.patch Added some more Externalizables. o.a.l.util.Parameter is peculiar in that it implements readResolve to override the serialization and return a local object to emulate enums. I haven't figured out the places this is used and what the best approach is to externalize them. TODO:
Doug wrote: "within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle. Similarly detailed statements would need to be made for each Externalizable, no?" Serialized objects in minor releases will work. Serialized objects of older versions starting with 2.9 will be compatible with newer versions. New versions will be compatible with older versions on a class by class basis defined in the release notes. It could look something like this: Serialization notes:
Michael Busch (migrated from JIRA) I really wouldn't want to add another backwards-compatibility burden here. But maybe we can come up with a fair compromise. What if we introduce a custom reader hook like this:
public abstract class CustomExternalizableReader {
public abstract void readExternal(Object obj, ObjectInput in)
throws IOException, ClassNotFoundException;
}
and add a package-private, static variable of this type to a class that implements Externalizable:
public class SomeClass implements Externalizable {
private int one;
private int two;
...
static CustomExternalizableReader extReader = new CustomExternalizableReader() {
public void readExternal(Object obj, ObjectInput in) throws IOException,
ClassNotFoundException {
SomeClass s = (SomeClass) obj;
long uid = in.readLong();
if (uid != serialVersionUID) {
throw new IOException("Wrong serialVerionUID: " + uid);
}
int one = in.readInt();
int two = in.readInt();
s.init(one, two);
}
};
// initialization method for readExternal
void init(int one, int two) {
this.one = one;
this.two = two;
}
Note that I also specified an init() method. Since both init() and extReader are package-private, they stay out of the public API. Now if in the next version of this class we add a new variable 'three':
public class SomeClassNewVersion implements Externalizable {
private int one;
private int two;
private int three;
static final long serialVersionUID = 2L;
public void readExternal(ObjectInput in) throws IOException,
ClassNotFoundException {
extReader.readExternal(this, in);
}
public void writeExternal(ObjectOutput out) throws IOException {
out.writeLong(serialVersionUID);
out.writeInt(one);
out.writeInt(two);
out.writeInt(three);
}
/**
* This reader can only read the externalized format created with the same
* version of this class. If backwards-compatibility is desired, a custom
* reader has to be implemented.
*/
static CustomExternalizableReader extReader = new CustomExternalizableReader() {
public void readExternal(Object obj, ObjectInput in) throws IOException,
ClassNotFoundException {
SomeClassNewVersion s = (SomeClassNewVersion) obj;
long uid = in.readLong();
if (uid != serialVersionUID) {
throw new IOException("Wrong serialVerionUID: " + uid);
}
int one = in.readInt();
int two = in.readInt();
int three = in.readInt();
s.init(one, two, three);
}
};
void init(int one, int two, int three) {
this.one = one;
this.two = two;
this.three = three;
}
Now if someone tries to deserialize an object that was written with the old version, this default reader throws an IOException. But the user can simply implement their own, backwards-compatible reader:
// Now the user implements their own backwards compatible reader
SomeClassNewVersion.extReader = new CustomExternalizableReader() {
public void readExternal(Object obj, ObjectInput in) throws IOException,
ClassNotFoundException {
SomeClassNewVersion c_new = (SomeClassNewVersion) obj;
long uid = in.readLong();
int one = in.readInt();
int two = in.readInt();
int three;
if (uid == 1) {
// old version - initialze with default value
three = -3;
} else {
// new version
three = in.readInt();
}
c_new.init(one, two, three);
}
};
With this approach we have to clearly document in the javadocs the compatibility we guarantee by default. The nice thing here is that we allow backwards-compatibility, but push the effort of maintaining it to the users who need it. I coded this all up as an example that I'm attaching here. Let me know what you think.
Michael McCandless (@mikemccand) (migrated from JIRA)
It very much still is, as I'm learning with #2532! Your first example is missing the read/writeExternal methods. I think the proposed approach is rather heavy-weight – we will have to maintain these extra reader hooks for every class. I guess my feeling is all of that is a good amount more work than just versioning the stream directly in read/writeExternal. E.g., to do your example in a future world where we do support back compatibility: first a util class for managing versions:
public class Versions {
private int current;
int add(String desc) {
// TODO: do something more interesting with desc
return ++current; // pre-increment so 'current' always equals the latest defined version id
}
void write(ObjectOutput out) throws IOException {
// TODO: writeVInt
out.writeByte((byte) current);
}
int read(ObjectInput in) throws IOException {
// TODO: readVInt
final byte version = in.readByte();
if (version > current)
throw new IOException("this object was serialized by a newer version of Lucene (got " + version + " but expected <= " + current + ")");
return version;
}
}
Then, someone creates SomeClass:
public class SomeClass implements Externalizable {
private int one;
private int two;
private static final Versions versions = new Versions();
private static final int VERSION0 = versions.add("start");
public SomeClass() {};
public void writeExternal(ObjectOutput out) throws IOException {
versions.write(out);
out.writeInt(one);
out.writeInt(two);
}
public void readExternal(ObjectInput in) throws IOException {
versions.read(in);
one = in.readInt();
two = in.readInt();
}
...
}
Then on adding field three:
public class SomeClass implements Externalizable {
private int one;
private int two;
private int three;
private static final Versions versions = new Versions();
private static final int VERSION0 = versions.add("start");
private static final int VERSION1 = versions.add("the new field three");
public SomeClass() {};
public void writeExternal(ObjectOutput out) throws IOException {
versions.write(out);
out.writeInt(one);
out.writeInt(two);
out.writeInt(three); // also write the new field added in VERSION1
}
public void readExternal(ObjectInput in) throws IOException {
int version = versions.read(in);
one = in.readInt();
two = in.readInt();
if (version >= VERSION1)
three = in.readInt();
else
// default
three = -3;
}
...
}
In fact I think we should switch to a Versions utils class for writing/reading our index files...
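As a rough illustration of that last idea (purely hypothetical; this is not how Lucene's actual index-format code is written), the same version-header pattern applied to a file might look like this:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Write the format version first, refuse files from newer versions,
// and branch on older versions when reading.
public class VersionedFileExample {
  // bump when the layout changes; keep reading older layouts below
  private static final int CURRENT_VERSION = 1;

  static void write(String path, int payload) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
    try {
      out.writeByte(CURRENT_VERSION);
      out.writeInt(payload);
    } finally {
      out.close();
    }
  }

  static int read(String path) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(path));
    try {
      int version = in.readByte();
      if (version > CURRENT_VERSION) {
        throw new IOException("written by a newer version: " + version);
      }
      return in.readInt();
    } finally {
      in.close();
    }
  }
}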
Michael Busch (migrated from JIRA)
Oops, I forgot to copy&paste it. It's in the attached patch file though.
Really? In case we go the Externalizable way anyway, then I think this is not much additional work. And, the code changes are straightforward when the class changes. What I think will be the most work is documenting and explaining the compatibility guarantees. But for the record: I'm totally fine with using Serializable and just documenting that we only guarantee "live serialization".
Doug Cutting (@cutting) (migrated from JIRA) Would it take any more lines of code to remove Serializable from the core classes and re-implement RemoteSearchable in a separate layer on top of the core APIs? That layer could be a contrib module and could get all the Externalizable love it needs. It could support a specific popular subset of query and filter classes, rather than arbitrary Query implementations. It would be extensible, so that if folks wanted to support new kinds of queries, they easily could. This other approach seems like a slippery slope, complicating already complex code with new concerns. It would be better to encapsulate these concerns in a layer atop APIs whose back-compatibility we already make promises about, no?
Wolf Siberski (migrated from JIRA) This seems to be the right way to go. The patch attached removes all dependencies on Serializable and Remote from the core and moves them to contrib/remote. I introduced a new interface RemoteSearcher
Wolf Siberski (migrated from JIRA) This patch removes all dependencies on Serializable and Remote from the core and adds contrib/remote as a replacement
Mark Miller (@markrmiller) (migrated from JIRA) Thanks Wolf, +1 on the change. This issue proposes to do the same thing: #2481
Doug Cutting (@cutting) (migrated from JIRA) Thanks, Wolf, this looks like a promising approach. Jason, John: would this sort of thing meet your needs? I'm not sure we can remove everything from trunk immediately. Rather we should deprecate things and remove them in 3.0. The removal of Serializable will break compatibility, so it must be well-advertised. HitCollector-based search should simply not be supported in distributed search. The Searchable API was designed for remote use and does not include HitCollector-based access. Weighting, and hence ranking, does not appear to be implemented correctly by this patch. An approach that might work would be to:
Jason Rutherglen (migrated from JIRA) To Wolf: Your patch looked like it was quite a bit of work, nice job! Restricting people to XML will probably not be suitable though. Some may want JSON or something that more directly encodes the objects. General: Deprecating serialization entirely needs to be taken to the java-user mailing list, as there are quite a number of installations relying on it. If this is something that overlaps with SOLR then it would be good for the SOLR folks to separate it out as a serialization library that could be used outside of the SOLR server. This would be a good idea for most of the SOLR functionality; otherwise there would seem to be redundant development occurring. I'll finish up the Externalizable patch once #2391 is completed (IndexReader.clone) as it is something that needs feedback and testing to ensure it's workable for 2.9, whereas Externalizable is somewhat easier.
Doug Cutting (@cutting) (migrated from JIRA) > shift the problem around but do not really solve the underlying issues That's the idea, actually, to shift it out of the core into contrib. We could use Externalizable there, with no XML. > Deprecating serialization entirely needs to be taken to the java-user mailing list as there are quite a number of installations relying on it. No, we make decisions on the java-dev mailing list. Also, it won't go away; folks might just have to update their code to use different APIs if and when they upgrade to 3.0.
Wolf Siberski (migrated from JIRA) Thanks to Doug and Jason for your constructive feedback. Let me first clarify the purpose and scope of the patch. IMHO, the discussion about Serialization in Lucene is not clear-cut at all. My opinion is that moving all distribution-related code out of the core leads to a cleaner separation of concerns and thus is better design. On the other hand, with removing Serializable we limit the Lucene application space at least a bit (e.g., no support for dynamic class loading), and abandon the advantages default Java serialization offers. Therefore the patch is to be taken as a contribution to explore the design space (as Michael's patch on custom readers explored the Serializable option), and not as a full-fledged solution proposal. > [Doug] The removal of Serializeable will break compatibility, so must be well-advertised. > [Doug] The Searchable API was designed for remote use and does not include HitCollector-based access. > [Doug] Weighting, and hence ranking, does not appear to be implemented correctly by this patch. > [Jason] Restricting people to XML will probably not be suitable though. > [Jason] It seems the alternative solutions to serialization simply shift the problem around but do not really solve With respect to speed, I don't think that encoding/decoding is a significant performance factor in distributed search, but this would need to be benchmarked. With respect to versioning, my patch still keeps all options open. What is more important, Lucene users can now decide if they need compatibility between different versions, and roll their own encoding/decoding if they need it. Of course, if they are willing to contribute and maintain custom serializers which preserve back compatibility, they can do it in contrib as well as they could have done it in the core. Custom serialization is still possible although the standard Java serialization framework can't be used anymore for that purpose, and I admit that this is a disadvantage.
Doug Cutting (@cutting) (migrated from JIRA) > Therefore the patch is to be taken as contribution to explore the design space [ ... ] Yes, and it is much appreciated for that. Thanks again! > Currently Searchable does include a HitCollector-based search method [ ... ] You're right. I misremembered. This dates back to the origin of Searchable. http://svn.apache.org/viewvc?view=rev&revision=149813 Personally, I think it would be reasonable for a distributed implementation to throw an exception if one tries to use a HitCollector. > We could either solve it along the line you propose, or revert to pass the Weight again instead of the Query. Without using an introspection-based serialization like Java serialization, it would be difficult to pass a Weight over the wire using public APIs, since most implementations are not public. But, since Weights are constructed via a standard protocol, the method I outlined could work.
Jason Rutherglen (migrated from JIRA) Won't be working on these and they're old.
To maintain serialization compatibility between Lucene versions, serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may be implemented in classes for faster performance.
Migrated from LUCENE-1473 by Jason Rutherglen, 1 vote, resolved Jan 24 2011
Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch (versions: 4), lucene-contrib-remote.patch