
Implement standard Serialization across Lucene versions [LUCENE-1473] #2547

Closed
asfimport opened this issue Dec 2, 2008 · 60 comments

To maintain serialization compatibility between Lucene versions, a serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may also be implemented in classes for faster serialization.


Migrated from LUCENE-1473 by Jason Rutherglen, 1 vote, resolved Jan 24 2011
Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch (versions: 4), lucene-contrib-remote.patch

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Term implements Externalizable. Added serialVersionUID handling in the read/writeExternal methods. The long needs to be variable-length encoded to reduce the size of the resulting serialized bytes.
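
As a sketch of one common variable-length long encoding (7 data bits per byte, high bit as continuation flag); the patch adapts Hadoop's WritableUtils, whose exact wire format may differ:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Illustrative only: a small value such as a serialVersionUID of 1
// occupies one byte on the wire instead of eight.
public final class VLongSketch {
  public static void writeVLong(DataOutput out, long value) throws IOException {
    while ((value & ~0x7FL) != 0) {
      out.writeByte((int) ((value & 0x7FL) | 0x80L)); // more bytes follow
      value >>>= 7;
    }
    out.writeByte((int) value); // final byte, high bit clear
  }

  public static long readVLong(DataInput in) throws IOException {
    long value = 0;
    int shift = 0;
    byte b;
    do {
      b = in.readByte();
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}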

If it looks ok, I will implement Externalizable in other classes.

Michael McCandless (@mikemccand) (migrated from JIRA)

Do we really need to write the serialVersionUID? That's adding 8 bytes to the storage of each term.

The term storage is not particularly efficient when storing many terms in the same field, because eg the String field is not written as an intern'd string.

Also I see many tests failing with this, eg TestBoolean2 – I think we'll have to add:

public Term() {}

so deserialization can work? Which is then sort of annoying because it means it's possible to create a Term with null field & text (though, you can do that anyway by passing in "null" for each).
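
To illustrate the constructor requirement (hypothetical class, not patch code): for an Externalizable class, ObjectInputStream invokes the public no-arg constructor first and only then calls readExternal(), so without it deserialization fails with an InvalidClassException ("no valid constructor").

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
  private int x, y;

  public Point() {}  // required: the runtime instantiates via this ctor

  public Point(int x, int y) { this.x = x; this.y = y; }

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeInt(x);
    out.writeInt(y);
  }

  public void readExternal(ObjectInput in) throws IOException {
    x = in.readInt();  // state is restored here, after construction
    y = in.readInt();
  }
}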

Jason Rutherglen (migrated from JIRA)

The serialVersionUID needs to be written if the class is going to evolve. It's written now, and currently in default serialization the field names are also written. We'll need empty constructors.

Mark Miller (@markrmiller) (migrated from JIRA)

I share Michael's concerns. What's the motivation for core Lucene classes supporting serialization, and is it strong enough to warrant these changes? It comes with a cost even without the mentioned annoyances, right? (And those are only for the current class; there may be more.)

Jason Rutherglen (migrated from JIRA)

Lucene supports serialization explicitly by implementing Serializable.

Michael McCandless (@mikemccand) (migrated from JIRA)

The serialVersionUID needs to be written if the class is going to evolve.

Can we use a byte, not long, for serialVersionUID? And maybe change its name to SERIAL_VERSION? (I think serialVersionUID is completely unused once you implement Externalizable?).

This brings up another question: what's our back compat policy here? For how many releases after you've serialized a Term can you still read it back? This is getting complicated... I'm wondering if we shouldn't even go here (ie, make any promise that something serialized in release X will be deserializable on release Y).

I also think serialization is better done "at the top" where you can do a better job encoding things. EG this is the purpose of the TermsDict (to serialize many Terms, in sorted order).

Jason, what's the big picture use case here; what are you serializing?

Mark Miller (@markrmiller) (migrated from JIRA)

Lucene supports serialization explicitly by implementing Serializable.

Right, but we don't really support it (like many/most I would guess). There is a pain in the butt cost of support. Since this patch seems to push that pain around, I'm just wondering if the motivation for it is worth the cost (not knowing the motivation).

Jason Rutherglen (migrated from JIRA)

Currently serialization between 2.3 and 2.4 is broken; backwards compatibility is broken.

Saying "we don't really support it" means Serializable will not be implemented in any classes.

This is really simple Java stuff, and I'm surprised it is raising concern here.

Doug Cutting (@cutting) (migrated from JIRA)

The attached patch optimizes java serialization. Also, if we want java serialization to work cross-version, it gives us a leg to stand on. It doesn't change anything for most Lucene users, since Lucene doesn't use java serialization except for RMI. So, today, if you're using RemoteSearcher, and the local and remote versions have different versions of Term.java, things will probably fail, since, by default, the serialVersionUID is a hash of the method signatures and fields. If we want RemoteSearcher to work cross-version, then we need to explicitly manage the serialVersionUID and readExternal/writeExternal implementations. But is that really a priority?
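
A minimal sketch of what explicitly managing the serialVersionUID looks like (fields shown for illustration): pinning the value by hand means a recompilation or an added constructor no longer changes the stream's class identity.

import java.io.Serializable;

public class Term implements Serializable {
  // Pinned by hand; without this, the JVM derives a hash from the
  // class's signatures, which varies across compilers and class changes.
  private static final long serialVersionUID = 1L;

  private String field;
  private String text;
  // ...
}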

As with all optimizations, the performance improvement should be measured and demonstrated to be significant. Also, an invasive change like this is hard to justify when so little of Lucene depends on java serialization.

John Wang (migrated from JIRA)

The fact that an object implements Serializable implies that the object can be serialized. It is known good Java programming practice to include a suid (as a static variable) in a class when it declares itself to be Serializable. If it is not meant to be serialized, why did it implement Serializable? Furthermore, what is the reason to avoid it being serialized? I find the cost-of-support reasoning kinda ridiculous; it could be applied to any bug fix, because at the end of the day, this is a bug.

I don't understand the issue of "extra bytes" in the term dictionary if the Term instance is not actually serialized to the index (at least I really hope that is not done).

The serialVersionUID (suid) is a long because it is a java thing. Here is a link to some information on the subject:
http://java.sun.com/developer/technicalArticles/Programming/serialization/

Use case: deploying Lucene in a distributed environment, we have a broker/server architecture (standard stuff), and we want to roll out search servers with Lucene 2.4 instance by instance. The problem is that the broker, running 2.3, sends a Query object to the searcher via Java serialization at the server level. Because of specifically this problem, 2.3 brokers cannot talk to 2.4 search servers even though the Query object was not changed.

To me, this is a very valid use-case. The problem was that two different people did the release with different compilers.

At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community. This is a rather trivial issue, and it is taking 7 back-and-forths to reiterate some standard Java behavior that has been around for years.

Lucene is a great project and has enjoyed great success, and I think it is to everyone's interest to make sure Lucene grows in a healthy environment.

Jason Rutherglen (migrated from JIRA)

The Spring framework http://www.springframework.org/ is a good example of a widely used open source Java project that implements and uses serialization in most of its classes. If serialization will not be fixed in Lucene then perhaps it's best to implement a serializable wrapper for Lucene in the Spring project.

Michael McCandless (@mikemccand) (migrated from JIRA)

> At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community. This is a rather trivial issue, and it is taking 7 back-and-forths to reiterate some standard Java behavior that has been around for years.

Whoa! I'm sorry if my questions are giving this impression. I don't
intend to.

But I do have real questions, still, because I don't think
Serialization is actually so simple. I too was surprised on looking
at what started as a simple patch yet on digging into it uncovered
some real challenges.

> Use case: deploying Lucene in a distributed environment, we have a broker/server architecture (standard stuff), and we want to roll out search servers with Lucene 2.4 instance by instance. The problem is that the broker, running 2.3, sends a Query object to the searcher via Java serialization at the server level. Because of specifically this problem, 2.3 brokers cannot talk to 2.4 search servers even though the Query object was not changed.

OK that is a great use case – thanks. That helps focus the many
questions here.

> It is a known good java programming practice to include a suid to the class (as a static variable) when the object declares itself to be Serializable.

But that alone gives a too-fragile back-compat solution because it's
too coarse. If we add field X to a class implementing Serializable,
and must bump the SUID, that's a hard break on back compat. So really
we need to override read/writeObject() or read/writeExternal() to do
our own versioning.

Consider this actual example: RangeQuery, in 2.9, now separately
stores "boolean includeLower" and "boolean includeUpper". In versions
<= 2.4, it only stores "boolean inclusive". This means we can't rely
on the JVM's default versioning for serialization.
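
As a hedged sketch (not actual patch code) of what such versioning could look like with plain Serializable: declaring serialPersistentFields keeps the retired pre-2.9 "inclusive" field readable, and GetField.defaulted() reveals whether the stream predates the split into two flags. The write side would need a matching writeObject using putFields().

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

public class RangeQuerySketch implements Serializable {
  private static final ObjectStreamField[] serialPersistentFields = {
      new ObjectStreamField("inclusive", boolean.class),     // <= 2.4 streams
      new ObjectStreamField("includeLower", boolean.class),  // 2.9+ streams
      new ObjectStreamField("includeUpper", boolean.class),  // 2.9+ streams
  };

  private boolean includeLower;
  private boolean includeUpper;

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    ObjectInputStream.GetField fields = in.readFields();
    if (fields.defaulted("includeLower")) {
      // Old stream: only the single "inclusive" flag was written.
      boolean inclusive = fields.get("inclusive", true);
      includeLower = inclusive;
      includeUpper = inclusive;
    } else {
      includeLower = fields.get("includeLower", true);
      includeUpper = fields.get("includeUpper", true);
    }
  }
}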

> The serialVersionUID (suid) is a long because it is a java thing.

But, that's only if you rely on the JVM's default serialization. If
we implement our own (overriding read/writeObject or
read/writeExternal) we don't have to use a long SUID.

> The problem was two different people did the release with different compilers.

I think it's more likely the addition of a new ctor to Term (that
takes only String field) that changed the SUID.

> If it is not meant to be serialized, why did it implement Serializable.

Because there are two different things it can "mean" when a class
implements Serializable, and I think that's the core
disconnect/challenge to this issue.

The first meaning (let's call it "live serialization") is: "within the
same version of Lucene you can serialize/deserialize this object".

The second meaning (let's call it "long-term persistence") is: "you
can serialize this object in version X of Lucene and later deserialize
it using a newer version Y of Lucene".

Lucene, today, only guarantees "live serialization", and that's the
intention when "implements Serializable" is added to a class.

But, what's now being asked for (expected) with this issue is
"long-term persistence", which is really a very different beast and a
much taller order. With it comes a number of challenges, that warrant
scrutiny:

  • What's our back-compat policy for "long-term persistence"?

  • The storage protocol must have a version header, so future changes
    can switch on that and decode older formats.

  • We need strong test cases that deserialize older versions of these
    serialized classes so we don't accidentally break it.

  • We should look carefully at the protocol and not waste bytes if we
    can (1 byte vs 8 byte version header).

These issues are the same issues we face with the index file format,
because that is also long-term persistence.

robert engels (migrated from JIRA)

I don't see why you can't just break compatibility between versions when dealing with Serialization. Just have it continue to mean live (or close to live) persistence.

Even the JDK does this (e.g. Swing serialization makes no guarantees). Just do the same - bound to change between releases...

Also, different compilers will generate different SUIDs... usually due to synthetic methods. It's kind of a problem...

Chris M. Hostetter (@hossman) (migrated from JIRA)

For the record: i have limited understanding of java serialization issues...

At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community.

The developer community consists of hundreds (possibly thousands) of people, who participate at various levels. At the time the above quoted comment was posted, 4 members of the community had expressed an opinion on this issue: 1 clearly in favor, and 3 questioning the advantages and disadvantages as they affect the whole community, both in terms of the performance impacts for existing use cases, and the long term support issues that might come from a change like this.

How anyone could consider these comments and questions "unwillingness to work with the developer community" blows my mind ... i do not see an overwhelming push by the community at large for a feature, i do not see a "Lucene powerhouse" arguing against the will of the masses ... I see two people arguing in favor of a change, and three questioning whether this change is a good idea (i'm not sure if i understand robert's post fully, i believe he's suggesting we maintain the status quo such that serialization is supported but no claims are made about back-compatible serialization).

I would define that as healthy discussion.

This, to me, seems to be the crux of the issue...

Lucene, today, only guarantees "live serialization", and that's the
intention when "implements Serializable" is added to a class.

But, what's now being asked for (expected) with this issue is
"long-term persistence", which is really a very different beast and a
much taller order. With it comes a number of challenges, that warrant
scrutiny:

...this jibes with even my limited experience with java serialization, and i share Michael's concerns. The current behavior does not appear to be a bug; it appears to be the correct effect of a serializable class changing between two versions w/o any explicit policy on serialization compatibility. The changes being discussed seem to be a request for a new "feature": a back-compat commitment on the serialization of one (or more) classes. However small the patch may be, it's still a significant change that should not be made lightly ... I certainly think all 4 of Michael's bullet points should be clearly addressed before committing anything, and I agree with Doug's earlier observation regarding performance: we need to test that a new serialization strategy won't adversely affect existing users who rely on (what Michael referred to as) "live serialization".

This is my opinion. I voice it not as a member of any sort of mythical "Lucene powerhouse" but as a member of the Lucene community who is concerned about making sure that decisions are made towards "everyone's interest to make sure Lucene grows in a healthy environment" – not just the interests of two vocal people.

Jason Rutherglen (migrated from JIRA)

> the performance improvement should be measured and demonstrated to be significant

The initial concern was the incompatibility of serialized objects between Lucene versions. The performance improvements created by using Externalizable are secondary, so providing tests would be a waste of time if the "community" believes it is too much effort to add one line of code to a handful of classes. Implementing Externalizable is a way to reduce the size of the serialized objects, manage the serialized object versions, and provide performance improvements. Externalizable provides the most benefits and is very similar to the system Hadoop uses with Writable. Externalizable works seamlessly with native object serialization and Serializable classes, meaning it works with a number of existing Java classes in addition to Externalizable classes.

Using distributed serialized objects for search in Lucene is a natural Java-based way to run a Lucene system. In many cases it is ideal because Java provides something C++ does not: dynamic in-process class loading. In a large grid-based search system that requires 100% uptime, this feature can be particularly useful.

Adding a serialVersionUID to the classes is one option, adding Externalizable is another option.

If the decision is to not support Serialization in Lucene then I recommend removing Serializable from all classes in Lucene 3.0 so that users do not mistakenly expect the search library to behave the way other Java libraries such as ICU4J, JDK class libraries, Spring, etc do.

Doug Cutting (@cutting) (migrated from JIRA)

> But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order.

That's the crux, alright. Does Lucene want to start adding cross-version guarantees about the durability of its objects when serialized by Java serialization? This is a hard problem. Systems like Thrift and ProtocolBuffers offer support for this, but Java serialization itself doesn't really provide much assistance. One can roll one's own serialization compatibility story manually, as proposed by this patch, but that adds a burden to the project. We'd need, for example, test cases that keep serialized instances from past versions, so that we can be sure that patches do not break this.

The use case provided may not use RMI, but it is similar: it involves transmitting Lucene objects over the wire between different versions of Lucene. Since Java APIs, like Lucene, do not generally provide cross-version compatibility, it would be safer to architect such a system so that it controls the serialization of transmitted instances itself and can thus guarantee their compatibility as the system is updated. Thus it would develop its own representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?

Doug Cutting (@cutting) (migrated from JIRA)

If it is not meant to be serialized, why did it implement Serializable. Furthermore, what is the reason to avoid it being serialized? I find the reason being the cost of support kinda ridiculous, seems this reason can be applied to any bug fix, because this at the end of the day, it is a bug.

The "implements Serializeable" was added to support RemoteSearchable. If we believe this creates a bug, then perhaps we should remove this and implement RemoteSearchable in another way. As it stands, Lucene does not support Java Serialization across Lucene versions. That seems to me more like a limitation than a bug, no?

Every line of code added to Lucene is a support burden, so we must carefully weigh the costs and benefits of each line. This issue proposes to add many lines, and to add a substantial new back-compatibility requirement. Back-compatibility is something that Lucene takes seriously. We make promises about both API back-compatibility and file-format back-compatibility. These already significantly constrain development. Adding a new back-compatibility requirement should not be done lightly, but only after broad consensus is reached through patient discussion.

robert engels (migrated from JIRA)

In regards to Doug's comment about an alternate form... doesn't SOLR already have an XML-based query format?

If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible).

Mark Miller (@markrmiller) (migrated from JIRA)

The "implements Serializeable" was added to support RemoteSearchable. If we believe this creates a bug, then perhaps we should remove this and implement RemoteSearchable in another way. As it stands, Lucene does not support Java Serialization across Lucene versions. That seems to me more like a limitation than a bug, no?

There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: IsSerializable. Many complained right away, as they had other classes that perhaps they were making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or IsSerializable.

Personally, I think it's fine to do as we are. I'm not against supporting more though.

If we choose not to go further (and from what I can tell that decision has not yet been made, for or against), we should add to the javadocs what we support, as I don't think it's a bug myself. The Serializable interface indicates that the class and its subclasses are serializable; my reading of the javadoc does not indicate what cross-version compatibility must be supported. I believe that is up to the implementor.

  • Mark

Jason Rutherglen (migrated from JIRA)

"This is a hard problem."

I disagree. It's completely manageable. Doesn't Hadoop handle versioning inside of Writable classes?

ScoreDocComparator javadoc "sortValue(ScoreDoc i) Returns the value used to sort the given document. The object returned must implement the java.io.Serializable interface. This is used by multisearchers to determine how to collate results from their searchers."

This kind of statement in the code leads one to believe that Lucene supports Serialization. Maybe it should be removed from the Javadocs.

"Thrift and ProtocolBuffers" don't support dynamic class loading. If one were to create their own Query class with custom code, serializing is the only way to represent the Query object and have Java load the additional implementation code. One easy to see use case is if Analyzer were made Serializable then indexing over the network and trying different analyzing techniques could be accomplished with ease in a grid computing environment.

"representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?"

Mike wrote "if we add field X to a class implementing Serializable,
and must bump the SUID, that's a hard break on back compat. "

There needs to be "if statements" in readExternal to handle backwards compatibility. Given the number of classes, and the number of fields this isn't very much work. Neither are the test cases. I worked on RMI and Jini at Sun and elsewhere. I am happy to put forth the effort to maintain and develop this functionality. It is advantageous to place this functionality directly into the classes because in my experience many of the Lucene classes do not make all of the field data public, and things like dedicated serialization such as the XML query code are voluminous. Also the half support of serialization right now seems to indicate there really isn't support for it.

Hoss wrote: "sort of mythical "Lucene powerhouse"
Lucene seems to run itself quite differently than other open source Java projects. Perhaps it would be good to spell out the reasons for the reluctance to move ahead with features that developers work on, that work, but do not go in. The developer contributions seem to be quite low right now, especially compared to neighbor projects such as Hadoop. Is this because fewer people are using Lucene? Or is it due to a reluctance to work with the developer community? Unfortunately, in the eyes of some people who work on search-related projects, the perception is the latter.

Many developers seem to be working outside of Lucene and choosing not to open source in order to avoid going through the current hassles of getting code committed to the project.

Jason Rutherglen (migrated from JIRA)

"In regards to Doug's comment about an alternate form... doesn't SOLR already have a XML based query format? If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible)."

SOLR does not have an XML-based query format. XML is not ideal for distributed search because it is slow and verbose. There are many ways to serialize things; the issue is not in choosing one, but in supporting what most Java libraries do today, which is serialization native to the Java platform.

Jason Rutherglen (migrated from JIRA)

Mark: "There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: isSerizable. Many complained right away, as they had other classes that perhaps they where making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or isSerialiazble.

Personally, I think its fine to do as we are. I'm not against supporting more though. "

Externalizable and Serializable work interchangeably, a nice feature of Java. For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added. For ones that implement Externalizable, there is slightly more work, but not something someone with a year of Java experience can't maintain.

asfimport commented Dec 3, 2008

Doug Cutting (@cutting) (migrated from JIRA)

> Doesn't Hadoop handle versioning inside of Writable classes?

Currently, yes, but this is probably insufficient for a Hadoop 1.0 release. Hadoop is a distributed system and would like to provide RPC back-compatibility across minor versions after we go 1.0. This is an explicit decision that's in the process of discussion within the Hadoop project. RPCs within Hadoop daemons will probably require identical versions – all daemons in a cluster must be upgraded in lockstep. We'll thus probably limit back-compatibility to client RPCs, so that a given client can talk to multiple clusters that are not running identical versions of Hadoop. Lucene has made no such explicit policy decision. Lucene is not an inherently distributed system.

Hadoop has not yet decided what mechanism to use to support back-compatible RPC, but Writable versioning is not sufficient, since it does not handle RPC protocols, and it's lower-level than we'd prefer. We'll probably go with something more like Thrift or ProtocolBuffers. Hadoop does not use Java serialization and makes no promises about that.

> The developer contributions seem to be quite low right now, especially compared to neighbor projects such as Hadoop.

As one who monitors both projects, I don't see a marked difference. In both there are sometimes patches that unfortunately languish, because, while they're important to the contributor, they fail to sufficiently engage a committer. For example, HADOOP-3422 took over 6 months to get committed, probably because not many committers use Ganglia.

There is a difference in quantity: hadoop-dev has over 4x the traffic of lucene-dev. But, normalized for that, the number of patches from non-committers feels comparable. If anything I'd guess Lucene commits more patches from non-committers than does Hadoop.

Doug Cutting (@cutting) (migrated from JIRA)

The documentation should probably be fixed to state that Lucene's use of Serializable currently assumes that all parties are using the exact same version of Lucene. That's the default for Serializable, but it probably bears stating explicitly. Then we should decide, going forward, whether this should change, and, if so, for which classes and how. Such a policy should be agreed on before code is written, no?

robert engels (migrated from JIRA)

Jason, you are only partially correct.

SOLR has XML definitions for updates/commits/deletes. It also supports string based queries. It also supports using XML for queries if you provide a handler, but the string syntax is simpler.

As for the serialization performance, you are mistaken.

For things like searches, the parsing time is extremely minor compared with the search time, so this additional overhead would be a fraction of the cost.

When returning result sets, using XML can make a huge difference, as the overhead is paid on every item. Even so, with modern XML processors, the search time is still going to be the overriding performance factor by a huge margin. Typically "paged" results are also used, so again the XML parsing, compared to the network overhead, is going to be minor.

Still, if it is only for temporary serialization, binary works best - as long as it is Java to Java.

We have a search server that uses Java serialization for the message passing, including the results. It can be done without any changes to Lucene - again the overriding performance factor is the search itself (unless the query returns 100k+ documents and all are being returned; then the Java serialization time can be more than the search itself).

Jason Rutherglen (migrated from JIRA)

Robert:
> using XML for queries if you provide a handler,

That doesn't sound like query serialization.

SOLR has a binary protocol due to criticisms about XML being slow.

I'm not sure why you and Doug are focusing on performance when that is not really the main issue I brought up.

Also I'm confused as to why dynamic classloading is being ignored by you folks as a Java feature that a Java search library could take advantage of to differentiate itself in the search (closed and open source) marketplace.

robert engels (migrated from JIRA)

The reason the XML is not needed is that the string format is robust enough, and is simpler...

I am not focused on performance. It is just that Java serialization works well for temporary persistence. Other formats are better for long-term persistence. If you are only doing temporary persistence, you don't need backwards compatibility.

Also if the API is exposed to the world (i.e. non-Java), a human readable (or even binary) format that is not based on Java serialization is going to work much better.

If you use a non-binary protocol it is far easier to extend it in the future, and to retain the ability to easily read older versions. This is why Swing uses XML for serialization, and not binary.

You could certainly use specialized class loaders to load old versions of classes in order to maintain the ability to read old, now-incompatible classes... it is just a lot of work (maintenance too: you need to keep the old code around, etc.) for not a lot of benefit.

As for SOLR's binary protocol, fine, but it is probably for a fringe use case, or the submitter didn't do real world tests... The XML parsing is just not that much greater than binary (at least in Java, since it is the object creation they both use that affects it). The search time is going to be far greater.

For large updates a binary loader can be more efficient than XML, but if you test it using real-world examples, I doubt you will see a huge difference - at least for the types of application TYPICALLY written using Lucene.

Doug Cutting (@cutting) (migrated from JIRA)

> I'm not sure why you and Doug and focusing on performance when that is not really the main issue I brought up.

In the description of this issue you claim, "This will make Serialization faster". I only added that, if this is a motivation, then it should be benchmarked and quantified. If it's not a motivation, and cross-version compatibility is the only motivation, then that should be clarified.

> dynamic classloading is being ignored by you folks

Perhaps most Lucene developers are not using dynamic class loading, just as most Lucene developers seem not to be relying on java serialization. Perhaps you can convince them to start using it, but if not, you may need to find a way to get Lucene to support it that minimally impacts other users of Lucene. Adding methods to Term and Query that must be changed whenever these classes change adds a cost. If folks don't see much benefit, then that cost outweighs. Perhaps you can better enlighten us to the benefits rather than assert willful ignorance? Jini promised great things with dynamic class loading, but the list of folks that use Jini is not long (http://www.jini.org/wiki/Who_uses_Jini%3F).

Jason Rutherglen (migrated from JIRA)

The discussion has evolved out of scope. Cross-version compatibility is the main goal. We have multiple versions of Lucene using the Spring RPC protocol. The standard way to solve this is to add a serialVersionUID, like HashMap, String, and the other Java classes do.

Performance is a topic of concern for Lucene users, and existing RMI/serialization users would transparently benefit from Externalizable being used.

I would like to implement this as a separate project; however, performing reflection on each of the objects is not an efficient approach. Writing wrapper code for each and every variation of Query is a waste of time when it implements Serializable and the communication is between Java systems.

It seems best to remove Serialization from Lucene so that users are not confused and create a better solution.

> Perhaps most Lucene developers are not using dynamic class loading,

Dynamic classloading is popular and accepted in J2EE servers. If done properly it is a very convenient way of deploying Java based systems. Jini did not make this convenient enough. Jini did not have a specific problem it was trying to solve and so was too complex and open ended. Spring and J2EE make use of Java serialization for distributed objects. People may not be using Lucene for this today but this is due largely to lack of support for things like standard Serialization. With Lucene it is possible to make dynamic search classloading convenient in a search grid environment.

When J2EE was designed, no one was using Java on the server side. A framework was composed, a handful of companies implemented the specification, and it then found its way into projects. If you are waiting for users to ask for something like this that does not yet exist, it will not happen.

The interface Lucene exposes is relatively static and known. All of the top-level classes (Query, Analyzer, Document, Similarity) do not change very often. In a search-based grid computing environment, the ability to execute arbitrary code against the cloud saves time and effort in deploying new code to servers. Restarting processes in a production environment is always expensive. If one is implementing, for example, mutating genetic algorithms in Java over Lucene, then it would be advantageous to dynamically load the classes that implement this. They would be modifications of several classes such as Query and Similarity.

It is peculiar that so much effort goes into backwards compatibility of the index while serialization is ignored. This is and will be very confusing to users, especially ones who use J2EE.

Mark Harwood (@markharwood) (migrated from JIRA)

The contrib section of Lucene contains an XML-based query parser which aims to provide full coverage of Lucene queries/filters and to provide extensibility to support 3rd-party classes.
I use this regularly in distributed deployments; it allows both non-Java clients and long-term persistence of queries with good stability across Lucene versions.
Although I have not conducted formal benchmarks, I have not found XML parsing to be a bottleneck - search execution and/or document retrieval are normally the main bottlenecks.

Maintaining XML parsing code is an overhead but ultimately helps decouple requests from the logic that executes requests. In serializing Lucene Query/Filter objects we are dealing with classes which combine both the representation of the request criteria (what needs to be done) and the implementation (how things are done). We are forever finessing the "how" bit of this equation, e.g. moving from RangeQuery to RangeFilters to TrieRangeFilter. The criteria, however, remain relatively static ("I just want to search on a range"), and so it is dangerous to build clients that refer directly to query implementation classes.
The XML parser provides a language-independent abstraction for clients to define what they want to be done without being too tied to how this is implemented.

Cheers
Mark

Michael McCandless (@mikemccand) (migrated from JIRA)

It seems best to remove Serialization from Lucene so that users are not confused and create a better solution.

I don't think that's the case. If we choose to only support "live serialization" then we should add "implements Serializable" but spell out clearly in the javadocs that there is no guarantee of cross-version compatibility ("long term persistence") and in fact that often there are incompatibilities.

I think "live serialization" is still a useful feature.

Michael McCandless (@mikemccand) (migrated from JIRA)

For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added.

The serialVersionUID approach would be too simplistic, because we can't simply bump it up whenever we make a change since that then breaks back compatibility. We would have to override write/readObject or write/readExternal, and serialVersionUID would not be used.

John Wang (migrated from JIRA)

Mike:

If you have class A implements Serializable, with a defined suid, say 1.

Let A2 be a newer version of class A, and the suid is not changed, say 1.

Let's say A2 has a new field.

Imagine A is running in VM1 and A2 is running in VM2. Serialization between VM1 and VM2 of class A is ok, just that A will not get the new fields. Which is fine since VM1 does not make use of it.

You can argue that A2 will not get the needed field from serialized A, but isn't that better than crashing?

Either way, I think the behavior is better than it is currently. (Maybe that's why Eclipse and FindBugs both report the lack of a suid definition in Lucene code as a warning.)

I agree adding an Externalizable implementation is more work, but it would make the serialization story correct.

-John

Doug Cutting (@cutting) (migrated from JIRA)

> Serialization between VM1 and VM2 of class A is ok, just that A will not get the new fields. Which is fine since VM1 does not make use of it.

But VM1 might require an older field that the new field replaced, and VM1 may then crash in an unpredictable way. Not defining explicit suid's is more conservative: you get a well-defined exception when things might not work. Defining suid's but doing nothing else about compatibility is playing fast-and-loose: it might work in many cases, but it also might cause strange, hard-to-diagnose problems in others. If we want Lucene to work reliably across versions, then we need to commit to that goal as a project, define the limits of the compatibility, implement Externalizable, add tests, etc. Just adding suid's doesn't achieve that, so far as I can see.

robert engels (migrated from JIRA)

Even if you changed SUIDs based on version changes, there is the very real possibility that the new code CAN'T be instantiated in any meaningful way from the old data. Then what would you do?

Even if you had all of the old classes, and their dependencies available from dynamic classloading, it still won't work UNLESS every new feature is designed with backwards compatibility with previous versions - a burden that is just too great when required of all Lucene code.

Given that, as has been discussed, there are other formats that can be used where isolated backwards persistence is desired (like XML based query descriptions). Even these won't work if the XML description references explicit classes - which is why designing such a format for a near limitless query structure (given user defined query classes) is probably impossible.

So strive for a decent solution that covers most cases, and fails gracefully when it can't work.

Using standard serialization (with proper transient fields) seems to fit this bill, since in a stable API most core classes should remain fairly constant, and those that are bound to change may take explicit steps in their serialization (if deemed needed).

John Wang (migrated from JIRA)

The discussion here is whether it is better to have 100% of the time failing vs. 10% of the time failing (these are just meaningless numbers to express a point).
I do buy Doug's comment about getting into a weird state due to data serialization, but this is something Externalizable would solve.
This discussion has digressed to general Java serialization design, where it originally scoped only to several lucene classes.

If it is documented that Lucene only supports serialization of classes from the same jar, is that really enough? Doesn't it also depend on the compiler, if someone were to build their own jar?

Furthermore, in a distributed environment with lots of machines it is always ideal to upgrade bit by bit. Is taking this functionality away by imposing this restriction a good trade-off compared to just implementing Externalizable for a few classes, if Serializable is deemed to be dangerous? I am not so sure, given the Lucene classes we are talking about.

Doug Cutting (@cutting) (migrated from JIRA)

> This discussion has digressed to general Java serialization design, where it originally scoped only to several lucene classes.

Which classes? The existing patch applies to one class. Jason said, "If it looks ok, I will implement Externalizable in other classes." but never said which. It would be good to know how wide the impact of the proposed change would be.

John Wang (migrated from JIRA)

For our problem, it is Query and all its derived and encapsulated classes. I guess the title of the bug is too generic.

As far as my comment about other Lucene classes, one can just go to the Lucene javadoc, click on "Tree", and look for Serializable. If you want me to, I can go and fetch the complete list, but here are some examples:

  1. Document (Field etc.)
  2. OpenBitSet, Filter ...
  3. Sort, SortField
  4. Term
  5. TopDocs, Hits etc.

For the top level API.

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

serialVersionUID added to the relevant classes manually. Defaulted to 10 because the value does not matter, as long as it is different between versions. Thought of writing some code to go through the Lucene JAR, do an instanceof on the classes for Serializable, and then verify that the serialVersionUID is 10.

Term implements Externalizable.

SerializationUtils was adapted from Hadoop's WritableUtils for writing VLong.

The TestSerialization use case does term serialization, serializes an arbitrary query to a file, and compares them.

TODO:

  • Implement Externalizable
  • More unit tests? How to write a unit test for multiple versions?

Michael McCandless (@mikemccand) (migrated from JIRA)

SerializeUtils is missing from the patch.

Doug Cutting (@cutting) (migrated from JIRA)

> How to write a unit test for multiple versions?

We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort and should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure.
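
A sketch of such a test (the .ser file names are hypothetical): each file would hold a query serialized by the oldest supported release and checked into the test tree, then compared via equals() against a freshly built equivalent.

import java.io.FileInputStream;
import java.io.ObjectInputStream;

import junit.framework.TestCase;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TestQuerySerializationCompat extends TestCase {

  private Object deserialize(String path) throws Exception {
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(path));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }

  // Verifies that a query serialized by an old release equals the
  // current construction of the same query.
  public void testTermQueryFrom24() throws Exception {
    Query expected = new TermQuery(new Term("field", "value"));
    Query old = (Query) deserialize("src/test/ser/2.4/TermQuery.ser");
    assertEquals(expected, old);
  }
}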

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Added Externalizable to Document, Field, AbstractField (as compared to the previous patch). SerializationUtils is included.

TODO:

  • More Externalizable classes with test cases for each one

Jason Rutherglen (migrated from JIRA)

Doug wrote: We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort and should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure.


Makes sense. I guarantee 2.9 and above classes will be backward compatible with the previous classes. I think that for 3.0 we'll start to create new replacement classes that will not conflict with the old classes. I'd really like to redesign the query, similarity, and scoring code to work with flexible indexing and allow new algorithms. This new code will not create changes in the existing query, similarity, and scoring code which will remain serialization compatible with 2.9. The 2.9 query, similarity, and scoring should leverage the new query, similarity and scoring code to be backwards compatible.

Doug Cutting (@cutting) (migrated from JIRA)

> I guarantee 2.9 and above classes will be backward compatible with the previous classes.

It sounds like you are personally guaranteeing that all serializable classes will be forever compatible. That's not what we'd need. We'd need a proposed policy for the project to consider in terms of major and minor releases, specifying forward and/or backward compatibility guarantees. For example, we might say, "within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle". Similarly detailed statements would need to be made for each Externalizable, no?

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Added some more Externalizables.

o.a.l.util.Parameter is peculiar in that it implements readResolve to override the serialization and return a local object to emulate enums. I haven't figured out the places this is used and what the best approach is to externalize them.

TODO:

  • Same as before

Doug wrote: ""within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle". Similarly detailed statements would need to be made for each Externalizeable, no?"

Serialized objects in minor releases will work. Serialized objects of older versions starting with 2.9 will be compatible with newer versions. New versions will be compatible with older versions on a class-by-class basis defined in the release notes. It could look something like this:

Serialization notes:
BooleanQuery added a scoreMap variable that does not have a default value in 3.0 and is now not backwards compatible with 2.9.
PhraseQuery added a ultraWeight variable that defaults to true in 3.0 and is backwards compatible with 2.9.

Michael Busch (migrated from JIRA)

I really wouldn't want to add another backwards-compatibility
requirement to Lucene, as the others already stated. In the past,
ensuring backwards-compatibility was often the part of writing patches
that took the longest and involved the most discussion.

But maybe we can come up with a fair compromise here. What if we
change the classes that currently implement Serializable so that they
implement Externalizable and add a suid, as Jason suggests? But we make
clear in the javadocs that we don't support backwards-compatibility
here, so e.g. a Term externalized with Lucene 2.9 can't be read with
3.0, only with 2.9.
Then we add a new class CustomExternalizableReader to util:

public abstract class CustomExternalizableReader {
  public abstract void readExternal(Object obj, ObjectInput in)
      throws IOException, ClassNotFoundException;
} 

Add a package-private, static variable of this type to each class that
implements Externalizable, and implement the deserialization code in a
default instance of such a reader. This could look like this:

public class SomeClass implements Externalizable {
  private int one;
  private int two;

...

  static CustomExternalizableReader extReader = new CustomExternalizableReader() {
    public void readExternal(Object obj, ObjectInput in) throws IOException,
        ClassNotFoundException {
      SomeClass s = (SomeClass) obj;
      long uid = in.readLong();
      if (uid != serialVersionUID) {
        throw new IOException("Wrong serialVerionUID: " + uid);
      }
      int one = in.readInt();
      int two = in.readInt();
      s.init(one, two);
    }
  };

  // initialization method for readExternal
  void init(int one, int two) {
    this.one = one;
    this.two = two;
  }
}

Note that I also specified an init() method. Since init() and
extReader are both package-private, they are not protected by our
backwards-compatibility policy and we can change them in any release.

Now if in the next version of this class we add a new variable 'three'
we have to change init() and the reader:

public class SomeClassNewVersion implements Externalizable {
  private int one;
  private int two;
  private int three;

  static final long serialVersionUID = 2L;

  public void readExternal(ObjectInput in) throws IOException,
      ClassNotFoundException {
    extReader.readExternal(this, in);
  }

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeLong(serialVersionUID);
    out.writeInt(one);
    out.writeInt(two);
    out.writeInt(three);
  }

  /**
   * This reader can only read the externalized format created with the same
   * version of this class. If backwards-compatibility is desired, a custom
   * reader has to be implemented.
   */
  static CustomExternalizableReader extReader = new CustomExternalizableReader() {
    public void readExternal(Object obj, ObjectInput in) throws IOException,
        ClassNotFoundException {
      SomeClassNewVersion s = (SomeClassNewVersion) obj;
      long uid = in.readLong();
      if (uid != serialVersionUID) {
        throw new IOException("Wrong serialVerionUID: " + uid);
      }
      int one = in.readInt();
      int two = in.readInt();
      int three = in.readInt();
      s.init(one, two, three);
    }
  };

  void init(int one, int two, int three) {
    this.one = one;
    this.two = two;
    this.three = three;
  }
}

Now if someone tries to deserialize an object that was written with
an old Lucene version, an exception will be thrown.

But the user can simply implement an own, backwards-compatible reader:

    // Now the user implements their own backwards compatible reader
    SomeClassNewVersion.extReader = new CustomExternalizableReader() {
      public void readExternal(Object obj, ObjectInput in) throws IOException,
          ClassNotFoundException {
        SomeClassNewVersion c_new = (SomeClassNewVersion) obj;
        long uid = in.readLong();
        int one = in.readInt();
        int two = in.readInt();
        int three;
        if (uid == 1) {
          // old version - initialize with default value
          three = -3;
        } else {
          // new version
          three = in.readInt();
        }
        c_new.init(one, two, three);
      }
    };

With this approach we have to clearly document in the javadocs the
externalization format. Also if externalizable classes contain private
inner classes that need to be serialized, then those inner classes
have to be made package-private.

The nice thing here is that we allow backwards-compatibility, but push
the burden of maintaining it to the user.

I coded this all up as an example that I'm attaching here. Let me know
what you think, please. The patch file contains a Demo.java with a main
method that demonstrates what I'm proposing here.

asfimport commented Dec 6, 2008

Michael McCandless (@mikemccand) (migrated from JIRA)

> In the past, ensuring backwards-compatibility was often the part of
> writing patches that took the longest and involved the most
> discussion.

It very much still is, as I'm learning with #2532!

Your first example is missing the read/writeExternal methods.

I think the proposed approach is rather heavy-weight – we will have
implemented readExternal, writeExternal, this new
CustomExternalizableReader, package-private init methods, made private
inner classes package-private, plus the need to javadoc the
current externalization format written for each of our classes, and the
future need to help users understand how they can achieve back
compatibility by subclassing CustomExternalizableReader, etc.

I guess my feeling is all of that is a good amount more work than just
deciding to directly implement back compatibility, ourselves.

EG, to do your example in a future world where we do support back
compat of serialized classes (NOTE – none of the code below is
compiled/tested):

First a util class for managing versions:

public class Versions {
  private int current;

  int add(String desc) {
    // TODO: do something more interesting with desc
    return current++;
  }

  void write(ObjectOutput out) throws IOException {
    // TODO: writeVInt
    out.writeByte((byte) current);
  }

  int read(ObjectInput in) throws IOException {
    // TODO: readVInt
    final byte version = in.readByte();
    if (version > current)
      throw new IOException("this object was serialized by a newer version of Lucene (got " + version + " but expected <= " + current + ")");
    // return the stream's version so readExternal can branch on it
    return version;
  }
}

Then, someone creates SomeClass:

public class SomeClass implements Externalizable {
  private int one;
  private int two;

  private static final Versions versions = new Versions();
  private static final int VERSION0 = versions.add("start");

  public SomeClass() {};

  public void writeExternal(ObjectOutput out) throws IOException {
    versions.write(out);
    out.writeInt(one);
    out.writeInt(two);
  }

  public void readExternal(ObjectInput in) throws IOException {
    versions.read(in);
    one = in.readInt();
    two = in.readInt();
  }

  ...
}

Then on adding field three:

public class SomeClass implements Externalizable {
  private int one;
  private int two;
  private int three;

  private static final Versions versions = new Versions();
  private static final int VERSION0 = versions.add("start");
  private static final int VERSION1 = versions.add("the new field three");

  public SomeClass() {};

  public void writeExternal(ObjectOutput out) throws IOException {
    versions.write(out);
    out.writeInt(one);
    out.writeInt(two);
  }

  public void readExternal(ObjectInput in) throws IOException {
    int version = versions.read(in);
    one = in.readInt();
    two = in.readInt();
    if (version >= VERSION1)
      three = in.readInt();
    else
      // default
      three = -3;
  }

  ...
}

In fact I think we should switch to a Versions utility class for writing/reading our index files...

Michael Busch (migrated from JIRA)

Your first example is missing the read/writeExternal methods.

Oops, I forgot to copy&paste it. It's in the attached patch file though.

I think the proposed approach is rather heavy-weight

Really? If we go the Externalizable way anyway, then I don't think this
approach adds too much overhead. You only need to add init() and move
the deserialization code from readExternal() to the reader's
readExternal(). It's really not much more code.
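
A minimal sketch of that shape, assuming names along the lines of the attached patch (the signatures here are illustrative, not the patch itself):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

interface CustomExternalizableReader {
  void readExternal(SomeClass target, ObjectInput in) throws IOException;
}

public class SomeClass implements Externalizable {
  // Swappable reader: users who need cross-version compatibility can
  // install their own implementation here.
  static CustomExternalizableReader reader = new DefaultReader();

  private int one;
  private int two;

  public SomeClass() {}

  void init(int one, int two) {  // package private, called by readers
    this.one = one;
    this.two = two;
  }

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeInt(one);
    out.writeInt(two);
  }

  public void readExternal(ObjectInput in) throws IOException {
    reader.readExternal(this, in);  // delegate deserialization
  }

  static class DefaultReader implements CustomExternalizableReader {
    public void readExternal(SomeClass target, ObjectInput in) throws IOException {
      target.init(in.readInt(), in.readInt());
    }
  }
}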

And, the code changes are straightforward when the class changes. No
need to worry about how to initialize newly added variables if an old
version is read for example.

What I think will be the most work is documenting and explaining
this. But this would be an expert API, so the people who really need
to use it are most likely looking into the sources anyway.

But for the record: I'm totally fine with using Serializable and just
adding the serialVersionUID. Just if we use Externalizable, we might
want to consider something like this to avoid new backwards-
compatibility requirements.

@asfimport
Copy link
Author

Doug Cutting (@cutting) (migrated from JIRA)

Would it take any more lines of code to remove Serializable from the core classes and re-implement RemoteSearchable in a separate layer on top of the core APIs? That layer could be a contrib module and could get all the externalizable love it needs. It could support a specific popular subset of query and filter classes, rather than arbitrary Query implementations. It would be extensible, so that if folks wanted to support new kinds of queries, they easily could. This other approach seems like a slippery slope, complicating already complex code with new concerns. It would be better to encapsulate these concerns in a layer atop APIs whose back-compatibility we already make promises about, no?

@asfimport
Copy link
Author

Wolf Siberski (migrated from JIRA)

This seems to be the right way to go. The attached patch removes all dependencies on Serializable and Remote from the core and moves them to contrib/remote. I introduced a new interface, RemoteSearcher
(not RemoteSearchable, because I didn't want to pass Weights around), implemented by DefaultRemoteSearcher. An adapter realizing Searchable and delegating to RemoteSearcher is also included (RemoteSearcherAdapter). Encoding/decoding of Lucene objects is delegated to org.apache.lucene.remote.Serializer. For a sample serialization, I employed XStream, which offers XML serialization (nearly) out of the box.
Everything is rather undocumented and would need a lot of cleanup, but as a proof of concept it should be ok. Core and remote tests pass, with one exception: it is no longer possible to serialize a RAMDirectory.
What I don't like about the current patch is that a lot of different objects are passed around to keep the Searchable interface alive. Would it be possible to refactor such that Searchable represents a higher-level interface (or introduce a new alternative abstraction)?
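
A rough sketch of that encode/decode indirection (the Serializer interface name comes from the patch description; the method names and the XStream wiring here are illustrative):

// Any String-based encoding will do; XML via XStream is just one option.
public interface Serializer {
  String encode(Object o);
  Object decode(String s);
}

// Sample implementation backed by XStream's XML serialization.
class XStreamSerializer implements Serializer {
  private final com.thoughtworks.xstream.XStream xstream =
      new com.thoughtworks.xstream.XStream();

  public String encode(Object o) {
    return xstream.toXML(o);
  }

  public Object decode(String s) {
    return xstream.fromXML(s);
  }
}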

@asfimport
Copy link
Author

Wolf Siberski (migrated from JIRA)

This patch removes all dependencies on Serializable and Remote from the core and adds contrib/remote as a replacement

@asfimport
Copy link
Author

asfimport commented Dec 11, 2008

Mark Miller (@markrmiller) (migrated from JIRA)

Thanks Wolf, +1 on the change. This issue proposes to do the same thing: #2481

@asfimport
Copy link
Author

Doug Cutting (@cutting) (migrated from JIRA)

Thanks, Wolf, this looks like a promising approach.

Jason, John: would this sort of thing meet your needs?

I'm not sure we can remove everything from trunk immediately. Rather, we should deprecate things and remove them in 3.0. The removal of Serializable will break compatibility, so it must be well-advertised.

HitCollector-based search should simply not be supported in distributed search. The Searchable API was designed for remote use and does not include HitCollector-based access.

Weighting, and hence ranking, does not appear to be implemented correctly by this patch. An approach that might work would be to:

  • extend MultiSearcher
  • pass its CachedDfSource to remote searchers along with queries
  • construct a Weight on the search node using the CachedDfSource (a rough sketch follows below)

Does that make sense?
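
A very rough sketch of the search-node side, assuming the Lucene 2.x-era APIs (CachedDfSource is package-private in org.apache.lucene.search, so this would have to live in that package; the factory class and method names are hypothetical):

package org.apache.lucene.search;

import java.io.IOException;
import java.util.Map;

// Rebuilds the Weight on the search node from the shipped Query plus
// the global docFreq statistics gathered by the coordinating MultiSearcher.
class RemoteWeightFactory {
  static Weight createWeight(Query query, Map dfMap, int maxDoc,
                             Similarity similarity) throws IOException {
    Searcher dfSource = new CachedDfSource(dfMap, maxDoc, similarity);
    return query.weight(dfSource);
  }
}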

@asfimport
Copy link
Author

asfimport commented Dec 12, 2008

Jason Rutherglen (migrated from JIRA)

To Wolf: your patch looks like it was quite a bit of work – nice job! Restricting people to XML will probably not be suitable, though. Some may want JSON or something that more directly encodes the objects.

General:
It seems the alternative solutions to serialization simply shift the problem around but do not really solve the underlying issues (speed, versioning, writing custom serialization code, and perhaps dynamic classloading). The Externalizable code will not be too lengthy and should be more convenient to implement than the alternatives (the code necessary is roughly equivalent to an equals method). For example, Protocol Buffers requires maintaining definition files, reminiscent of IDL files from CORBA, to describe the objects.

Deprecating serialization entirely needs to be taken to the java-user mailing list, as there are quite a number of installations relying on it. If this is something that overlaps with Solr, then it would be good for the Solr folks to separate it out as a serialization library that could be used outside of the Solr server. This would be a good idea for most of the Solr functionality; otherwise there would seem to be redundant development occurring.

I'll finish up the Externalizable patch once #2391 (IndexReader.clone) is completed, as that is something that needs feedback and testing to ensure it's workable for 2.9, whereas Externalizable is somewhat easier.

@asfimport
Copy link
Author

Doug Cutting (@cutting) (migrated from JIRA)

> shift the problem around but do not really solve the underlying issues

That's the idea, actually, to shift it out of the core into contrib. We could use Externalizable there, with no XML.

> Deprecating serialization entirely needs to be taken to the java-user mailing list as there are quite a number of installations relying on it.

No, we make decisions on the java-dev mailing list. Also, it won't go away; folks might just have to update their code to use different APIs if and when they upgrade to 3.0.

@asfimport
Copy link
Author

Wolf Siberski (migrated from JIRA)

Thanks to Doug and Jason for your constructive feedback. Let me first clarify the purpose and scope of the patch. IMHO, the discussion about serialization in Lucene is not clear-cut at all. My opinion is that moving all distribution-related code out of the core leads to a cleaner separation of concerns and thus is better design. On the other hand, by removing Serializable we limit the Lucene application space at least a bit (e.g., no support for dynamic class loading) and abandon the advantages default Java serialization offers. Therefore the patch is to be taken as a contribution to explore the design space (as Michael's patch on custom readers explored the Serializable option), and not as a full-fledged solution proposal.

> [Doug] The removal of Serializable will break compatibility, so it must be well-advertised.

Sure. I removed Serializable to catch all related errors; this was not meant as a proposal for a final patch.

> [Doug] The Searchable API was designed for remote use and does not include HitCollector-based access.
Currently Searchable does include a HitCollector-based search method, although the comment says that 'HitCollector-based access to remote indexes is discouraged'. The only reason to provide an implementation is that I wanted to keep the Searchable contract. Is remote access the only purpose of Searchable/MultiSearcher? Is it ok to break compatibility with respect to these classes? IMHO a significant fraction of the current clumsiness in the remote package stems from my attempt to fully preserve the Searchable API.

> [Doug] Weighting, and hence ranking, does not appear to be implemented correctly by this patch.
True, I was a bit too fast here. We could either solve it along the lines you propose, or revert to passing the Weight again instead of the Query. The issue IMHO is orthogonal to the Serializable discussion and more related to the question of how a good remote search interface and protocol should look.

> [Jason] Restricting people to XML will probably not be suitable though.
The patch does not limit serialization to XML. It just requires that encoding to and decoding from String is implemented, no matter how. I used XML/XStream as a proof-of-concept implementation, but don't propose to make XML mandatory. The main reason for introducing the Serializer interface was to emphasize that XML/XStream is just one implementation option. Actually, the current approach feels like at least one indirection more than required; for a final solution I would try to come up with a better design.

> [Jason] It seems the alternative solutions to serialization simply shift the problem around but do not really solve
> the underlying issues (speed, versioning, writing custom serialization code, and perhaps dynamic classloading).
In a sense, the problem is indeed 'only' shifted around and not yet solved. The good thing about this shift is that the Lucene core becomes decoupled from these issues. The only real limitation I see is that dynamic classloading can't be realized anymore.

With respect to speed, I don't think that encoding/decoding is a significant performance factor in distributed search, but this would need to be benchmarked. With respect to versioning, my patch still keeps all options open. More importantly, Lucene users can now decide whether they need compatibility between different versions, and roll their own encoding/decoding if they do. Of course, if they are willing to contribute and maintain custom serializers that preserve back compatibility, they can do so in contrib just as well as they could have in the core. Custom serialization is still possible, although the standard Java serialization framework can't be used for it anymore, and I admit that this is a disadvantage.

@asfimport
Copy link
Author

Doug Cutting (@cutting) (migrated from JIRA)

> Therefore the patch is to be taken as contribution to explore the design space [ ... ]

Yes, and it is much appreciated for that. Thanks again!

> Currently Searchable does include a HitCollector-based search method [ ... ]

You're right. I misremembered. This dates back to the origin of Searchable.

http://svn.apache.org/viewvc?view=rev&revision=149813

Personally, I think it would be reasonable for a distributed implementation to throw an exception if one tries to use a HitCollector.
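
For illustration, one shape this could take against the 2.x Searchable interface (the adapter name echoes Wolf's patch; the body is hypothetical):

import java.io.IOException;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Weight;

// Opts out of HitCollector-based access; the remaining Searchable
// methods would delegate over the wire.
public abstract class RemoteSearcherAdapter implements Searchable {
  public void search(Weight weight, Filter filter, HitCollector results)
      throws IOException {
    throw new UnsupportedOperationException(
        "HitCollector-based search is not supported for remote searching");
  }
}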

> We could either solve it along the line you propose, or revert to pass the Weight again instead of the Query.

Without using an introspection-based serialization like Java serialization, it would be difficult to pass a Weight over the wire using public APIs, since most implementations are not public. But since Weights are constructed via a standard protocol, the method I outlined could work.

@asfimport
Copy link
Author

Jason Rutherglen (migrated from JIRA)

Won't be working on these and they're old
