Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose iterator over query terms in TermInSetQuery #12280

Closed
wants to merge 4 commits into from

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented May 8, 2023

Description

I'd like to propose we add an API to TermInSetQuery that exposes an iterator over the query terms. This is useful for extending TermInSetQuery. One concrete use-case for this is needing to change the way term intersection happens with the indexed terms dictionary to support bloom filters, as described in this email thread.

I don't think there's any harm in exposing this, but am interested in feedback of course! This abstraction decouples the current prefix-coding implementation details, so it seems clean.

@rmuir
Copy link
Member

rmuir commented May 9, 2023

Please lets not go this path. It is a mutlitermquery, if you want to change how it works behind the scenes, you "plugin" with RewriteMethod

@gsmiller
Copy link
Contributor Author

gsmiller commented May 9, 2023

Thanks @rmuir. It would be ideal if we could do this through RewriteMethod, but I'm not sure how we can actually accomplish that. The problem is in the implementation of getTerms, as defined in TermInSetQuery. We can plug in through RewriteMethod#getTerms, but we still need access to an iterator of the query terms. I'll draft up a sandbox query that might help illustrate the issue and we can discuss further. If there's a better way to go about this, happy to explore it. Thanks again!

Update: I added a sandbox query to this PR just to demonstrate the use-case for extending TermInSetQuery and overriding getTerms. It shows how this proposed getQueryTerms method would be used. Happy to consider other ways we might support this sort of use-case. Thanks again.

* TermsEnum#seekCeil(BytesRef)} to produce a terms iterator, which is compatible with {@code
* BloomFilteringPostingsFormat}.
*/
public class PKTermInSetQuery extends TermInSetQuery {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is for demo purposes only. I'm not suggesting we merge it as part of this PR. I only want to demonstrate how a class might leverage getQueryTerms.

@rmuir
Copy link
Member

rmuir commented May 9, 2023

I still think it doesn't make sense to me to expose this. As i said on the dev list, your problem is that you use a custom postings format and you want it to accelerate the intersection.

The cleanest way to do this, is to handoff the intersection to the postingsformat directly, rather than worry about seekCeil/seekExact and subclassing queries or exposing stuff. It should give a performance improvement using the default postings format as well (at least it did for other queries when mikemccand added it)

So, IMO we should try to fix this query to use Terms.intersect() [see #12176], then override Terms.intersect for the BloomPostingsFormat to make use of the bloom filters to speed up intersection.

@gsmiller
Copy link
Contributor Author

gsmiller commented May 9, 2023

Got it, thanks @rmuir. I hadn't seen your dev list reply yet. This all makes sense. I'll close this out and have a look at leveraging intersect. Seems like a better path forward. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants