Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Lucene's PackedInt dependency from Cuckoo filter #74736

Merged
merged 8 commits into from
Jul 6, 2021

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Jun 30, 2021

Currently, Cuckoo filters use Lucene structures to hold in memory arrays of packed ints. This is problematic when we move to Lucene 9.0 as those classes have been heavily modified, In addition we use Lucene Input/output API to serialise / deserialise the data structure which will break as well as Lucene is changing the API endianness.

Therefore, this PR forks the Lucene data structure into a class called PackedArray. The main change is that this class will be using Elasticsearch streams to serialise / deserialise, therefore we remove the dependency in Lucene's endianness. WE are still using lucene PackedInts to handle backwards compatibility.

We added a ESRestTestCase that can trigger the creation of Cuckoo filters and run in a mixed cluster in order to check backwards compatibility.

relates to #74057

@iverase iverase requested review from nik9000 and romseygeek June 30, 2021 08:56
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 30, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're still relying on API code in lucene that won't be present in v9 so the mixed-cluster branches won't work here?

public void readBytes(byte[] b, int offset, int len) throws IOException {
in.readBytes(b, offset, len);
if (in.getVersion().before(Version.V_8_0_0)) {
final PackedInts.Reader reader = PackedInts.getReader(new DataInput() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PackedInts.getReader() doesn't exist in lucene 9, so is the plan to remove this further in a follow-up or do we need to rework it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be removed once we backport the change to 7.x

}
});
mutable.save(new DataOutput() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, save() is no longer in the lucene 9 API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be removed once we backport the change to 7.x

@iverase
Copy link
Contributor Author

iverase commented Jun 30, 2021

@romseygeek true but this is the first step. Once we move this to 7.x, we can remove the dependency in master?

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, so the dance goes something like: backport this to 7x, adjust the version checks to be before/after 7.15, remove the BWC branch from master entirely. LGTM then, thanks @iverase!

@iverase
Copy link
Contributor Author

iverase commented Jun 30, 2021

Yes, that's the dance


public void testSingleValuedString() throws Exception {
final Settings.Builder settings = Settings.builder()
.put(IndexMetadata.INDEX_NUMBER_OF_SHARDS_SETTING.getKey(), 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worth leaving a comment about why you chose 2 here and 12000 - 17000 and 5 below. I know they were fairly carefully chosen to get us to trigger the cuckoo filter. Its probably worth writing that out so future me won't blindly screw this up.

Oh!!!!!!!!!! Can we add to the profile results whether or not we used the cuckoo filter? Like a count of the number of times we used it? That way we can assert in this test that we hit it. Without that assertion this test could bit rot fairly easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment about the choice of the number of docs. Not sure about adding such implementations details into the profiler. I guess it will be nice if the agg reported the precision of the result if possible, then you will know if the egg never went into cuckoo filter if the result is fully precise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about adding such implementations details into the profiler. I guess it will be nice if the agg reported the precision of the result if possible, then you will know if the egg never went into cuckoo filter if the result is fully precise.

That's kind of the point of the profiler - if you've ever wondered "did it do x or y" - you can just make it tell you. But, yeah, maybe it'd be nice to add a precise_results boolean or something then there isn't really a need for the profiler here at all. Either of those can wait for a follow up - but i think something to make sure the test can assert that it hit the non-precise case is important soon-ish.

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

open #74817

@elasticmachine run elasticsearch-ci/part-2

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

@elasticmachine run elasticsearch-ci/part-2

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

@elasticmachine test this please

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

@elasticmachine test this please

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

@elasticmachine run elasticsearch-ci/part-2

@iverase
Copy link
Contributor Author

iverase commented Jul 1, 2021

@elasticmachine update branch

@iverase iverase merged commit 24ff395 into elastic:master Jul 6, 2021
@iverase iverase deleted the cuckooFilterImpl branch July 6, 2021 04:26
iverase added a commit that referenced this pull request Jul 6, 2021
After backporting #74736, remove the BwC serialisation for 8.0.
@iverase iverase mentioned this pull request Jul 6, 2021
16 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants