Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding aggregations support for the _ignored field #101373

Merged
merged 123 commits into from
Apr 29, 2024

Conversation

eyalkoren
Copy link
Contributor

@eyalkoren eyalkoren commented Oct 26, 2023

Closes #59946

The actual implementation needs to take into account how we decide to implement #101153.
One of the original requirements was to stop making the _ignored field stored while adding doc_values to it. However, the current proposal of #101153 assumes that we can access the original order of the field's content, which means we must keep it stored as well. The eventual approach will determine the implementation of this change.

@eyalkoren eyalkoren added >feature :Search/Search Search-related issues that do not fall into other categories v8.12.0 labels Oct 26, 2023
@eyalkoren eyalkoren self-assigned this Oct 26, 2023
@eyalkoren eyalkoren changed the title Adding aggregations support for the _ignored field Adding aggregations support for the _ignored field Oct 26, 2023
@github-actions
Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Oct 26, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @eyalkoren, I've created a changelog YAML for you.

@Override
public IndexFieldData.Builder fielddataBuilder(FieldDataContext fieldDataContext) {
if (hasDocValues() == false) {
throw new IllegalArgumentException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when you're aggregating over a data stream that contains both old and new indices? Will this return a partial failure? How does Kibana/Lens handle the failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I will look for a way to add an integration test for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a general question about aggregations on a field in a data stream, I guess we can add a test that is not related to the _ignored field.
Something like:

  1. create a component template that has a mapping for field foo with "doc_values": false
  2. create a data stream index template that uses this mapping
  3. index a document
  4. update the component template so that the mapping for field foo is changed to "doc_values": true
  5. rollover the data stream
  6. index another document
  7. check aggregation on field foo

Would that cover it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would indeed get partial failures in this case, one failure for each shard belonging to the old indices, as they can't be aggregated on.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Nov 20, 2023

@javanna please take a look to see if this is the right direction.
Some specific input I'd be happy to get:

@javanna
Copy link
Member

javanna commented Nov 24, 2023

Only for the sake of #59946, we don't need to keep the _ignored field stored. However, for #101153 we may need it. What do you think we should do as part of this PR?

I am leaning towards removing the stored field, but I think this requires further discussion, because it is going to be very difficult to go back to relying on the order if we stop storing the field. We said we won't focus for now on the ignored reason, and I would add that we should try not to have positional logic around it when we do so. Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?

Can you provide an advice of how to create a test for that? EDIT: I proposed a test scenario, let me know whether you agree that it covers what we are looking for

I am not sure that we need to test for that. We know we are going to have partial failures in that case.

@eyalkoren
Copy link
Contributor Author

Thanks for the feedback @javanna 🙏

Could you make the changes necessary to retrieve the field from doc values and stop storing the field in the meantime?

I will give it a try 🙂

I am not sure that we need to test for that. We know we are going to have partial failures in that case.

OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?

@felixbarny
Copy link
Member

OK, then I guess the next question is - what do we do about that? Only document it as a caveat related to aggregating on existing data streams? Something else?

IMO, it would be enough to document that in the same place where you document that the field is supports aggregations now.

@eyalkoren eyalkoren added Team:obs-knowledge Meta label for Observability Knowledge team and removed Team:obs-knowledge Meta label for Observability Knowledge team labels Nov 29, 2023
@eyalkoren
Copy link
Contributor Author

eyalkoren commented Nov 29, 2023

@javanna the current state is non-functional, but I pushed so you can see what I have done so far.
I'd be happy for advice on how I can debug the fetch phase and specifically the invocation of valueFetcher(). I didn't find a unit test that invokes it yet...

What I know is that the FetchDocValuesContext that comes from the FetchContext is null and that the doc values list that comes from the SearchSourceBuilder is also null. Not sure if this is any help...

@@ -194,7 +194,7 @@ The API returns the following result:
"load_source_count": 5
},
"debug": {
"stored_fields": ["_id", "_ignored", "_routing", "_source"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not rerturned anymore because the field is not stored anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, I was expecting this change.

@@ -123,12 +123,12 @@ public void testWithIgnored() {
{
GetResponse getResponse = client().prepareGet("test", "1").get();
assertTrue(getResponse.isExists());
assertThat(getResponse.getField("_ignored"), nullValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not required...will restore it.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments, but I think this is very close.

@@ -194,7 +194,7 @@ The API returns the following result:
"load_source_count": 5
},
"debug": {
"stored_fields": ["_id", "_ignored", "_routing", "_source"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, I was expecting this change.

@salvatore-campagna
Copy link
Contributor

salvatore-campagna commented Apr 24, 2024

Ideally we should stop using skip in yaml tests for things that require versions above 8.14 and move to using cluster features. Anyway I see there are more yaml tests still using skip and I would prefer to handle those in a separate PR so that we apply changes to all yaml tests where that is required and avoid adding more stuff to this PR.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple questions, mostly around testing. LGTM otherwise! Great work!

@@ -140,7 +140,7 @@ profile fetch:
- gt: { profile.shards.0.fetch.breakdown.next_reader: 0 }
- gt: { profile.shards.0.fetch.breakdown.load_stored_fields_count: 0 }
- gt: { profile.shards.0.fetch.breakdown.load_stored_fields: 0 }
- match: { profile.shards.0.fetch.debug.stored_fields: [_id, _ignored, _routing, _source] }
- match: { profile.shards.0.fetch.debug.stored_fields: [_id, _routing, _source] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if with this change, the skip above needs updating? Isn't it surprising that this test runs? 8.14 returns the _ignored field while 8.15 does not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this but the other was ok.

import java.util.Locale;
import java.util.Map;

public class IgnoredMetaFieldRollingUpgradeIT extends ParameterizedRollingUpgradeTestCase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great test to have!

@@ -139,7 +139,7 @@ fetch nested source:
- gt: { profile.shards.0.fetch.breakdown.next_reader: 0 }
- gt: { profile.shards.0.fetch.breakdown.load_stored_fields_count: 0 }
- gt: { profile.shards.0.fetch.breakdown.load_stored_fields: 0 }
- match: { profile.shards.0.fetch.debug.stored_fields: [_id, _ignored, _routing, _source] }
- match: { profile.shards.0.fetch.debug.stored_fields: [_id, _routing, _source] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I wonder if we need to update the skip above, that's what I would expect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot about this commit that already made that change: 19db490

}
}

public void testIgnoredMetadataFieldFetch() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a duplicate of MetadatFetchingIT#testWithIgnored ? Is it needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment...I can remove the test in another PR.

@salvatore-campagna salvatore-campagna merged commit ee26295 into elastic:main Apr 29, 2024
14 checks passed
@@ -104,6 +104,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
public static final IndexVersion UPGRADE_TO_LUCENE_9_10 = def(8_503_00_0, Version.LUCENE_9_10_0);
public static final IndexVersion TIME_SERIES_ROUTING_HASH_IN_ID = def(8_504_00_0, Version.LUCENE_9_10_0);
public static final IndexVersion DEFAULT_DENSE_VECTOR_TO_INT8_HNSW = def(8_505_00_0, Version.LUCENE_9_10_0);
public static final IndexVersion DOC_VALUES_FOR_IGNORED_META_FIELD = def(8_505_00_1, Version.LUCENE_9_10_0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eyalkoren I've just spotted this - this should have incremented the NNN version of the version id, not the patch version. Check the comment right below the version constants for more info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@salvatore-campagna is the one two brought this to completion, thanks @thecoop for raising it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories :StorageEngine/Logs You know, for Logs Team:Search Meta label for search team Team:StorageEngine v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregating _ignored field values
10 participants