Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid attempting to load the same empty field twice in fetch phase #107551

Merged
merged 11 commits into from
Apr 17, 2024

Conversation

javanna
Copy link
Member

@javanna javanna commented Apr 16, 2024

During the fetch phase, there's a number of stored fields that are requested explicitly or loaded by default. That information is included in StoredFieldsSpec that each fetch sub phase exposes.

We attempt to provide stored fields that are already loaded to the fields lookup that scripts as well as value fetchers use to load field values (via SearchLookup). This is done in PreloadedFieldLookupProvider. The current logic makes available values for fields that have been found, so that scripts or value fetchers that request them don't load them again ad-hoc. What happens though for stored fields that don't have a value for a specific doc, is that they are treated like any other field that was not requested, and loaded again, although they will not be found, which causes overhead.

This change makes available to PreloadedFieldLookupProvider the list of required stored fields, so that it can better distinguish between fields that we already attempted to load (although we may not have found a value for them) and those that need to be loaded ad-hoc (for instance because a script is requesting them for the first time).

This is an existing issue, that has become evident as we moved fetching of metadata fields to FetchFieldsPhase, that relies on value fetchers, and hence on SearchLookup. We end up attempting to load default metadata fields (_ignored and _routing) twice when they are not present in a document, which makes us call LeafReader#storedFields additional times for the same document providing a SingleFieldVisitor that will never find a value.

Another existing issue that this PR fixes is for the FetchFieldsPhase to extend the StoredFieldsSpec that it exposes to include the metadata fields that the phase is now responsible for loading. That results in _ignored being included in the output of the debug stored fields section when profiling is enabled. The fact that it was previously missing is an existing bug (it was missing in StoredFieldLoader#fieldsToLoad).

Yet another existing issues that this PR fixes is that _id has been until now always loaded on demand when requested via fetch fields or script. That is because it is not part of the preloaded stored fields that the fetch phase passes over to the PreloadedFieldLookupProvider. That causes overhead as the field has already been loaded, and should not be loaded once again when explicitly requested.

During the fetch phase, there's a number of stored fields that are requested explicitly or loaded by default.
That information is included in `StoredFieldsSpec` that each fetch sub phase exposes.

We attempt to provide stored fields that are already loaded to the fields lookup that scripts as well as value
fetchers use to load field values (via SearchLookup). This is done in PreloadedFieldLookupProvider. The current logic
makes available values for fields that have been found, so that scripts or value fetchers that request them don't load
them again ad-hoc. What happens though for stored fields that don't have a value for a specific doc, is that they
are treated like any other field that was not requested, and loaded again, which causes overhead.

This change makes available to PreloadedFieldLookupProvider the list of required stored fields, so that it can better
distinguish between fields that we already attempted to load (although we may not have found a value for them) and those
that need to be loaded ad-hoc (for instance because a script is requesting them for the first time).

This is an existing issue, that has become evident as we moved fetching of metadata fields to FetchFieldsPhase, that relies
on value fetchers, and hence on SearchLookup. We end up attempting to load default metadata fields (`_ignored` and `_routing`)
twice when they are not present in a document, which makes us call `LeafReader#storedFields` additional times for the same
document providing a `SingleFieldVisitor` that will never find a value.
@@ -1051,7 +1051,7 @@ And here is the fetch profile:
"load_source_count": 5
},
"debug": {
"stored_fields": ["_id", "_routing", "_source"]
"stored_fields": ["_id", "_ignored", "_routing", "_source"]
Copy link
Member Author

@javanna javanna Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a consequence of exposing the correct stored fields spec in FetchFieldsPhase, that takes stored metadata fields into account. _ignored will be removed again once it's no longer stored. The problem is that the field should have been there since its introduction, but it was never added to StoredFieldLoader#fieldsToLoad which is where the other three fields come from. Note that StoredFieldsPhase did not include in its stored fields spec the default metadata fields that it always requested.

For posterity, why is _type not there if fetch fields phase requests it by default? Because it is not mapped in recent clusters, and it is only part of the stored fields spec, hence loaded, when it is mapped, which is the case only in very old archive indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

private Set<String> preloadedStoredFields;
private Map<String, List<Object>> storedFields;
private LeafFieldLookupProvider backUpLoader;
private Supplier<LeafFieldLookupProvider> loaderSupplier;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the chance to make these private and add package private setter/getter methods when needed. I find that it clarifies who does what and when.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

@salvatore-campagna salvatore-campagna Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question...do I understand it correctly that preloadedStoredFieldValues.keySet() is the same as preloadedStoredFieldNames? Or is one a subset of the other?

Because later on I see we do

if (preloadedStoredFieldNames.get().contains(field)) {
            fieldLookup.setValues(preloadedStoredFieldValues.get(field));

which looks like as "if the name is there and the field was preloaded then just get the preloaded values"...

I see that one comes from hit.loadedFields() and the other from StoredFieldsSpec but I was wondering if they always include the same fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they were the same we would not need a separate set. preloadedStoredFieldNames includes all the fields that we know we will attempt to load for all documents. Those include fields that don't have a value, while preloadedStoredFieldValues contains only those fields that were found in the current doc.

The overhead was caused by trying to load _ignored and _routing ad-hoc when requested for all docs that did not have a value for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thank you.


@Override
public void populateFieldLookup(FieldLookup fieldLookup, int doc) throws IOException {
String field = fieldLookup.fieldType().name();
if (storedFields.containsKey(field)) {

if (field.equals(IdFieldMapper.NAME)) {
Copy link
Member Author

@javanna javanna Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_id was previously always loaded via the backup loader when requested via script or via fetch fields. That causes overhead as the field is already available at all times, no need to go and fetch it from stored fields!

This is because it is not part of the ordinary loaded stored fields, hence it needs to be provided and handled separately.

@javanna javanna added :Search/Search Search-related issues that do not fall into other categories >bug labels Apr 17, 2024
@javanna javanna marked this pull request as ready for review April 17, 2024 08:08
@elasticsearchmachine
Copy link
Collaborator

Hi @javanna, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Apr 17, 2024
@@ -1051,7 +1051,7 @@ And here is the fetch profile:
"load_source_count": 5
},
"debug": {
"stored_fields": ["_id", "_routing", "_source"]
"stored_fields": ["_id", "_ignored", "_routing", "_source"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

private Set<String> preloadedStoredFields;
private Map<String, List<Object>> storedFields;
private LeafFieldLookupProvider backUpLoader;
private Supplier<LeafFieldLookupProvider> loaderSupplier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

this.preloadedStoredFields = preloadedStoredFields;
}

void setStoredFields(String id, Map<String, List<Object>> storedFields) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe setPreloadedStoredFieldValues and the one above is setPreloadedStoredFieldsNames. That way it's clear these are mirror of eachother.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done.

@salvatore-campagna
Copy link
Contributor

Thanks Luca, I just left a question to better understand how this works. Everything else LGTM.

@javanna javanna added v8.14.0 auto-backport Automatically create backport pull requests when merged labels Apr 17, 2024
@javanna javanna merged commit 223e7f8 into elastic:main Apr 17, 2024
14 checks passed
@javanna javanna deleted the fix/fields_lookup_stored_spec branch April 17, 2024 17:37
javanna added a commit to javanna/elasticsearch that referenced this pull request Apr 17, 2024
…lastic#107551)

During the fetch phase, there's a number of stored fields that are requested explicitly or loaded by default. That information is included in `StoredFieldsSpec` that each fetch sub phase exposes.

We attempt to provide stored fields that are already loaded to the fields lookup that scripts as well as value fetchers use to load field values (via `SearchLookup`). This is done in `PreloadedFieldLookupProvider.` The current logic makes available values for fields that have been found, so that scripts or value fetchers that request them don't load them again ad-hoc. What happens though for stored fields that don't have a value for a specific doc, is that they are treated like any other field that was not requested, and loaded again, although they will not be found, which causes overhead.

This change makes available to `PreloadedFieldLookupProvider` the list of required stored fields, so that it can better distinguish between fields that we already attempted to load (although we may not have found a value for them) and those that need to be loaded ad-hoc (for instance because a script is requesting them for the first time).

This is an existing issue, that has become evident as we moved fetching of metadata fields to `FetchFieldsPhase`, that relies on value fetchers, and hence on `SearchLookup`. We end up attempting to load default metadata fields (`_ignored` and `_routing`) twice when they are not present in a document, which makes us call `LeafReader#storedFields` additional times for the same document providing a `SingleFieldVisitor` that will never find a value.

Another existing issue that this PR fixes is for the `FetchFieldsPhase` to extend the `StoredFieldsSpec` that it exposes to include the metadata fields that the phase is now responsible for loading. That results in `_ignored` being included in the output of the debug stored fields section when profiling is enabled. The fact that it was previously missing is an existing bug (it was missing in `StoredFieldLoader#fieldsToLoad`).

Yet another existing issues that this PR fixes is that `_id` has been until now always loaded on demand when requested via fetch fields or script. That is because it is not part of the preloaded stored fields that the fetch phase passes over to the `PreloadedFieldLookupProvider`. That causes overhead as the field has already been loaded, and should not be loaded once again when explicitly requested.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.14

javanna added a commit that referenced this pull request Apr 18, 2024
…107551)

During the fetch phase, there's a number of stored fields that are requested explicitly or loaded by default. That information is included in `StoredFieldsSpec` that each fetch sub phase exposes.

We attempt to provide stored fields that are already loaded to the fields lookup that scripts as well as value fetchers use to load field values (via `SearchLookup`). This is done in `PreloadedFieldLookupProvider.` The current logic makes available values for fields that have been found, so that scripts or value fetchers that request them don't load them again ad-hoc. What happens though for stored fields that don't have a value for a specific doc, is that they are treated like any other field that was not requested, and loaded again, although they will not be found, which causes overhead.

This change makes available to `PreloadedFieldLookupProvider` the list of required stored fields, so that it can better distinguish between fields that we already attempted to load (although we may not have found a value for them) and those that need to be loaded ad-hoc (for instance because a script is requesting them for the first time).

This is an existing issue, that has become evident as we moved fetching of metadata fields to `FetchFieldsPhase`, that relies on value fetchers, and hence on `SearchLookup`. We end up attempting to load default metadata fields (`_ignored` and `_routing`) twice when they are not present in a document, which makes us call `LeafReader#storedFields` additional times for the same document providing a `SingleFieldVisitor` that will never find a value.

Another existing issue that this PR fixes is for the `FetchFieldsPhase` to extend the `StoredFieldsSpec` that it exposes to include the metadata fields that the phase is now responsible for loading. That results in `_ignored` being included in the output of the debug stored fields section when profiling is enabled. The fact that it was previously missing is an existing bug (it was missing in `StoredFieldLoader#fieldsToLoad`).

Yet another existing issues that this PR fixes is that `_id` has been until now always loaded on demand when requested via fetch fields or script. That is because it is not part of the preloaded stored fields that the fetch phase passes over to the `PreloadedFieldLookupProvider`. That causes overhead as the field has already been loaded, and should not be loaded once again when explicitly requested.
@javanna javanna added v8.14.1 and removed v8.14.0 labels Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.14.1 v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants