Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count number of documents with at least one ignored field #109146

Open
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented May 29, 2024

When it comes to counting documents including at least one ignored field, aggregations might
result in poor performance because of having to aggregate too many documents over multiple
indices targeted either though index patterns or data streams.

As part of the effort behind making ingestion of logs more reliable and improve on explaining indexing
issues, we would like to provide users with the fraction of documents including at least one
ignored field. For this purpose we introduce a new index metric, ignored_field, to the index stats api which allows fetching statistics about ignored fields.

Ignored field stats includes the following:

  • total_docs: total number of documents in the index
  • docs_with_ignored_fields: total number of documents with at least one ignored field
  • sum_doc_freq_terms_ignored_fields: the sum of term frequencies for the _ignored field

This solution has some drawbacks anyway:

  1. it ignores deleted documents (it would need to filter out and subtract deleted documents which would make stats latency worse)
  2. it ignores documents which are still in the IndexWriter buffer waiting to be flushed to Lucene segments

Resolves #108092

@salvatore-campagna salvatore-campagna added test-full-bwc Trigger full BWC version matrix tests :StorageEngine/Logs You know, for Logs labels May 29, 2024
@salvatore-campagna salvatore-campagna self-assigned this May 29, 2024
@salvatore-campagna salvatore-campagna marked this pull request as ready for review May 29, 2024 08:24
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @salvatore-campagna, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @salvatore-campagna, I've updated the changelog YAML for you.

@salvatore-campagna salvatore-campagna requested a review from a team as a code owner May 29, 2024 08:34
return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);
} catch (IOException e) {
logger.trace(() -> "IO error while getting the number of documents with ignored fields", e);
} catch (UnsupportedOperationException e) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens for source only indices which do not include inverted index and doc values.


private long tryGetNumberOfDocumentsWithIgnoredFields(final LeafReaderContext readerContext) {
try {
return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);
Copy link
Member

@martijnvg martijnvg May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this invoke getDocCount() instead? Which returns: Returns the number of documents that have at least one term for this field, which matches more closely with the method name and what we are trying to include in the doc stats?

@@ -136,7 +136,6 @@
import static java.util.Collections.emptyMap;
import static java.util.Collections.singletonList;
import static org.elasticsearch.action.support.WriteRequest.RefreshPolicy.IMMEDIATE;
import static org.elasticsearch.indices.cluster.AbstractIndicesClusterStateServiceTestCase.awaitIndexShardCloseAsyncTasks;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will put this back.

@@ -211,6 +213,12 @@ public DocsStats docStats() {
}
}

public IgnoredFieldStats ignoredFieldStats() {
try (Searcher searcher = acquireSearcher("ignored_field", SearcherScope.INTERNAL)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg this is the logic I was talking about. I had to add a new source to avoid FrozenStorageDeciderIT#testScale fail. If I use "doc_stats" as source if fails at the assertion

assert false : "doc stats are eagerly loaded";

in FrozenEngine#openSearcher

@@ -257,6 +257,7 @@ private Engine.Searcher openSearcher(String source, SearcherScope scope) throws
assert false : "refresh_needed is always false";
case "segments":
case "segments_stats":
case "ignored_field":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg I added this entry which, if I understand correctly results in not opening a new searcher but just reusing one that is already open.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jun 4, 2024

Full BWC started to fail after I merged main I think something has been merged introducing some serialization issue. Probably another PR was merged without testing for full BWC.

@salvatore-campagna
Copy link
Contributor Author

I don't see a way to filter stats based on tier other than using the tier_preference.

Filtering on it anyway returning ignored field stats only in case the tier preference is data_hot, data_warm or data_content means that:

  • we would be able to get such stats only for data streams
  • we would be able to get such stats only for indices created by version 8 and above since the tier preference is required and a default value is injected only for indices created by version 8 and above

@salvatore-campagna
Copy link
Contributor Author

Another option would be to just return an empty result when the request hits a node whose role is data_frozen or data_cold. This way in a cluster hitting multiple nodes with different (data) roles we would return nothing from nodes with slow storage and the actual stats from other nodes like data_hot, data_warm and data_content.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jun 6, 2024

Summarising what we discussed in a meeting about how to proceed.

We will implement this feature keeping the ignored_field stats request and adding a filter that, by default, filters out results when fetching stats whose fetching might be slow (data_frozen, data_cold, and snapshots). We will give users an option to eventually ask for ignored field stats for such (slow) cases too, adding an additional option...something like allow_slow_queries=true.

@salvatore-campagna
Copy link
Contributor Author

With this PR: #101373 we introduced doc values for the _ignored field. That should make the aggregation on _ignored field faster (for new indices).

@felixbarny
Copy link
Member

In addition to enabling aggregations, it also makes it faster to count documents that have a _ignored field with the exists query. Since the dataset quality page has a time filter, it makes more sense to use an exists query in combination with a range query on the timestamp rather than getting stats from an entire index.

@flash1293
Copy link
Contributor

The reason we started looking into this approach is that we can't do read queries to collect telemetry with the current permission setup in Kibana. I realize that this is not a strong point in favor of doing it, back then our reasoning was that when it helps this use case and is a helpful feature in general, we should do it.

@breskeby breskeby removed the request for review from a team February 21, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :StorageEngine/Logs You know, for Logs Team:StorageEngine test-full-bwc Trigger full BWC version matrix tests v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_ignored meta field index stats
6 participants