Count number of documents with at least one ignored field #109146

salvatore-campagna · 2024-05-29T08:23:57Z

When counting documents that include at least one ignored field, aggregations can
perform poorly because they have to aggregate too many documents across multiple
indices targeted either through index patterns or data streams.

As part of the effort to make log ingestion more reliable and to better explain indexing
issues, we would like to provide users with the fraction of documents that include at least one
ignored field. For this purpose we introduce a new index metric, ignored_field, to the index stats API, which allows fetching statistics about ignored fields.

Ignored field stats include the following:

total_docs: total number of documents in the index
docs_with_ignored_fields: total number of documents with at least one ignored field
sum_doc_freq_terms_ignored_fields: the sum of document frequencies of all terms in the _ignored field

This solution has some drawbacks, however:

it ignores deleted documents (it would need to filter out and subtract deleted documents which would make stats latency worse)
it ignores documents which are still in the IndexWriter buffer waiting to be flushed to Lucene segments
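
As a rough illustration of how a consumer could turn these stats into the fraction mentioned above, here is a minimal Java sketch. The class IgnoredFieldStatsSketch and its methods are hypothetical stand-ins that mirror the field names listed above; they are not the classes added by this PR.

```java
// Hypothetical holder mirroring the stats fields described above; not the PR's actual class.
final class IgnoredFieldStatsSketch {
    private final long totalDocs;
    private final long docsWithIgnoredFields;
    private final long sumDocFreqTermsIgnoredFields;

    IgnoredFieldStatsSketch(long totalDocs, long docsWithIgnoredFields, long sumDocFreqTermsIgnoredFields) {
        this.totalDocs = totalDocs;
        this.docsWithIgnoredFields = docsWithIgnoredFields;
        this.sumDocFreqTermsIgnoredFields = sumDocFreqTermsIgnoredFields;
    }

    // Stats from several shards or indices are combined by summing each counter.
    IgnoredFieldStatsSketch add(IgnoredFieldStatsSketch other) {
        return new IgnoredFieldStatsSketch(
            totalDocs + other.totalDocs,
            docsWithIgnoredFields + other.docsWithIgnoredFields,
            sumDocFreqTermsIgnoredFields + other.sumDocFreqTermsIgnoredFields
        );
    }

    // Fraction of documents with at least one ignored field; 0.0 when there are no documents.
    double fractionWithIgnoredFields() {
        return totalDocs == 0 ? 0.0 : (double) docsWithIgnoredFields / totalDocs;
    }

    public static void main(String[] args) {
        IgnoredFieldStatsSketch shard0 = new IgnoredFieldStatsSketch(100, 5, 7);
        IgnoredFieldStatsSketch shard1 = new IgnoredFieldStatsSketch(300, 15, 20);
        System.out.println(shard0.add(shard1).fractionWithIgnoredFields());
    }
}
```

Because deleted documents and documents still in the IndexWriter buffer are not accounted for, a fraction computed this way is an approximation.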

Resolves #108092

elasticsearchmachine · 2024-05-29T08:24:27Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2024-05-29T08:25:18Z

Hi @salvatore-campagna, I've created a changelog YAML for you.

elasticsearchmachine · 2024-05-29T08:27:10Z

Hi @salvatore-campagna, I've updated the changelog YAML for you.

salvatore-campagna · 2024-05-30T08:19:09Z

server/src/main/java/org/elasticsearch/index/engine/Engine.java

+            return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);
+        } catch (IOException e) {
+            logger.trace(() -> "IO error while getting the number of documents with ignored fields", e);
+        } catch (UnsupportedOperationException e) {


This happens for source-only indices, which do not include an inverted index or doc values.

This reverts commit aebeb19.

martijnvg · 2024-05-30T09:01:53Z

server/src/main/java/org/elasticsearch/index/engine/Engine.java

+
+    private long tryGetNumberOfDocumentsWithIgnoredFields(final LeafReaderContext readerContext) {
+        try {
+            return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);


Should this invoke getDocCount() instead? Its Javadoc says it "Returns the number of documents that have at least one term for this field", which matches more closely the method name and what we are trying to include in the doc stats.
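
To make the difference between the two statistics concrete, here is a small plain-Java model. It does not call Lucene; IgnoredCountsModel and its methods are hypothetical names, and each document is represented only by the set of terms it would have in the _ignored field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the two statistics; it does not use Lucene.
final class IgnoredCountsModel {

    // Analogue of the sum of document frequencies: every (document, distinct term) pair counts once.
    static long sumDocFreq(List<Set<String>> ignoredTermsPerDoc) {
        long sum = 0;
        for (Set<String> terms : ignoredTermsPerDoc) {
            sum += terms.size();
        }
        return sum;
    }

    // Analogue of the doc count: documents that have at least one term for the field.
    static long docCount(List<Set<String>> ignoredTermsPerDoc) {
        long count = 0;
        for (Set<String> terms : ignoredTermsPerDoc) {
            if (!terms.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Three sample documents: two ignored fields, none, and one ignored field.
    static List<Set<String>> sample() {
        Set<String> first = new HashSet<>(Arrays.asList("host.ip", "message"));
        Set<String> second = new HashSet<>();
        Set<String> third = new HashSet<>(Arrays.asList("message"));
        return Arrays.asList(first, second, third);
    }

    public static void main(String[] args) {
        System.out.println("sumDocFreq=" + sumDocFreq(sample()) + " docCount=" + docCount(sample()));
    }
}
```

In this model a document with two ignored fields adds 2 to the sum of document frequencies but only 1 to the doc count, so only the latter equals the number of documents with at least one ignored field.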

salvatore-campagna · 2024-06-03T13:49:07Z

server/src/test/java/org/elasticsearch/search/SearchServiceTests.java

@@ -136,7 +136,6 @@
 import static java.util.Collections.emptyMap;
 import static java.util.Collections.singletonList;
 import static org.elasticsearch.action.support.WriteRequest.RefreshPolicy.IMMEDIATE;
-import static org.elasticsearch.indices.cluster.AbstractIndicesClusterStateServiceTestCase.awaitIndexShardCloseAsyncTasks;


Will put this back.

salvatore-campagna · 2024-06-03T14:04:57Z

server/src/main/java/org/elasticsearch/index/engine/Engine.java

@@ -211,6 +213,12 @@ public DocsStats docStats() {
        }
    }

+    public IgnoredFieldStats ignoredFieldStats() {
+        try (Searcher searcher = acquireSearcher("ignored_field", SearcherScope.INTERNAL)) {


@martijnvg this is the logic I was talking about. I had to add a new source to keep FrozenStorageDeciderIT#testScale from failing. If I use "doc_stats" as the source, it fails at the assertion

assert false : "doc stats are eagerly loaded";

in FrozenEngine#openSearcher

salvatore-campagna · 2024-06-03T14:27:18Z

x-pack/plugin/core/src/main/java/org/elasticsearch/index/engine/frozen/FrozenEngine.java

@@ -257,6 +257,7 @@ private Engine.Searcher openSearcher(String source, SearcherScope scope) throws
                assert false : "refresh_needed is always false";
            case "segments":
            case "segments_stats":
+            case "ignored_field":


@martijnvg I added this entry which, if I understand correctly, results in not opening a new searcher but just reusing one that is already open.
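
The behaviour described here can be modelled with a simplified switch on the searcher source. This is only a sketch under the assumption stated in the comment above (that the listed sources reuse an already open reader); SearcherSourceSketch and mayOpenReader are hypothetical names and do not reproduce the FrozenEngine code.

```java
// Simplified, hypothetical model of dispatching on the searcher source string.
final class SearcherSourceSketch {

    // Returns false for sources that should only reuse an already open reader,
    // true for sources that are allowed to open a new one.
    static boolean mayOpenReader(String source) {
        switch (source) {
            case "segments":
            case "segments_stats":
            case "ignored_field":
                // Stats-style sources: assumed to never trigger opening a reader.
                return false;
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(mayOpenReader("ignored_field"));
        System.out.println(mayOpenReader("search"));
    }
}
```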

salvatore-campagna · 2024-06-04T12:41:15Z

Full BWC started to fail after I merged main. I think something has been merged that introduces a serialization issue; probably another PR was merged without being tested against full BWC.

salvatore-campagna · 2024-06-05T11:10:57Z

I don't see a way to filter stats based on tier other than using the tier_preference.

Filtering on it, however, and returning ignored field stats only when the tier preference is data_hot, data_warm or data_content, means that:

we would be able to get such stats only for data streams
we would be able to get such stats only for indices created by version 8 and above since the tier preference is required and a default value is injected only for indices created by version 8 and above

salvatore-campagna · 2024-06-05T14:14:56Z

Another option would be to just return an empty result when the request hits a node whose role is data_frozen or data_cold. This way, when a request hits multiple nodes with different (data) roles, we would return nothing from nodes with slow storage and the actual stats from the other nodes, such as data_hot, data_warm and data_content.

salvatore-campagna · 2024-06-06T06:55:15Z

Summarising what we discussed in a meeting about how to proceed.

We will implement this feature keeping the ignored_field stats request and adding a filter that, by default, excludes results whose retrieval might be slow (data_frozen, data_cold, and snapshots). Users will be able to opt in to ignored field stats for such (slow) cases too, through an additional option, something like allow_slow_queries=true.
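
A minimal sketch of the filter agreed on above, written as plain Java. SlowTierFilterSketch, shouldCollect and roles are hypothetical names; the sketch assumes that a node carrying any slow role is skipped, and it does not model the snapshot case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter deciding whether ignored field stats are collected on a node.
final class SlowTierFilterSketch {

    // Node roles treated as slow in the discussion above.
    private static final Set<String> SLOW_ROLES = new HashSet<>(Arrays.asList("data_frozen", "data_cold"));

    // Collect when the caller opted in, or when the node has no slow role.
    // Assumption: a node with any slow role counts as slow, even if it has other roles too.
    static boolean shouldCollect(Set<String> nodeRoles, boolean allowSlowQueries) {
        if (allowSlowQueries) {
            return true;
        }
        for (String role : nodeRoles) {
            if (SLOW_ROLES.contains(role)) {
                return false;
            }
        }
        return true;
    }

    // Convenience helper for building a role set.
    static Set<String> roles(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        System.out.println(shouldCollect(roles("data_hot"), false));
        System.out.println(shouldCollect(roles("data_frozen"), false));
        System.out.println(shouldCollect(roles("data_frozen"), true));
    }
}
```

With the default (no opt-in), nodes with slow storage contribute nothing and the remaining nodes return their actual stats.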

salvatore-campagna · 2025-02-06T15:54:45Z

With PR #101373 we introduced doc values for the _ignored field. That should make aggregations on the _ignored field faster (for new indices).

felixbarny · 2025-02-06T16:18:57Z

In addition to enabling aggregations, it also makes it faster to count documents that have an _ignored field using the exists query. Since the dataset quality page has a time filter, it makes more sense to use an exists query in combination with a range query on the timestamp rather than getting stats from an entire index.
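
One way to express that combination is a bool query with two filter clauses, which could be sent as the body of a count request. The Java sketch below only builds the JSON string; IgnoredCountQuerySketch and countBody are hypothetical names, the timestamp field and time bounds are assumptions, and the inputs are assumed to be trusted since nothing is escaped.

```java
// Hypothetical helper that builds a query body combining an exists query on _ignored
// with a range query on a timestamp field.
final class IgnoredCountQuerySketch {

    // Both clauses go into a bool filter, so they restrict matches without scoring.
    static String countBody(String timestampField, String gte, String lte) {
        return "{\"query\":{\"bool\":{\"filter\":["
            + "{\"exists\":{\"field\":\"_ignored\"}},"
            + "{\"range\":{\"" + timestampField + "\":{\"gte\":\"" + gte + "\",\"lte\":\"" + lte + "\"}}}"
            + "]}}}";
    }

    public static void main(String[] args) {
        // Example window: the last 15 minutes on an assumed @timestamp field.
        System.out.println(countBody("@timestamp", "now-15m", "now"));
    }
}
```

Running the same request without the exists clause would give the total for the window, from which the fraction of documents with ignored fields can be derived.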

flash1293 · 2025-02-06T16:45:23Z

The reason we started looking into this approach is that we can't do read queries to collect telemetry with the current permission setup in Kibana. I realize that this is not a strong point in favor of doing it; back then our reasoning was that if it helps this use case and is a helpful feature in general, we should do it.

feature: count number of documents with at least one ignored field

ea1d2c8

salvatore-campagna added the test-full-bwc and :StorageEngine/Logs labels May 29, 2024

salvatore-campagna self-assigned this May 29, 2024

salvatore-campagna marked this pull request as ready for review May 29, 2024 08:24

elasticsearchmachine added Team:StorageEngine v8.15.0 labels May 29, 2024

salvatore-campagna added the >feature label May 29, 2024

Update docs/changelog/109146.yaml

5b0bf9c

Update docs/changelog/109146.yaml

0516fa7

fix: adding missing 'Logs' to changelog schema

b468362

salvatore-campagna requested a review from a team as a code owner May 29, 2024 08:34

salvatore-campagna added 10 commits May 29, 2024 11:05

fix: constructor invokation

46571dd

fix: constructor invokation

7419e0b

docs: update docs stats page

4fffc73

fix: skip null check

0d88613

fix: add missing docs_with_ignored_fields

50911a7

fix: a few more tests

19e2bc2

Merge branch 'main' into feature/108092-docs-with-ignored-fields

0952941

fix: update transport version id after main merge

9ed9505

fix: use -1 as init value

aebeb19

Merge branch 'main' into feature/108092-docs-with-ignored-fields

c9639d7

salvatore-campagna commented May 30, 2024

salvatore-campagna added 3 commits May 30, 2024 10:21

note: souce only snapshot unsupported

b0f61ae

Revert "fix: use -1 as init value"

8c92be8

This reverts commit aebeb19.

nit: remove this

bb74c2f

martijnvg reviewed May 30, 2024

fix: use ignored field source when acquiring searcher

33b943e

salvatore-campagna commented Jun 3, 2024

salvatore-campagna added 2 commits June 3, 2024 16:11

fix: restore unwanted changes

b4f4748

fix: reuse an already open searcher for searchable snapshot

e656e7c

salvatore-campagna commented Jun 3, 2024

salvatore-campagna added 5 commits June 3, 2024 16:49

fix: remove docs from cluster stats

d842108

fix: make method names consistent

0ec7272

fix: explicitly initialize value

2ca4842

nit: rename capability

c663ff5

fix: remove unused code

573377d

salvatore-campagna requested a review from martijnvg June 3, 2024 15:07

salvatore-campagna added 3 commits June 4, 2024 08:51

Merge branch 'main' into feature/108092-docs-with-ignored-fields

bbb9cae

docs: improve docs

4a6e586

Merge branch 'main' into feature/108092-docs-with-ignored-fields

38da229

salvatore-campagna added 2 commits June 4, 2024 16:15

Merge branch 'main' into feature/108092-docs-with-ignored-fields

2b795ae

Merge branch 'main' into feature/108092-docs-with-ignored-fields

72e6935

elasticsearchmachine added v8.16.0 and removed v8.15.0 labels Jul 4, 2024

mark-vieira added v9.0.0 and removed v8.16.0 labels Sep 11, 2024

elasticsearchmachine added v9.1.0 and removed v9.0.0 labels Jan 30, 2025

breskeby removed the request for review from a team February 21, 2025 11:40
