Count number of documents with at least one ignored field #109146
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Hi @salvatore-campagna, I've created a changelog YAML for you.
Hi @salvatore-campagna, I've updated the changelog YAML for you.
        return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);
    } catch (IOException e) {
        logger.trace(() -> "IO error while getting the number of documents with ignored fields", e);
    } catch (UnsupportedOperationException e) {
This happens for source-only indices, which include neither an inverted index nor doc values.
private long tryGetNumberOfDocumentsWithIgnoredFields(final LeafReaderContext readerContext) {
    try {
        return readerContext.reader().getSumDocFreq(IgnoredFieldMapper.NAME);
Should this invoke getDocCount() instead? Its Javadoc says it "returns the number of documents that have at least one term for this field", which matches the method name, and what we are trying to include in the doc stats, more closely.
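To make the difference between the two statistics concrete, here is a minimal sketch in plain Java. It does not use Lucene; `IgnoredFieldPostingsModel` and its methods are hypothetical names that only imitate what `getSumDocFreq` and `getDocCount` count, assuming the `_ignored` field indexes the names of the ignored fields as terms and that each posting list holds distinct document IDs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of one field's postings: term -> IDs of the documents containing that term.
// This is NOT Lucene code; it only mimics what the two statistics count.
final class IgnoredFieldPostingsModel {

    // Analogue of getSumDocFreq: adds up the document frequency of every term,
    // so a document with two ignored fields contributes twice.
    static long sumDocFreq(Map<String, List<Integer>> postings) {
        long sum = 0;
        for (List<Integer> docIds : postings.values()) {
            sum += docIds.size();
        }
        return sum;
    }

    // Analogue of getDocCount: number of distinct documents that have at least one term.
    static int docCount(Map<String, List<Integer>> postings) {
        Set<Integer> distinct = new HashSet<>();
        for (List<Integer> docIds : postings.values()) {
            distinct.addAll(docIds);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // Document 1 has two ignored fields, documents 0 and 2 have one each.
        Map<String, List<Integer>> postings = Map.of(
            "host.ip", List.of(0, 1),
            "message", List.of(1, 2)
        );
        System.out.println("sumDocFreq = " + sumDocFreq(postings)); // 4
        System.out.println("docCount   = " + docCount(postings));   // 3
    }
}
```

In this toy data three documents have at least one ignored field, but the summed document frequency is four, which is why the distinct-document count is the closer match for "documents with at least one ignored field".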
@@ -136,7 +136,6 @@
import static java.util.Collections.emptyMap;
import static java.util.Collections.singletonList;
import static org.elasticsearch.action.support.WriteRequest.RefreshPolicy.IMMEDIATE;
import static org.elasticsearch.indices.cluster.AbstractIndicesClusterStateServiceTestCase.awaitIndexShardCloseAsyncTasks;
Will put this back.
@@ -211,6 +213,12 @@ public DocsStats docStats() {
        }
    }

    public IgnoredFieldStats ignoredFieldStats() {
        try (Searcher searcher = acquireSearcher("ignored_field", SearcherScope.INTERNAL)) {
@martijnvg this is the logic I was talking about. I had to add a new source to keep FrozenStorageDeciderIT#testScale from failing. If I use "doc_stats" as the source, it fails at the assertion
assert false : "doc stats are eagerly loaded";
in FrozenEngine#openSearcher.
@@ -257,6 +257,7 @@ private Engine.Searcher openSearcher(String source, SearcherScope scope) throws
                assert false : "refresh_needed is always false";
            case "segments":
            case "segments_stats":
            case "ignored_field":
@martijnvg I added this entry which, if I understand correctly, results in reusing a searcher that is already open rather than opening a new one.
Full BWC started to fail after I merged |
I don't see a way to filter stats based on tier other than using the Filtering on it anyway returning ignored field stats only in case the tier preference is
Another option would be to just return an empty result when the request hits a node whose role is |
Summarising what we discussed in a meeting about how to proceed. We will implement this feature keeping the |
With this PR: #101373 we introduced doc values for the |
In addition to enabling aggregations, it also makes it faster to count documents that have an _ignored field with the exists query. Since the dataset quality page has a time filter, it makes more sense to use an exists query in combination with a range query on the timestamp rather than getting stats from an entire index.
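As a rough illustration of the exists-plus-range approach, the sketch below builds a request body that could be sent to the count API (`POST <index>/_count`). The class name, the `@timestamp` field name and the time bounds are assumptions made for the example; the comment above does not name them.

```java
// Hypothetical helper: builds the JSON body of a _count request that matches
// documents having an _ignored field inside a time window.
final class IgnoredCountQuerySketch {

    // timestampField, from and to are inserted verbatim (no JSON escaping),
    // so this sketch assumes they contain no quotes or backslashes.
    static String countBody(String timestampField, String from, String to) {
        return "{\"query\":{\"bool\":{\"filter\":["
            + "{\"exists\":{\"field\":\"_ignored\"}},"
            + "{\"range\":{\"" + timestampField + "\":{\"gte\":\"" + from + "\",\"lt\":\"" + to + "\"}}}"
            + "]}}}";
    }

    public static void main(String[] args) {
        // "@timestamp" and the 15-minute window are assumed values for illustration.
        System.out.println(countBody("@timestamp", "now-15m", "now"));
    }
}
```

Both clauses sit in a `bool` filter, so no scoring is involved and the count is limited to the selected time window rather than the whole index.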
The reason we started looking into this approach is that we can't do read queries to collect telemetry with the current permission setup in Kibana. I realize that this is not a strong point in favor of doing it; back then, our reasoning was that if it helps this use case and is a helpful feature in general, we should do it.
When it comes to counting documents that include at least one ignored field, aggregations might perform poorly because they have to aggregate too many documents over multiple indices targeted either through index patterns or data streams.
As part of the effort to make log ingestion more reliable and to better explain indexing issues, we would like to provide users with the fraction of documents that include at least one ignored field. For this purpose we introduce a new index metric, ignored_field, to the index stats API, which allows fetching statistics about ignored fields.

Ignored field stats include the following:
- total_docs: total number of documents in the index
- docs_with_ignored_fields: total number of documents with at least one ignored field
- sum_doc_freq_terms_ignored_fields: the sum of term frequencies for the _ignored field

This solution has some drawbacks, though:
- IndexWriter buffer waiting to be flushed to Lucene segments

Resolves #108092
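The fraction mentioned in the description can be derived from the two document counters. The following is a minimal sketch, not the PR's `IgnoredFieldStats` class: `IgnoredFieldStatsSketch` and its members are hypothetical, and it assumes counters from several shards or indices are combined by simple summation.

```java
import java.util.List;

// Hypothetical holder for the two document counters described above.
final class IgnoredFieldStatsSketch {
    final long totalDocs;
    final long docsWithIgnoredFields;

    IgnoredFieldStatsSketch(long totalDocs, long docsWithIgnoredFields) {
        this.totalDocs = totalDocs;
        this.docsWithIgnoredFields = docsWithIgnoredFields;
    }

    // Combine per-shard (or per-index) counters by adding each counter up.
    static IgnoredFieldStatsSketch sum(List<IgnoredFieldStatsSketch> parts) {
        long total = 0;
        long ignored = 0;
        for (IgnoredFieldStatsSketch part : parts) {
            total += part.totalDocs;
            ignored += part.docsWithIgnoredFields;
        }
        return new IgnoredFieldStatsSketch(total, ignored);
    }

    // Fraction of documents with at least one ignored field;
    // returns 0.0 when there are no documents to avoid dividing by zero.
    double ignoredFraction() {
        return totalDocs == 0 ? 0.0 : (double) docsWithIgnoredFields / totalDocs;
    }

    public static void main(String[] args) {
        IgnoredFieldStatsSketch combined = sum(List.of(
            new IgnoredFieldStatsSketch(100, 5),
            new IgnoredFieldStatsSketch(300, 15)
        ));
        System.out.println(combined.ignoredFraction());
    }
}
```

Summing the counters before dividing weights each shard by its document count, which averaging per-shard fractions would not do.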