[Dataset quality] Add malformed docs column #170220
Comments
Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs) |
💭 Of all the queries on the page so far, the one to find the malformed docs is probably the most expensive. So I'd be careful about performing it eagerly and making it sortable, since that would put it on the critical path of the page's loading process.
I'm trying to get the right query to get this information, so far I've come to
Results are looking like this
Am I on the right path, @felixbarny, @ruflin, @weltenwort?
I'd say it looks correct 👍 Aside from the correctness, I was wondering what you make of these points:
I wanted to consult about this one, all logs will have this
Didn't think about this one, thanks for bringing it up! I'll explore this option.
It should be |
I think you are right there. Trying out including a filter like |
Good to hear. Performance is almost identical, |
@weltenwort this is how the query is looking with the composite agg + targeting only
Results are looking like this
I performed a search with the old query (nested term aggregations) and the new one (composite aggregations), and the results were in fact different.
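The queries posted in this thread did not survive in the text, so the following is only a minimal sketch of what a composite-aggregation request for malformed docs could look like, not the query that was actually used. It assumes malformed docs are those where the `_ignored` field exists, and that buckets are keyed by `data_stream.dataset` and `data_stream.namespace`. The function name, the `datasets` aggregation name, the page size and the `@timestamp` range filter are all hypothetical.

```typescript
// Hypothetical sketch: builds a search body that counts documents carrying
// the `_ignored` metadata field, grouped per dataset and namespace.
// None of these names come from the original thread.
interface CompositeAfterKey {
  [source: string]: string;
}

interface CompositeAgg {
  size: number;
  sources: Array<Record<string, { terms: { field: string } }>>;
  after?: CompositeAfterKey;
}

function buildMalformedDocsRequest(
  start: string,
  end: string,
  pageSize: number = 1000, // assumed page size, tune as needed
  after?: CompositeAfterKey
) {
  const composite: CompositeAgg = {
    size: pageSize,
    sources: [
      { dataset: { terms: { field: "data_stream.dataset" } } },
      { namespace: { terms: { field: "data_stream.namespace" } } },
    ],
  };
  // Only set `after` when resuming from a previous page.
  if (after) {
    composite.after = after;
  }

  return {
    size: 0, // only the aggregation buckets are needed, not the hits
    query: {
      bool: {
        filter: [
          // Documents with at least one ignored (malformed) field.
          { exists: { field: "_ignored" } },
          { range: { "@timestamp": { gte: start, lte: end } } },
        ],
      },
    },
    aggs: {
      datasets: { composite },
    },
  };
}
```

Each bucket's `doc_count` would then be the number of malformed docs for one dataset and namespace pair within the time range.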
@isaclfreire @ruflin
I remember we discussed removing the percentage because we were previously talking about aggregated rows and having a way for the user to click and open the flyout for more details about specific namespaces, but now that we are showing every
Agreed. I also suggest keeping the percentages, as they are specific to the dataset rather than part of an aggregation, so they are more straightforward. @ruflin and I had briefly discussed something along the lines of this proposal:
Does this still make sense @ruflin? |
I would go with the values put in by @isaclfreire for now. The only thing I would change is `Green == 0`, `Yellow: Between 0 and 3%`. This way, green only shows up if there are really no errors. We will have to tune these values as soon as we look at them with production data, which will give us a much better understanding of what makes the most sense.
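As a rough illustration of the thresholds suggested above, a mapping from the malformed docs percentage to a color could look like the sketch below. Only "green equals 0" and "yellow between 0 and 3%" come from the comment. Treating everything above 3% as red, and treating exactly 3% as yellow, are assumptions, and both function names are hypothetical.

```typescript
type Quality = "green" | "yellow" | "red";

// Share of malformed docs in a dataset, as a percentage.
// Returns 0 for an empty dataset to avoid dividing by zero.
function malformedPercentage(malformedDocs: number, totalDocs: number): number {
  if (totalDocs === 0) {
    return 0;
  }
  return (malformedDocs * 100) / totalDocs;
}

// Green only when there are no malformed docs at all.
// Assumption: the yellow band includes 3%, and anything above it is red.
function malformedDocsQuality(percentage: number): Quality {
  if (percentage === 0) {
    return "green";
  }
  if (percentage <= 3) {
    return "yellow";
  }
  return "red";
}
```

Keeping the thresholds in one small function should make them easy to tune once production data is available.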
How does this change once we not only look at degraded documents (_ignored) but also documents from the failure store? |
I currently think of the failure store as a potential second column with the number of docs. But if it is >0 for the time range, things are definitely red.
@yngrdyn the composite query LGTM 👏 Are you planning to implement pagination through all the buckets or is the idea to put an upper limit on the number?
@weltenwort I was definitely thinking of implementing pagination through all the buckets, but my gut is telling me we might need to put a limit. wdyt? @ruflin do we have data on the average number of datasets users have? This would be useful for deciding whether to put a limit or go with the flow and try to get all the buckets.
It has been some time since I looked at the data; I need to find a fresh sample. Historically, users had ~10-100 datasets, but there are of course users where this number can be several times higher. In general, "too" many datasets has direct implications on how many indices and shards there are, and from a certain size Elasticsearch is also not too happy about it. For now I would assume the number is below 10k even in the extreme cases, so I suggest we get all the buckets and see where we hit the limits.
Is there a way we can implement this with pagination rather than getting all buckets? If yes, how much more complex would that be? |
I can paginate the request for getting malformed docs using |
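For reference, paginating a composite aggregation generally means passing the `after_key` of one response as the `after` value of the next request until a page comes back empty. The sketch below shows that loop under stated assumptions: `fetchPage` is a hypothetical callback that runs the search and returns the composite buckets, and the 10,000 bucket cap is taken from the upper bound assumed earlier in this thread rather than from any Elasticsearch limit.

```typescript
interface Bucket {
  key: Record<string, string>;
  doc_count: number;
}

interface CompositePage {
  buckets: Bucket[];
  after_key?: Record<string, string>;
}

// Hypothetical callback: runs one search and returns the composite buckets.
type FetchPage = (after?: Record<string, string>) => Promise<CompositePage>;

// Collects buckets page by page, stopping on an empty page or at the cap.
async function collectAllBuckets(
  fetchPage: FetchPage,
  maxBuckets: number = 10000 // assumed upper limit, not an Elasticsearch default
): Promise<Bucket[]> {
  const all: Bucket[] = [];
  let after: Record<string, string> | undefined = undefined;

  do {
    const page: CompositePage = await fetchPage(after);
    all.push(...page.buckets);
    // An empty page means there is nothing left to fetch.
    after = page.buckets.length > 0 ? page.after_key : undefined;
  } while (after !== undefined && all.length < maxBuckets);

  return all.slice(0, maxBuckets);
}
```

The cap keeps the number of round trips bounded, at the cost of silently truncating results for clusters with more datasets than the limit.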
Closes #170220.

### Changes
- New endpoint added to query malformed docs in elasticsearch `GET /internal/dataset_quality/data_streams/malformed_docs`
- Decoded response from apis in `data_streams_stats_client.ts` as suggested by @tonyghiani in #171777.
- New synthtrace scenario, malformed logs, where we ingest documents that will have `_ignored` properties.
- Malformed Docs column was added to `columns.tsx`.

#### Demo
https://github.com/elastic/kibana/assets/1313018/07a76f13-a837-4621-9366-63053a51b489

### How to test?
1. Go to https://yngrdyn-deploy-kiban-pr172462.kb.us-west2.gcp.elastic-cloud.com/app/observability-log-explorer/dataset-quality
2. `Malformed docs` column should be present and should be sortable

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
📓 Summary
Show users an overview of the health of the data in their datasets by creating a Malformed docs column.
✔️ Acceptance criteria
Malformed docs
.Malformed docs
>Integration
>Name
derived from
_ignored
field.
💡 Implementation hints
Tasks