[Dataset quality] Add malformed docs column #170220

Closed
2 tasks done
yngrdyn opened this issue Oct 31, 2023 · 20 comments · Fixed by #172462
yngrdyn commented Oct 31, 2023

📓 Summary

Show users an overview of the health of the data in their datasets by creating a Malformed docs column.


✔️ Acceptance criteria

  • The datasets health page shows a column dedicated to Malformed docs, derived from the _ignored field.
  • This new column is sortable and should be the primary sorting parameter. The final sorting priority should stay as Malformed docs > Integration > Name.

💡 Implementation hints

@yngrdyn yngrdyn added Team:obs-ux-logs Observability Logs User Experience Team Feature:Dataset Health labels Oct 31, 2023
@elasticmachine

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

weltenwort commented Nov 2, 2023

💭 Of all the queries on the page so far, the one to find the malformed docs is probably the most expensive. So I'd be careful about performing it eagerly and making it sortable, since that would put it on the critical path of the page's loading process.

@yngrdyn yngrdyn changed the title [Dataset health] Add malformed docs column [Dataset quality] Add malformed docs column Nov 9, 2023
@yngrdyn yngrdyn self-assigned this Nov 22, 2023
yngrdyn commented Nov 22, 2023

I'm trying to work out the right query to get this information; so far I've come up with:

GET _search
{
  "size": 0,
  "query": {
    "bool": { 
      "filter": [
        {
          "range":{
            "@timestamp":{
              "gte": "now-1d",
              "lt" : "now",
              "format":"epoch_millis"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "dataset": {
      "terms": {
       "field": "data_stream.dataset"
      },
     "aggs": {
       "namespace": {
         "terms": {
           "field": "data_stream.namespace"
          },
          "aggs": {
            "malformed": {
              "filter": { 
                "exists": { 
                  "field": "_ignored" 
                }  
              }
            }
          }
       }
     }
    }
  }
}
Results are looking like this
  {
  "took": 28087,
  "timed_out": false,
  "num_reduce_phases": 5,
  "_shards": {
    "total": 5206,
    "successful": 5206,
    "skipped": 2920,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dataset": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 9513175,
      "buckets": [
        {
          "key": "generic-pods",
          "doc_count": 30730929,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 30730929,
                "malformed": {
                  "doc_count": 25207
                }
              }
            ]
          }
        },
        {
          "key": "apm",
          "doc_count": 2561323,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 2561323,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.volume",
          "doc_count": 2316952,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 2316952,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.container",
          "doc_count": 1463593,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1463593,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "system.diskio",
          "doc_count": 1377213,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1377213,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "apm.service_destination.1m",
          "doc_count": 1293688,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1293688,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.pod",
          "doc_count": 1260213,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1260213,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "system.network",
          "doc_count": 1210065,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1210065,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.state_pod",
          "doc_count": 1015309,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1015309,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.state_container",
          "doc_count": 999728,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 999728,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Am I on the right path, @felixbarny, @ruflin, @weltenwort?

@weltenwort

I'd say it looks correct 👍 Aside from correctness, I was wondering what you make of these points:

  • The terms agg is subject to statistical uncertainties (see Document count error). If completeness and correctness are important, do we want to use a composite agg on data_stream.dataset and data_stream.namespace instead?
  • Have you considered adding more criteria to the main query to allow ES to exclude more indices earlier? I'm thinking of a term clause for data_stream.type == "logs" or similar.

yngrdyn commented Nov 22, 2023

I'm thinking of a term clause for data_stream.type == "logs"

I wanted to ask about this one: will all logs have data_stream.type == logs? If that's the case we can optimize the query. I tried earlier with GET logs-*/_search but was getting fewer results.

do we want to use a composite agg in data_stream.dataset and data_stream.namespace instead?

I didn't think about this one, thanks for bringing it up! I'll explore this option.

ruflin commented Nov 23, 2023

GET logs-*/_search but was getting fewer results

It should be logs-*-*. My guess is that you got fewer results because the earlier unrestricted query also includes metrics-*-*, which we shouldn't (so far).

yngrdyn commented Nov 23, 2023

It should be logs-*-*. My guess is that you got fewer results because the earlier unrestricted query also includes metrics-*-*, which we shouldn't (so far).

I think you are right there.

Trying out a filter like { "term": { "data_stream.type": "logs" } } and using the logs-*-* indices returns the same results.
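For reference, a minimal sketch of how that filter fits into the earlier query (aggregations omitted since they stay the same as above; the term clause is largely redundant once the index pattern is already restricted to logs-*-*, which is why both variants return the same results):

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        // redundant when already querying logs-*-*, kept here only for the comparison
        { "term": { "data_stream.type": "logs" } },
        { "range": { "@timestamp": { "gte": "now-1d", "lt": "now" } } }
      ]
    }
  }
}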

ruflin commented Nov 23, 2023

Trying out a filter like { "term": { "data_stream.type": "logs" } } and using the logs-*-* indices returns the same results.

Good to hear. Performance is almost identical; logs-*-* will have a very slight edge.

yngrdyn commented Nov 23, 2023

@weltenwort this is how the query looks with the composite agg, targeting only the logs-*-* indices:

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": { 
      "filter": [
        {
          "range":{
            "@timestamp":{
              "gte": "now-1d",
              "lt" : "now",
              "format":"epoch_millis"
            }
          }
        } 
      ]
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      },
      "aggs": {
        "malformed": {
          "filter": { 
            "exists": { 
              "field": "_ignored" 
            }  
          }
        }
      }
    }
  }
}
Results are looking like this
{
  "took": 4963,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "datasets": {
      "after_key": {
        "dataset": "generic-pods",
        "namespace": "default"
      },
      "buckets": [
        {
          "key": {
            "dataset": "apm.app.opbeans_android",
            "namespace": "default"
          },
          "doc_count": 723,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "apm.app.opbeans_swift",
            "namespace": "default"
          },
          "doc_count": 338,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "apm.error",
            "namespace": "default"
          },
          "doc_count": 135205,
          "malformed": {
            "doc_count": 120
          }
        },
        {
          "key": {
            "dataset": "elastic_agent",
            "namespace": "default"
          },
          "doc_count": 117445,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "elastic_agent.filebeat",
            "namespace": "default"
          },
          "doc_count": 716525,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "elastic_agent.metricbeat",
            "namespace": "default"
          },
          "doc_count": 426190,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "generic-pods",
            "namespace": "default"
          },
          "doc_count": 38732725,
          "malformed": {
            "doc_count": 49273
          }
        }
      ]
    }
  }
}

I performed a search with the old query (nested terms aggregations) and the new one (composite aggregation), and the results were in fact different.


@mohamedhamed-ahmed

@isaclfreire @ruflin
A couple of questions regarding the UI:

  1. For the column values, do we have an indication of what red, orange, and green map to in terms of percentages?
  2. Are we still showing percentages as values, or only poor, degraded, good? I remember this was discussed before.

yngrdyn commented Nov 23, 2023

Are we still showing percentages as values or only poor, degraded, good? Because I remember this was discussed before

I remember we discussed removing the percentage because at the time we were talking about aggregated rows and giving the user a way to click and open the flyout for more details about specific namespaces. But now that we are showing every dataset+namespace combination as a row, I think it would be helpful and not confusing to show the percentages.


isaclfreire commented Nov 23, 2023

now that we are showing every dataset+namespace combination as a row I think it would be helpful and not confusing to show the percentages

Agreed. I also suggest keeping the percentages, since they are specific to the dataset and not part of an aggregation, so they are more straightforward.

@ruflin and I had briefly discussed something along the lines of this proposal:

  • Green: below 1%
  • Yellow: 1% to 3%
  • Red: above 3%

Does this still make sense @ruflin?

cc @mdbirnstiehl

ruflin commented Nov 24, 2023

I would go with the values put in by @isaclfreire for now. The only thing I would change is Green == 0 and Yellow == between 0 and 3%. This way, green only shows up if there really are no errors.

We will have to tune these values as soon as we look at them with production data. That will give us a much better understanding of what makes the most sense.

@felixbarny

How does this change once we look not only at degraded documents (_ignored) but also at documents from the failure store?

ruflin commented Nov 24, 2023

I currently think of the failure store as a potential second column with a # of docs. But if it is >0 for the time range, things are definitely red.

@weltenwort

@yngrdyn the composite query LGTM 👏 Are you planning to implement pagination through all the buckets, or is the idea to put an upper limit on the number?

yngrdyn commented Nov 29, 2023

@weltenwort I was definitely thinking of implementing pagination through all the buckets, but my gut tells me we might need to put a limit. WDYT?

@ruflin do we have data on the average number of datasets users have? This would be useful for deciding whether to put a limit or just try to get all the buckets.

ruflin commented Nov 30, 2023

@ruflin do we have data on the average number of datasets users have? This would be useful for deciding whether to put a limit or just try to get all the buckets.

It has been some time since I looked at the data; I need to find a fresh sample. Historically, users had ~10-100 datasets, but there are of course users where this number is a multiple higher. In general, "too" many datasets has direct implications on how many indices and shards exist, and beyond a certain size Elasticsearch is not too happy about it either. For now I would work on the assumption that the number stays below 10k even in extreme cases.

For now, I suggest we get all the buckets and see where we hit the limits.

@felixbarny

Is there a way we can implement this with pagination rather than getting all buckets? If yes, how much more complex would that be?

yngrdyn commented Nov 30, 2023

Is there a way we can implement this with pagination rather than getting all buckets? If yes, how much more complex would that be?

I can paginate the request for malformed docs using the after_key returned by the query I posted above, but this wouldn't be complete pagination from the UI perspective: to get the other stats for a dataset we use the dataStreams API, which doesn't allow pagination. In practice, this means we would fetch all data streams from the dataStreams API, fetch all the _ignored information, and combine the two results in the UI using the dataset name.
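For illustration, a minimal sketch of such a follow-up page request, reusing the composite aggregation from above. The page size of 1000 is an assumed placeholder, not a settled value, and the after value is simply the after_key from the earlier example response:

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1d", "lt": "now" } } }
      ]
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        // assumed page size, not a settled value
        "size": 1000,
        // after_key returned by the previous page
        "after": { "dataset": "generic-pods", "namespace": "default" },
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      },
      "aggs": {
        "malformed": {
          "filter": { "exists": { "field": "_ignored" } }
        }
      }
    }
  }
}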

@yngrdyn yngrdyn linked a pull request Dec 4, 2023 that will close this issue
yngrdyn added a commit that referenced this issue Dec 5, 2023
Closes #170220.

### Changes
- New endpoint added to query malformed docs in Elasticsearch: `GET /internal/dataset_quality/data_streams/malformed_docs`
- Decoded response from APIs in `data_streams_stats_client.ts` as suggested by @tonyghiani in #171777.
- New synthtrace scenario, malformed logs, where we ingest documents that will have `_ignored` properties.
- Malformed Docs column was added to `columns.tsx`.

#### Demo


https://github.com/elastic/kibana/assets/1313018/07a76f13-a837-4621-9366-63053a51b489

### How to test?
1. Go to
https://yngrdyn-deploy-kiban-pr172462.kb.us-west2.gcp.elastic-cloud.com/app/observability-log-explorer/dataset-quality
2. `Malformed docs` column should be present and should be sortable

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>