[Dataset quality] Add malformed docs column #170220

Closed
2 tasks done
yngrdyn opened this issue Oct 31, 2023 · 20 comments · Fixed by #172462
yngrdyn commented Oct 31, 2023

📓 Summary

Show users an overview of the health of the data in their datasets by creating a Malformed docs column.


✔️ Acceptance criteria

  • The datasets health page shows a column dedicated to Malformed docs, derived from the _ignored field.
  • This new column is sortable and should be the primary sorting parameter. The final sorting priority should stay as Malformed docs > Integration > Name.

💡 Implementation hints

@yngrdyn yngrdyn added Team:obs-ux-logs Observability Logs User Experience Team Feature:Dataset Health labels Oct 31, 2023
@elasticmachine

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

weltenwort commented Nov 2, 2023

💭 Of all the queries on the page so far, the one to find the malformed docs is probably the most expensive. So I'd be careful about performing it eagerly and making it sortable, since that would put it on the critical path of the page's loading process.

@yngrdyn yngrdyn changed the title [Dataset health] Add malformed docs column [Dataset quality] Add malformed docs column Nov 9, 2023
@yngrdyn yngrdyn self-assigned this Nov 22, 2023
yngrdyn commented Nov 22, 2023

I'm trying to work out the right query to get this information; so far I've come up with:

GET _search
{
  "size": 0,
  "query": {
    "bool": { 
      "filter": [
        {
          "range":{
            "@timestamp":{
              "gte": "now-1d",
              "lt" : "now",
              "format":"epoch_millis"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "dataset": {
      "terms": {
       "field": "data_stream.dataset"
      },
     "aggs": {
       "namespace": {
         "terms": {
           "field": "data_stream.namespace"
          },
          "aggs": {
            "malformed": {
              "filter": { 
                "exists": { 
                  "field": "_ignored" 
                }  
              }
            }
          }
       }
     }
    }
  }
}
Results are looking like this
  {
  "took": 28087,
  "timed_out": false,
  "num_reduce_phases": 5,
  "_shards": {
    "total": 5206,
    "successful": 5206,
    "skipped": 2920,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dataset": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 9513175,
      "buckets": [
        {
          "key": "generic-pods",
          "doc_count": 30730929,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 30730929,
                "malformed": {
                  "doc_count": 25207
                }
              }
            ]
          }
        },
        {
          "key": "apm",
          "doc_count": 2561323,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 2561323,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.volume",
          "doc_count": 2316952,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 2316952,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.container",
          "doc_count": 1463593,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1463593,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "system.diskio",
          "doc_count": 1377213,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1377213,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "apm.service_destination.1m",
          "doc_count": 1293688,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1293688,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.pod",
          "doc_count": 1260213,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1260213,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "system.network",
          "doc_count": 1210065,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1210065,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.state_pod",
          "doc_count": 1015309,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 1015309,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        },
        {
          "key": "kubernetes.state_container",
          "doc_count": 999728,
          "namespace": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "default",
                "doc_count": 999728,
                "malformed": {
                  "doc_count": 0
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Am I on the right path, @felixbarny, @ruflin, @weltenwort?

@weltenwort

I'd say it looks correct 👍 Aside from correctness, I was wondering what you make of these points:

  • The terms agg is subject to statistical uncertainties (see Document count error). If completeness and correctness are important, do we want to use a composite agg on data_stream.dataset and data_stream.namespace instead?
  • Have you considered adding more criteria to the main query to allow ES to exclude more indices earlier? I'm thinking of a term clause for data_stream.type == "logs" or similar.

yngrdyn commented Nov 22, 2023

I'm thinking of a term clause for data_stream.type == "logs"

I wanted to ask about this one: will all logs have data_stream.type == logs? If that's the case we can optimize the query. I tried earlier with GET logs-*/_search but was getting fewer results.

do we want to use a composite agg in data_stream.dataset and data_stream.namespace instead?

I didn't think about this one, thanks for bringing it up! I'll explore this option.

ruflin commented Nov 23, 2023

GET logs-*/_search but was getting fewer results

It should be logs-*-*. My guess is that you got fewer results because the earlier unrestricted query also includes metrics-*-*, which we shouldn't (so far).

yngrdyn commented Nov 23, 2023

It should be logs-*-*. My guess is that you got fewer results because the earlier unrestricted query also includes metrics-*-*, which we shouldn't (so far).

I think you are right there.

Trying out a filter like { "term": { "data_stream.type": "logs" } } and using the logs-*-* indices returns the same results.
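For reference, a minimal sketch of how that filter fits into the earlier query (aggregations omitted since they stay the same as above; the term clause is largely redundant once the index pattern is already restricted to logs-*-*, which is why both variants return the same results):

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        // redundant when already querying logs-*-*, kept here only for the comparison
        { "term": { "data_stream.type": "logs" } },
        { "range": { "@timestamp": { "gte": "now-1d", "lt": "now" } } }
      ]
    }
  }
}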

ruflin commented Nov 23, 2023

Trying out a filter like { "term": { "data_stream.type": "logs" } } and using the logs-*-* indices returns the same results.

Good to hear. Performance is almost identical; logs-*-* will have a very slight edge.

yngrdyn commented Nov 23, 2023

@weltenwort this is how the query looks with the composite agg, targeting only the logs-*-* indices:

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": { 
      "filter": [
        {
          "range":{
            "@timestamp":{
              "gte": "now-1d",
              "lt" : "now",
              "format":"epoch_millis"
            }
          }
        } 
      ]
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      },
      "aggs": {
        "malformed": {
          "filter": { 
            "exists": { 
              "field": "_ignored" 
            }  
          }
        }
      }
    }
  }
}
Results are looking like this
{
  "took": 4963,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "datasets": {
      "after_key": {
        "dataset": "generic-pods",
        "namespace": "default"
      },
      "buckets": [
        {
          "key": {
            "dataset": "apm.app.opbeans_android",
            "namespace": "default"
          },
          "doc_count": 723,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "apm.app.opbeans_swift",
            "namespace": "default"
          },
          "doc_count": 338,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "apm.error",
            "namespace": "default"
          },
          "doc_count": 135205,
          "malformed": {
            "doc_count": 120
          }
        },
        {
          "key": {
            "dataset": "elastic_agent",
            "namespace": "default"
          },
          "doc_count": 117445,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "elastic_agent.filebeat",
            "namespace": "default"
          },
          "doc_count": 716525,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "elastic_agent.metricbeat",
            "namespace": "default"
          },
          "doc_count": 426190,
          "malformed": {
            "doc_count": 0
          }
        },
        {
          "key": {
            "dataset": "generic-pods",
            "namespace": "default"
          },
          "doc_count": 38732725,
          "malformed": {
            "doc_count": 49273
          }
        }
      ]
    }
  }
}

I performed a search with the old query (nested terms aggregations) and the new one (composite aggregation), and the results were in fact different.


@mohamedhamed-ahmed

@isaclfreire @ruflin
A couple of questions regarding the UI:

  1. For the column values, do we have an indication of what red, orange, and green map to in terms of percentages?
  2. Are we still showing percentages as values, or only poor, degraded, good? I remember this was discussed before.

yngrdyn commented Nov 23, 2023

Are we still showing percentages as values or only poor, degraded, good? Because I remember this was discussed before

I remember we discussed removing the percentage because at the time we were talking about aggregated rows and giving the user a way to click and open the flyout for more details about specific namespaces. But now that we are showing every dataset+namespace combination as a row, I think it would be helpful and not confusing to show the percentages.


isaclfreire commented Nov 23, 2023

now that we are showing every dataset+namespace combination as a row I think it would be helpful and not confusing to show the percentages

Agreed. I also suggest keeping the percentages, since they are specific to the dataset and not part of an aggregation, so they are more straightforward.

@ruflin and I had briefly discussed something along the lines of this proposal:

  • Green: below 1%
  • Yellow: 1% to 3%
  • Red: above 3%

Does this still make sense @ruflin?

cc @mdbirnstiehl

ruflin commented Nov 24, 2023

I would go with the values put in by @isaclfreire for now. The only thing I would change is Green == 0 and Yellow == between 0 and 3%. This way, green only shows up if there really are no errors.

We will have to tune these values as soon as we look at them with production data. That will give us a much better understanding of what makes the most sense.

@felixbarny

How does this change once we look not only at degraded documents (_ignored) but also at documents from the failure store?

ruflin commented Nov 24, 2023

I currently think of the failure store as a potential second column with a # of docs. But if it is >0 for the time range, things are definitely red.

@weltenwort

@yngrdyn the composite query LGTM 👏 Are you planning to implement pagination through all the buckets, or is the idea to put an upper limit on the number?

yngrdyn commented Nov 29, 2023

@weltenwort I was definitely thinking of implementing pagination through all the buckets, but my gut tells me we might need to put a limit. WDYT?

@ruflin do we have data on the average number of datasets users have? This would be useful for deciding whether to put a limit or just try to get all the buckets.

ruflin commented Nov 30, 2023

@ruflin do we have data on the average number of datasets users have? This would be useful for deciding whether to put a limit or just try to get all the buckets.

It has been some time since I looked at the data; I need to find a fresh sample. Historically, users had ~10-100 datasets, but there are of course users where this number is a multiple higher. In general, "too" many datasets has direct implications on how many indices and shards exist, and beyond a certain size Elasticsearch is not too happy about it either. For now I would work on the assumption that the number stays below 10k even in extreme cases.

For now, I suggest we get all the buckets and see where we hit the limits.

@felixbarny

Is there a way we can implement this with pagination rather than getting all buckets? If yes, how much more complex would that be?

yngrdyn commented Nov 30, 2023

Is there a way we can implement this with pagination rather than getting all buckets? If yes, how much more complex would that be?

I can paginate the request for malformed docs using the after_key returned by the query I posted above, but this wouldn't be complete pagination from the UI perspective: to get the other stats for a dataset we use the dataStreams API, which doesn't allow pagination. In practice, this means we would fetch all data streams from the dataStreams API, fetch all the _ignored information, and combine the two results in the UI using the dataset name.
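For illustration, a minimal sketch of such a follow-up page request, reusing the composite aggregation from above. The page size of 1000 is an assumed placeholder, not a settled value, and the after value is simply the after_key from the earlier example response:

GET logs-*-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1d", "lt": "now" } } }
      ]
    }
  },
  "aggs": {
    "datasets": {
      "composite": {
        // assumed page size, not a settled value
        "size": 1000,
        // after_key returned by the previous page
        "after": { "dataset": "generic-pods", "namespace": "default" },
        "sources": [
          { "dataset": { "terms": { "field": "data_stream.dataset" } } },
          { "namespace": { "terms": { "field": "data_stream.namespace" } } }
        ]
      },
      "aggs": {
        "malformed": {
          "filter": { "exists": { "field": "_ignored" } }
        }
      }
    }
  }
}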

@yngrdyn yngrdyn linked a pull request Dec 4, 2023 that will close this issue
yngrdyn added a commit that referenced this issue Dec 5, 2023
Closes #170220.

### Changes
- New endpoint added to query malformed docs in Elasticsearch: `GET /internal/dataset_quality/data_streams/malformed_docs`
- Decoded response from APIs in `data_streams_stats_client.ts` as suggested by @tonyghiani in #171777.
- New synthtrace scenario, malformed logs, where we ingest documents that will have `_ignored` properties.
- Malformed Docs column was added to `columns.tsx`.

#### Demo


https://github.com/elastic/kibana/assets/1313018/07a76f13-a837-4621-9366-63053a51b489

### How to test?
1. Go to
https://yngrdyn-deploy-kiban-pr172462.kb.us-west2.gcp.elastic-cloud.com/app/observability-log-explorer/dataset-quality
2. `Malformed docs` column should be present and should be sortable

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>