Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EEM] Remove duplicates from latest data set #187699

Merged
merged 12 commits into from
Jul 12, 2024

Conversation

miltonhultgren
Copy link
Contributor

@miltonhultgren miltonhultgren commented Jul 5, 2024

By only grouping on entity.id we should be able to remove duplicates in the latest indices.
This PR also removes the values found for entity.identityFields and replaces it with a list of those field names.
This PR also lifts the values for the identity fields to the root of the document.
This PR removes the displayName from the historical documents.

How to test

Source data:

PUT index_a
{
  "mappings": {
    "properties": {
      "a": {
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT index_b
{
  "mappings": {
    "properties": {
      "b": {
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

POST index_a/_doc
{
  "a": "same",
  "@timestamp": "2024-07-05T12:33:06.162Z"
}

POST index_b/_doc
{
  "b": "same",
  "@timestamp": "2024-07-05T12:33:06.162Z"
}

Entity definition:

POST kbn:/internal/api/entities/definition
{
  "id": "bucket_key",
  "name": "Bucket key",
  "type": "service",
  "indexPatterns": [
    "index_*"
  ],
  "timestampField": "@timestamp",
  "lookback": "5m",
  "identityFields": [
    {
      "field": "a",
      "optional": true
    },
    {
      "field": "b",
      "optional": true
    }
  ],
  "displayNameTemplate": "{{a}}{{b}}",
  "history": {
    "timestampField": "@timestamp",
    "interval": "5m"
  }
}

Change in the format of the resulting documents

"identityFields": {
  "a": null,
  "b": "same"
},

=>

"identityFields": [
  "a",
  "b"
],

@obltmachine
Copy link

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@miltonhultgren miltonhultgren self-assigned this Jul 5, 2024
Copy link
Contributor

@tommyers-elastic tommyers-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the right direction

can we also remove displayName from the history docs (since it should only use one (and maybe not the one in the current doc) identityField), and add it to the latest docs

@miltonhultgren
Copy link
Contributor Author

I'm going to split this PR in two:

  1. Lifting identifyFields up to root in both datasets
  2. Removing duplication in the latest dataset with tests that use DataForge for data generation

@tommyers-elastic tommyers-elastic added the Feature:EEM Elastic Entity Model label Jul 8, 2024
@miltonhultgren miltonhultgren force-pushed the eem-deduplicate-latest branch from 74b1e91 to e5c3f8b Compare July 9, 2024 15:54
@miltonhultgren
Copy link
Contributor Author

can we also remove displayName from the history docs (since it should only use one (and maybe not the one in the current doc) identityField), and add it to the latest docs

@tommyers-elastic
So, in the history docs we would track the values of the identity fields (by storing them at the root).
And then we'd grab those values in the latest transform, and expect to find a single value and use that value to create the display name?

Are we sure we won't need the display name for the history documents? In the UI I guess not because we'll likely enter the history from the latest, so we can hang on to the display name from there.

But what about if there are more than one value found in the latest transform, I'm still not sure how to handle that.
I almost feel like it would be correct to have the ingest pipeline throw errors for that.
What do you think? (In either case, I'll likely address this in the next PR)

Also, this PR is ready for review again!

@miltonhultgren miltonhultgren marked this pull request as ready for review July 9, 2024 16:05
@miltonhultgren miltonhultgren requested a review from a team as a code owner July 9, 2024 16:05
@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Jul 9, 2024
@miltonhultgren miltonhultgren added release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting and removed ci:project-deploy-observability Create an Observability project labels Jul 9, 2024
@tommyers-elastic
Copy link
Contributor

@miltonhultgren the reason for displayName in latest only is because of:

say you have an entity definition with two fields that both contain the user identifier, user.name and labels.user_name; in the history documents you have one or the other. so you might construct a template like {{#user.name}}{{.}}{{/user.name}}{{#labels.user_name}}{{.}}{{/labels.user_name}}. but then when these are combined in the latests docs and both identity fields exist in the document, you get a display name with both fields populated.

so you just put a display name with one of the fields like {{user.name}}. if you include display name in the history documents, only some of them get a value populated, but now in the latest transform you get the correct value.

@botelastic botelastic bot added the ci:project-deploy-observability Create an Observability project label Jul 11, 2024
Copy link
Contributor

@tommyers-elastic tommyers-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but do we need to also update the component templates?

the changes i think we need are to remove displayName from the base template, and include it only in the latest; and to map identityFields as a keyword in the base template.

(while we're at it, we should remove firstSeenTimestamp from the shared mapping too, and include that alongside displayName in the latest template only)

},
},
{
// This must happen AFTER we lift the identity fields into the root of the document
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@elasticmachine
Copy link
Contributor

elasticmachine commented Jul 12, 2024

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @miltonhultgren

@miltonhultgren miltonhultgren merged commit 66e3f08 into elastic:main Jul 12, 2024
24 checks passed
Comment on lines +13 to +21
expect(initializePathScript('someField')).toMatchInlineSnapshot(`
"

if (ctx.someField == null) {
ctx.someField = new HashMap();
}
"
`);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a single level field like tags, you don't need to initialize the new HashMap();

When you refactored this code you missed where it tests the currentIndex + 1 === parts.length. If parts.length is 1 and the currentIndex is 0 then you're already at the end and you can just assign the value, there is no need to instantiate a new HashMap().



if (ctx.some.nested.field == null) {
ctx.some.nested.field = new HashMap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're at the end of the path, there is no need for this.

Comment on lines +50 to +61
"entity.identity.event.category": Object {
"terms": Object {
"field": "event.category",
"size": 1,
},
},
"entity.identity.log.logger": Object {
"terms": Object {
"field": "log.logger",
"size": 1,
},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are single value fields, I wonder if we shouldn't have used top_metric and then used the same set processor as the history?

kpatticha added a commit that referenced this pull request Jul 22, 2024
## Summary

closes: #188761

### changes

- identityFields returns only the fields, query directly service name
and service environment from entity document (EEM
[change](#187699))
- Rename `logRatePerMinute` to `logRate` (EEM
[change](#187021))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:project-deploy-observability Create an Observability project Feature:EEM Elastic Entity Model release_note:skip Skip the PR/issue when compiling release notes v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants