[EEM] Remove duplicates from latest data set #187699

miltonhultgren · 2024-07-05T16:06:52Z

Grouping only on entity.id should remove the duplicates in the latest indices.
This PR also replaces the values stored in entity.identityFields with a list of the identity field names.
It lifts the values of the identity fields to the root of the document.
It removes displayName from the historical documents.
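
To make the deduplication idea concrete, here is a minimal in-memory sketch; it is not the transform itself. The document shape, the `countBuckets` helper, and the assumption that both source documents resolve to the same entity.id are illustrative only.

```typescript
// Hypothetical, simplified shape of a history document (assumption).
interface HistoryDoc {
  entity: { id: string };
  a?: string;
  b?: string;
}

// Count how many buckets a given grouping key produces.
function countBuckets(docs: HistoryDoc[], keyOf: (doc: HistoryDoc) => string): number {
  return new Set(docs.map(keyOf)).size;
}

// Two documents for the same entity, each carrying a different identity field.
// Assumption: both resolve to the same entity.id.
const docs: HistoryDoc[] = [
  { entity: { id: 'same' }, a: 'same' },
  { entity: { id: 'same' }, b: 'same' },
];

// Grouping on the id plus the identity field values splits the entity in two.
const byIdAndValues = countBuckets(docs, (d) =>
  JSON.stringify([d.entity.id, d.a ?? null, d.b ?? null])
);

// Grouping on the id alone yields a single bucket.
const byIdOnly = countBuckets(docs, (d) => d.entity.id);
```

In this sketch the combined key produces two buckets and the id-only key produces one, which is the duplicate this PR aims to remove.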

How to test

Source data:

PUT index_a
{
  "mappings": {
    "properties": {
      "a": {
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT index_b
{
  "mappings": {
    "properties": {
      "b": {
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

POST index_a/_doc
{
  "a": "same",
  "@timestamp": "2024-07-05T12:33:06.162Z"
}

POST index_b/_doc
{
  "b": "same",
  "@timestamp": "2024-07-05T12:33:06.162Z"
}

Entity definition:

POST kbn:/internal/api/entities/definition
{
  "id": "bucket_key",
  "name": "Bucket key",
  "type": "service",
  "indexPatterns": [
    "index_*"
  ],
  "timestampField": "@timestamp",
  "lookback": "5m",
  "identityFields": [
    {
      "field": "a",
      "optional": true
    },
    {
      "field": "b",
      "optional": true
    }
  ],
  "displayNameTemplate": "{{a}}{{b}}",
  "history": {
    "timestampField": "@timestamp",
    "interval": "5m"
  }
}

Change in the format of the resulting documents

"identityFields": {
  "a": null,
  "b": "same"
},

=>

"identityFields": [
  "a",
  "b"
],
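
As a rough illustration of the format change, the sketch below converts the old object form into a list of field names plus the values lifted to the root. The function name `convertIdentityFields`, the `root` property, and the choice to skip null values are assumptions, not the PR's implementation.

```typescript
// Old format: identity field name -> value (or null when the field was absent).
type IdentityValues = Record<string, string | null>;

// Hypothetical result: the list of names plus the values lifted to the root.
interface NewShape {
  identityFields: string[];
  root: Record<string, string>;
}

function convertIdentityFields(old: IdentityValues): NewShape {
  const root: Record<string, string> = {};
  for (const field of Object.keys(old)) {
    const value = old[field];
    // Assumption: fields without a value are not lifted to the root.
    if (value !== null) {
      root[field] = value;
    }
  }
  return { identityFields: Object.keys(old), root };
}
```

Called with `{ a: null, b: 'same' }`, this yields the names `a` and `b` and a root containing only `b`.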

obltmachine · 2024-07-05T16:07:02Z

🤖 GitHub comments


Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

...y_solution/entity_manager/server/lib/entities/ingest_pipeline/generate_history_processors.ts

...ty_solution/entity_manager/server/lib/entities/ingest_pipeline/generate_latest_processors.ts

...rvability_solution/entity_manager/server/lib/entities/transform/generate_latest_transform.ts

tommyers-elastic

this is the right direction

can we also remove displayName from the history docs (since it should only use one (and maybe not the one in the current doc) identityField), and add it to the latest docs

...y_solution/entity_manager/server/lib/entities/ingest_pipeline/generate_history_processors.ts

...rvability_solution/entity_manager/server/lib/entities/transform/generate_latest_transform.ts

...y_solution/entity_manager/server/lib/entities/ingest_pipeline/generate_history_processors.ts

miltonhultgren · 2024-07-08T14:40:43Z

I'm going to split this PR in two:

Lifting identityFields up to root in both datasets
Removing duplication in the latest dataset with tests that use DataForge for data generation


miltonhultgren · 2024-07-09T16:03:32Z

> can we also remove displayName from the history docs (since it should only use one (and maybe not the one in the current doc) identityField), and add it to the latest docs

@tommyers-elastic
So, in the history docs we would track the values of the identity fields (by storing them at the root).
And then we'd grab those values in the latest transform, and expect to find a single value and use that value to create the display name?

Are we sure we won't need the display name for the history documents? In the UI I guess not because we'll likely enter the history from the latest, so we can hang on to the display name from there.

But what if more than one value is found in the latest transform? I'm still not sure how to handle that.
I almost feel like it would be correct to have the ingest pipeline throw errors for that.
What do you think? (In either case, I'll likely address this in the next PR)

Also, this PR is ready for review again!

tommyers-elastic · 2024-07-09T16:12:08Z

@miltonhultgren the reason for displayName in latest only is because of:

say you have an entity definition with two fields that both contain the user identifier, user.name and labels.user_name; in the history documents you have one or the other. so you might construct a template like {{#user.name}}{{.}}{{/user.name}}{{#labels.user_name}}{{.}}{{/labels.user_name}}. but then when these are combined in the latest docs and both identity fields exist in the document, you get a display name with both fields populated.

so you just put a display name with one of the fields like {{user.name}}. if you include display name in the history documents, only some of them get a value populated, but now in the latest transform you get the correct value.
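
The effect described above can be reproduced with a tiny stand-in renderer. This is a sketch only: `render` is a hypothetical helper that supports just `{{#field}}...{{/field}}` sections and `{{field}}` variables, treats documents as flat maps keyed by the dotted field name, and is not the template engine Kibana uses. The value `alice` is made up.

```typescript
// Minimal stand-in for a mustache-style renderer (assumption, not the real engine).
function render(template: string, doc: Record<string, string>): string {
  // Sections render their body only when the field has a value; {{.}} is that value.
  const withSections = template.replace(
    /\{\{#([\w.]+)\}\}(.*?)\{\{\/\1\}\}/g,
    (_match: string, name: string, body: string) => {
      const value = doc[name];
      return value ? body.replace(/\{\{\.\}\}/g, () => value) : '';
    }
  );
  // Plain variables render the value, or an empty string when it is missing.
  return withSections.replace(/\{\{([\w.]+)\}\}/g, (_match: string, name: string) => doc[name] ?? '');
}

const sectionTemplate =
  '{{#user.name}}{{.}}{{/user.name}}{{#labels.user_name}}{{.}}{{/labels.user_name}}';

// History documents carry one field or the other; the latest document carries both.
const historyDoc = { 'labels.user_name': 'alice' };
const latestDoc = { 'user.name': 'alice', 'labels.user_name': 'alice' };

const historyName = render(sectionTemplate, historyDoc);
const latestName = render(sectionTemplate, latestDoc);
const singleFieldLatest = render('{{user.name}}', latestDoc);
const singleFieldHistory = render('{{user.name}}', historyDoc);
```

With the two-section template the history document renders `alice` but the latest document renders `alicealice`. With the single-field template the latest document renders `alice`, while a history document that lacks user.name renders an empty string.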


tommyers-elastic

LGTM - but do we need to also update the component templates?

the changes i think we need are to remove displayName from the base template, and include it only in the latest; and to map identityFields as a keyword in the base template.

(while we're at it, we should remove firstSeenTimestamp from the shared mapping too, and include that alongside displayName in the latest template only)

tommyers-elastic · 2024-07-11T16:39:44Z

...ty_solution/entity_manager/server/lib/entities/ingest_pipeline/generate_latest_processors.ts

+      },
+    },
+    {
+      // This must happen AFTER we lift the identity fields into the root of the document



elasticmachine · 2024-07-12T13:23:55Z

💚 Build Succeeded

Buildkite Build
Commit: be52863
Storybooks Preview
Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-187699-be528630ad3e
Observability Deployment

Metrics [docs]

✅ unchanged

History

💚 Build #220942 succeeded ea9a00b
💛 Build #220888 was flaky 34a463f
💔 Build #220415 failed 54b70e1

cc @miltonhultgren

simianhacker · 2024-07-12T18:28:35Z

.../entity_manager/server/lib/entities/helpers/ingest_pipeline_script_processor_helpers.test.ts

+      expect(initializePathScript('someField')).toMatchInlineSnapshot(`
+        "
+
+                if (ctx.someField == null) {
+                    ctx.someField = new HashMap();
+                }
+              "
+      `);
+    });


If it's a single level field like tags, you don't need to initialize the new HashMap();

When you refactored this code you missed where it tests the currentIndex + 1 === parts.length. If parts.length is 1 and the currentIndex is 0 then you're already at the end and you can just assign the value, there is no need to instantiate a new HashMap().
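
One way to express the behaviour described in this comment is sketched below. `initializeIntermediatePaths` is a hypothetical name, not the helper from the PR; it emits a null check for every level of the path except the last one, which is assigned directly.

```typescript
// Sketch: build a Painless snippet that initializes only the intermediate
// levels of a dotted path. The last segment needs no HashMap.
function initializeIntermediatePaths(path: string): string {
  const parts = path.split('.');
  return parts
    .slice(0, -1) // drop the final segment
    .map((_part, index) => {
      const prefix = 'ctx.' + parts.slice(0, index + 1).join('.');
      return `if (${prefix} == null) { ${prefix} = new HashMap(); }`;
    })
    .join('\n');
}
```

For a single-level field such as `someField` this returns an empty string; for `some.nested.field` it emits checks for `ctx.some` and `ctx.some.nested` only.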

simianhacker · 2024-07-12T18:30:02Z

.../entity_manager/server/lib/entities/helpers/ingest_pipeline_script_processor_helpers.test.ts

+
+
+                if (ctx.some.nested.field == null) {
+                    ctx.some.nested.field = new HashMap();


You're at the end of the path, there is no need for this.

simianhacker · 2024-07-12T19:00:38Z

...y_manager/server/lib/entities/transform/__snapshots__/generate_latest_transform.test.ts.snap

+      "entity.identity.event.category": Object {
+        "terms": Object {
+          "field": "event.category",
+          "size": 1,
+        },
+      },
+      "entity.identity.log.logger": Object {
+        "terms": Object {
+          "field": "log.logger",
+          "size": 1,
+        },
+      },


Since these are single value fields, I wonder if we shouldn't have used top_metrics and then used the same set processor as the history?
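
For comparison with the terms aggregation in the snapshot, a `top_metrics` variant might look roughly like the sketch below. The `identityTopMetrics` helper and the choice to sort on `@timestamp` descending are assumptions; the PR as merged uses `terms` with `size: 1`.

```typescript
// Sketch: build a top_metrics aggregation that returns the most recent value
// of an identity field, keyed the same way as the terms aggregation above.
function identityTopMetrics(field: string, timestampField = '@timestamp') {
  return {
    [`entity.identity.${field}`]: {
      top_metrics: {
        metrics: [{ field }],
        // Assumption: the newest document wins.
        sort: { [timestampField]: 'desc' },
      },
    },
  };
}

const eventCategoryAgg = identityTopMetrics('event.category');
```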

## Summary

closes: #188761

### changes

- identityFields returns only the fields, query directly service name and service environment from entity document (EEM [change](#187699))
- Rename `logRatePerMinute` to `logRate` (EEM [change](#187021))

miltonhultgren self-assigned this Jul 5, 2024


tommyers-elastic reviewed Jul 8, 2024


tommyers-elastic added the Feature:EEM Elastic Entity Model label Jul 8, 2024

miltonhultgren added 2 commits July 9, 2024 15:31

[EEM] Remove duplicates from latest data set

3d48616

Add list of identity fields and bring back displayName

e5c3f8b

miltonhultgren force-pushed the eem-deduplicate-latest branch from 74b1e91 to e5c3f8b Compare July 9, 2024 15:54

Merge branch 'main' of github.com:elastic/kibana into eem-deduplicate-latest

b9a1f7a

miltonhultgren marked this pull request as ready for review July 9, 2024 16:05

miltonhultgren requested a review from a team as a code owner July 9, 2024 16:05

Merge branch 'main' into eem-deduplicate-latest

54b70e1

miltonhultgren requested a review from tommyers-elastic July 9, 2024 16:06

botelastic bot added the ci:project-deploy-observability Create an Observability project label Jul 9, 2024

miltonhultgren added the release_note:skip (Skip the PR/issue when compiling release notes) and backport:skip (This commit does not require backporting) labels and removed the ci:project-deploy-observability (Create an Observability project) label Jul 9, 2024

miltonhultgren mentioned this pull request Jul 9, 2024

[EEM] Include identity fields and values in document root #187884

Closed

miltonhultgren added 2 commits July 11, 2024 15:25

Lift identity fields to root, update displayName generation

011ab44

Merge branch 'main' of github.com:elastic/kibana into eem-deduplicate-latest

cf11a7f

botelastic bot added the ci:project-deploy-observability Create an Observability project label Jul 11, 2024

miltonhultgren added 2 commits July 11, 2024 15:29

Extract function

34a463f

Merge branch 'main' into eem-deduplicate-latest

ea9a00b

tommyers-elastic requested a review from simianhacker July 11, 2024 16:31

tommyers-elastic approved these changes Jul 11, 2024


miltonhultgren added 3 commits July 12, 2024 10:24

Merge branch 'main' of github.com:elastic/kibana into eem-deduplicate-latest

ada98ab

Merge branch 'main' of github.com:elastic/kibana into eem-deduplicate-latest

be83608

Update mappings, fix script pathing, add script helpers

336256f

miltonhultgren requested a review from tommyers-elastic July 12, 2024 12:21

Merge branch 'main' into eem-deduplicate-latest

be52863

miltonhultgren merged commit 66e3f08 into elastic:main Jul 12, 2024
24 checks passed

kibanamachine added the v8.16.0 label Jul 12, 2024

simianhacker reviewed Jul 12, 2024


This was referenced Jul 19, 2024

[APM] Test end-to-end EEM enablement and the new service inventory experience #188521

Closed

[APM] Replace any reference to identityFields #188761

Closed

[APM] Updated eem schema #188763

Merged
