Double counting of documents processed when both a default/request and a final pipeline are used #92843

joegallo · 2023-01-11T18:09:45Z

PUT _ingest/pipeline/pipeline-3
{
 "processors": [
  {
   "set": {
    "field": "field-3",
    "value": "pipeline-3"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-2
{
 "processors": [
  {
   "set": {
    "field": "field-2",
    "value": "pipeline-2"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-1
{
 "processors" : [
  {
   "set": {
    "field": "field-1",
    "value": "pipeline-1"
   }
  },
  {
   "pipeline": {
    "name": "pipeline-2"
   }
  }
 ]
}

PUT index-1

PUT index-1/_settings
{
 "index" : {
  "default_pipeline": "pipeline-1",
  "final_pipeline": "pipeline-3"
 }
}

POST _bulk
{ "index" : { "_index" : "index-1" } }
{ "doc_id" : 0 }

POST index-1/_search

GET _nodes/stats?filter_path=nodes.*.ingest

The above creates three pipelines, each of which records its own name in a field of every document it processes. Note that pipeline-1 calls pipeline-2, and that pipeline-1 is installed as the default_pipeline while pipeline-3 is installed as the final_pipeline (reminder: both the default_pipeline and the final_pipeline are executed when a document is indexed).
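As a rough illustration of the execution order described above, here is a toy Python model. It is not Elasticsearch code: `PIPELINES`, `run_pipeline`, and `index_document` are hypothetical names invented for this sketch, and only the `set` and `pipeline` processors used in this reproduction are modelled. It assumes the default pipeline runs first and the final pipeline runs last, with a `pipeline` processor invoking the named pipeline inline.

```python
# Toy model of the three pipelines above; not Elasticsearch's implementation.
PIPELINES = {
    "pipeline-1": [
        ("set", {"field": "field-1", "value": "pipeline-1"}),
        ("pipeline", {"name": "pipeline-2"}),
    ],
    "pipeline-2": [("set", {"field": "field-2", "value": "pipeline-2"})],
    "pipeline-3": [("set", {"field": "field-3", "value": "pipeline-3"})],
}


def run_pipeline(name, doc):
    """Apply each processor of the named pipeline to doc, in order."""
    for kind, config in PIPELINES[name]:
        if kind == "set":
            doc[config["field"]] = config["value"]
        elif kind == "pipeline":
            # A pipeline processor hands the same document to another pipeline.
            run_pipeline(config["name"], doc)
    return doc


def index_document(doc, default_pipeline, final_pipeline):
    """Run the default pipeline first, then the final pipeline (assumed order)."""
    doc = dict(doc)  # work on a copy so the caller's document is untouched
    run_pipeline(default_pipeline, doc)
    run_pipeline(final_pipeline, doc)
    return doc


result = index_document({"doc_id": 0}, "pipeline-1", "pipeline-3")
print(result)
```

In this model a single document passes through all three pipelines exactly once, which is the behaviour the reproduction is meant to exercise.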

The _search near the end will give a result like this:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "index-1",
        "_id": "DFf9oYUBqS6z6DmM3nW4",
        "_score": 1,
        "_source": {
          "field-3": "pipeline-3",
          "field-1": "pipeline-1",
          "field-2": "pipeline-2",
          "doc_id": 0
        }
      }
    ]
  }
}

Note that fields 1-3 all have the expected values, indicating that all three pipelines executed against this document.

The _nodes/stats call will give a result like this:

{
  "nodes": {
    "e3BFjLN8STSspUcCnDcXkQ": {
      "ingest": {
        "total": {
          "count": 2,
          [...]
        },
        "pipelines": {
          "pipeline-1": {
            "count": 1,
            [...]
          },
          "pipeline-2": {
            "count": 1,
            [...]
          },
          "pipeline-3": {
            "count": 1,
            [...]
          }
        }
      }
    }
  }
}

The "count" for each individual pipeline is correct: each pipeline was executed against a single document. However, we're double counting at the top level. The "total" "count" is 2 but should have been 1 (there was only one document).
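The mismatch can also be checked mechanically. The following is a minimal sketch, assuming a single-node cluster in which every document is ingested by that one node; `nodes_with_unexpected_total` and `expected_docs` are hypothetical names for this example, and the `stats` dictionary is the response shown above with the elided `[...]` members left out.

```python
# Node stats response from this issue, with the elided "[...]" members omitted.
stats = {
    "nodes": {
        "e3BFjLN8STSspUcCnDcXkQ": {
            "ingest": {
                "total": {"count": 2},
                "pipelines": {
                    "pipeline-1": {"count": 1},
                    "pipeline-2": {"count": 1},
                    "pipeline-3": {"count": 1},
                },
            }
        }
    }
}


def nodes_with_unexpected_total(stats, expected_docs):
    """Return {node_id: total_count} for nodes whose ingest total differs
    from the number of documents we expected that node to have ingested."""
    return {
        node_id: node["ingest"]["total"]["count"]
        for node_id, node in stats["nodes"].items()
        if node["ingest"]["total"]["count"] != expected_docs
    }


# Exactly one document was indexed in the reproduction above.
mismatches = nodes_with_unexpected_total(stats, expected_docs=1)
print(mismatches)
```

An empty result would mean the top-level count matches the number of documents indexed; with the response above, the single node is reported with a total of 2.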


elasticsearchmachine · 2023-01-11T18:10:08Z

Pinging @elastic/es-data-management (Team:Data Management)

joegallo added the >bug, :Data Management/Ingest Node, and Team:Data Management labels Jan 11, 2023

joegallo self-assigned this Jan 11, 2023

This was referenced Jan 20, 2023

Add an IngestService stats test #93120

Merged

Handle a default/request pipeline and a final pipeline with minimal additional overhead #93329

Merged

joegallo closed this as completed in #93329 Jan 30, 2023
