Double counting of documents processed when both a default/request and a final pipeline are used #92843

Closed
joegallo opened this issue Jan 11, 2023 · 1 comment · Fixed by #93329
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

joegallo commented Jan 11, 2023

PUT _ingest/pipeline/pipeline-3
{
 "processors": [
  {
   "set": {
    "field": "field-3",
    "value": "pipeline-3"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-2
{
 "processors": [
  {
   "set": {
    "field": "field-2",
    "value": "pipeline-2"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-1
{
 "processors" : [
  {
   "set": {
    "field": "field-1",
    "value": "pipeline-1"
   }
  },
  {
   "pipeline": {
    "name": "pipeline-2"
   }
  }
 ]
}

PUT index-1

PUT index-1/_settings
{
 "index" : {
  "default_pipeline": "pipeline-1",
  "final_pipeline": "pipeline-3"
 }
}

POST _bulk
{ "index" : { "_index" : "index-1" } }
{ "doc_id" : 0 }

POST index-1/_search

GET _nodes/stats?filter_path=nodes.*.ingest

The above creates three pipelines that each record their own name in every document they process. Note that pipeline-1 calls pipeline-2, and that pipeline-1 is installed as the default_pipeline while pipeline-3 is installed as the final_pipeline (reminder: both the default_pipeline and the final_pipeline are executed when a document is indexed).
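To make the expected behavior concrete, here is a toy Python model of the setup described above. It is only a sketch and not Elasticsearch code: the `PIPELINES` table and the `run_pipeline` and `index_document` helpers are hypothetical names invented for illustration. It assumes the default pipeline runs first and the final pipeline runs afterwards.

```python
# Toy model of the reproduction above; NOT Elasticsearch code.
# PIPELINES, run_pipeline and index_document are hypothetical names.
PIPELINES = {
    "pipeline-1": [("set", "field-1", "pipeline-1"), ("pipeline", "pipeline-2")],
    "pipeline-2": [("set", "field-2", "pipeline-2")],
    "pipeline-3": [("set", "field-3", "pipeline-3")],
}


def run_pipeline(name, doc, counts):
    """Apply one pipeline to doc in place and count one execution of it."""
    counts[name] = counts.get(name, 0) + 1
    for processor in PIPELINES[name]:
        if processor[0] == "set":
            _, field, value = processor
            doc[field] = value
        elif processor[0] == "pipeline":
            # A pipeline processor hands the same document to another pipeline.
            run_pipeline(processor[1], doc, counts)
    return doc


def index_document(doc, default_pipeline, final_pipeline, counts):
    """Run the default pipeline, then the final pipeline, on a copy of doc."""
    doc = dict(doc)
    run_pipeline(default_pipeline, doc, counts)
    run_pipeline(final_pipeline, doc, counts)
    return doc


counts = {}
indexed = index_document({"doc_id": 0}, "pipeline-1", "pipeline-3", counts)
print(indexed)
print(counts)
```

In this model a single document passes through all three pipelines exactly once, so each per-pipeline count is 1 and the number of documents ingested is also 1.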

The _search near the end will give a result like this:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "index-1",
        "_id": "DFf9oYUBqS6z6DmM3nW4",
        "_score": 1,
        "_source": {
          "field-3": "pipeline-3",
          "field-1": "pipeline-1",
          "field-2": "pipeline-2",
          "doc_id": 0
        }
      }
    ]
  }
}

Note that field-1, field-2, and field-3 all have the expected values, indicating that all three pipelines processed this document.

The _nodes/stats call will give a result like this:

{
  "nodes": {
    "e3BFjLN8STSspUcCnDcXkQ": {
      "ingest": {
        "total": {
          "count": 2,
          [...]
        },
        "pipelines": {
          "pipeline-1": {
            "count": 1,
            [...]
          },
          "pipeline-2": {
            "count": 1,
            [...]
          },
          "pipeline-3": {
            "count": 1,
            [...]
          }
        }
      }
    }
  }
}

The "count" for each individual pipeline is correct: each pipeline was executed against a single document. However, we're double counting at the top level. The "total" "count" is 2 but should have been 1, since only one document was indexed.
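A check like this can be scripted. The following Python sketch assumes the stats response has already been parsed into a dict with the shape shown above; `summarize_ingest_counts` and `total_matches` are hypothetical helper names, and the sample omits the fields elided as `[...]` in the response.

```python
# Sketch only: helper names are hypothetical, and the sample mirrors the
# _nodes/stats response above with the elided fields left out.
def summarize_ingest_counts(stats):
    """Return {node_id: {"total": int, "pipelines": {name: int}}}."""
    summary = {}
    for node_id, node in stats.get("nodes", {}).items():
        ingest = node.get("ingest", {})
        summary[node_id] = {
            "total": ingest.get("total", {}).get("count", 0),
            "pipelines": {
                name: pipeline.get("count", 0)
                for name, pipeline in ingest.get("pipelines", {}).items()
            },
        }
    return summary


def total_matches(stats, expected_docs):
    """True if the summed top-level count equals the number of indexed documents."""
    summary = summarize_ingest_counts(stats)
    return sum(node["total"] for node in summary.values()) == expected_docs


sample = {
    "nodes": {
        "e3BFjLN8STSspUcCnDcXkQ": {
            "ingest": {
                "total": {"count": 2},
                "pipelines": {
                    "pipeline-1": {"count": 1},
                    "pipeline-2": {"count": 1},
                    "pipeline-3": {"count": 1},
                },
            }
        }
    }
}

print(summarize_ingest_counts(sample))
print(total_matches(sample, expected_docs=1))
```

With the response from this issue and one indexed document, `total_matches` returns `False`, which is the discrepancy reported here.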

@elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)
