Split processor in Pipelines removing the trailing empty values #48498

Closed
pankaj-k opened this issue Oct 25, 2019 · 1 comment · Fixed by #48664
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP

Comments

@pankaj-k

Elasticsearch version: 7.2.0

Plugins installed: [none]

JVM version: java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

OS version: Windows 10, Version 1709

Description of the problem including expected versus actual behavior:
Using the split processor to split a CSV line (there is no csv processor for ingest pipelines) drops the trailing empty values.
Actual behaviour: A,,B,, gives A, '', B.
Expected behaviour: A, '', B, '', ''

This matches the default behaviour of Java's String.split, but Java also provides an overload that takes a limit parameter, and passing -1 retains the trailing empty strings. The split processor offers no equivalent option.
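
The Java behaviour referred to above can be reproduced with a minimal sketch like the following. The class and method names (`SplitDemo`, `splitDefault`, `splitKeepingTrailing`) are hypothetical and exist only for illustration; the sketch shows how `String.split` treats trailing empty strings, and is not the split processor's actual implementation.

```java
import java.util.Arrays;

public class SplitDemo {
    // Default String.split behaves like a limit of 0:
    // trailing empty strings are removed from the result.
    static String[] splitDefault(String line, String separatorRegex) {
        return line.split(separatorRegex);
    }

    // A negative limit applies the pattern as many times as possible
    // and keeps trailing empty strings.
    static String[] splitKeepingTrailing(String line, String separatorRegex) {
        return line.split(separatorRegex, -1);
    }

    public static void main(String[] args) {
        // prints [A, , B]
        System.out.println(Arrays.toString(splitDefault("A,,B,,", ",")));
        // prints [A, , B, , ]
        System.out.println(Arrays.toString(splitKeepingTrailing("A,,B,,", ",")));
        // prints [A, , B, , C] (interior empty values are always kept)
        System.out.println(Arrays.toString(splitDefault("A,,B,,C", ",")));
    }
}
```

With a limit of -1 the line A,,B,, yields five values, which is the result this report expects from the split processor.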

Steps to reproduce:

  1. Create a simple pipeline:
PUT _ingest/pipeline/test_pipeline
{
  "description": "test",
  "processors": [
    {
      "split": {
        "field": "message",
        "target_field": "splitdata",
        "separator": ","
      }
    }
  ]
}
  2. Test it.
GET _ingest/pipeline/test_pipeline/_simulate
{
  "docs": [
    {
      "_source" :{
        "message" : "A,,B,,"
      }
    }
  ]
}
  3. Results:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : "A,,B,,",
          "splitdata" : [
            "A",
            "",
            "B"
          ]
        },
        "_ingest" : {
          "timestamp" : "2019-10-23T04:25:26.277Z"
        }
      }
    }
  ]
}

The two empty values after 'B' are dropped.

  4. Test with a different input.
GET _ingest/pipeline/test_pipeline/_simulate
{
  "docs": [
    {
      "_source" :{
        "message" : "A,,B,,C"
      }
    }
  ]
}
  5. Result:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : "A,,B,,C",
          "splitdata" : [
            "A",
            "",
            "B",
            "",
            "C"
          ]
        },
        "_ingest" : {
          "timestamp" : "2019-10-23T04:27:38.400Z"
        }
      }
    }
  ]
}

Here the empty values are preserved, because they are not at the end of the line.

@cbuescher added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Oct 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)
