Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

composite and terms aggregations behave differently on ip fields #50600

Closed
benwtrent opened this issue Jan 3, 2020 · 15 comments · Fixed by #50869
Closed

composite and terms aggregations behave differently on ip fields #50600

benwtrent opened this issue Jan 3, 2020 · 15 comments · Fixed by #50869
Assignees

Comments

@benwtrent
Copy link
Member

TL;DR composite aggs behave differently than terms agg when there is an index in the pattern that matches the prefix of the other indices that does NOT have a mapped field matching the aggregated field. They behave the same if the "prefix index" has the field, but it is indexed as the wrong type.

Assume we have the following indices

PUT test_index-0001
{
  "mappings": {
    "properties": {
      "ip": {
        "type": "ip"
      }
    }
  }
}

PUT test_index-0002
{
  "mappings": {
    "properties": {
      "ip": {
        "type": "ip"
      }
    }
  }
}

PUT test_index-0001/_doc/1
{
  "ip": "10.2.1.1"
}
PUT test_index-0002/_doc/1
{
  "ip": "10.1.1.2"
}

Lets add a new index that does not have an ip field

PUT test_index-0003
{
  "mappings": {
    "properties": {
      "somefield": {
        "type": "ip"
      }
    }
  }
}

Composite aggs behaves as follows (as of 7.5).

GET test_index-0001,test_index-0002,test_index*/_search?size=0
{
  "aggs": {
    "composite_ips": {
      "composite": {
        "sources": [
          {
            "ip": {
              "terms": {
                "field": "ip"
              }
            }
          }
        ]
      }
    }
  }
}
>
"aggregations" : {
    "composite_ips" : {
      "after_key" : {
        "ip" : "10.2.1.1"
      },
      "buckets" : [
        {
          "key" : {
            "ip" : "10.1.1.2"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "ip" : "10.2.1.1"
          },
          "doc_count" : 1
        }
      ]
    }
  }

But if the index that is added is instead:

PUT test_index
{
  "mappings": {
    "properties": {
      "somefield": {
        "type": "ip"
      }
    }
  }
}

The following is returned from the same composite agg (image for readability)
image

Note the difference with a plain terms agg

GET test_index-0001,test_index-0002,test_index*/_search?size=0
{
  "aggs": {
    "term_ips": {
      "terms": {
        "field": "ip"
      }
    }
  }
}
>
"aggregations" : {
    "term_ips" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "10.1.1.2",
          "doc_count" : 1
        },
        {
          "key" : "10.2.1.1",
          "doc_count" : 1
        }
      ]
    }
  }

NOTE: if the test_index has the following mapping, composite aggs and terms behave the SAME (showing raw bytes instead of human readable strings)

DELETE test_index
PUT test_index
{
  "mappings": {
    "properties": {
      "ip": {
        "type": "ip"
      }
    }
  }
}

GET test_index-0001,test_index-0002,test_index*/_search?size=0
{
  "aggs": {
    "term_ips": {
      "terms": {
        "field": "ip"
      }
    }
  }
}

(image for readability)
image

Related issues/PRs:

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@nik9000 nik9000 self-assigned this Jan 3, 2020
@nik9000
Copy link
Member

nik9000 commented Jan 3, 2020

The bug is that that rendering is full of bizzare characters, right? Just making sure.

@benwtrent
Copy link
Member Author

benwtrent commented Jan 3, 2020

@nik9000 I think there are probably two bugs:

  1. That composite aggs behaves differently than terms in scenario 1. Composite shouldn't return garbage if terms does not.
  2. That when the types are are different on the same field name (ip vs. keyword) both terms and composite return garbage. We should either return a nice error or just handle it.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020


NOTE: if the test_index has the following mapping, composite aggs and terms behave the SAME (showing raw bytes instead of human readable strings)

This doesn't reproduce for me in master and 7.x. I'm checking 7.5 now.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

And I can't reproduce it on 7.5.1 either. Here is what I ran.

I can reproduce the composite aggs showing raw bytes though. I'll dig into that one.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

The composite agg reduces itself by picking the formatter from the "first" index. That seems wrong. I'm checking on what the terms agg does.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

The terms aggregation doesn't have this problem because it converts IPs to string on the bucket side. Composite aggregations aren't designed like that so they'll want a different solution to the problem.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

So the composite agg is designed around every the fields in the composite all being the same type. When they aren't it trips assertions, or, if those aren't enabled, it behaves very oddly. I think our options are:

  1. Fail the aggregation if the field types don't match.
  2. Make it support merging field types. The trick is that that isn't a canonical ordering for field types. But there has to be one for composite aggs. We could certainly make one.

Option 2 seem like a bigger lift. I think maybe option 1 is more kind in the short term with option two being something we can talk about later.

I'd like to mention that the terms aggregation doesn't have this problem quite as bad because the conversion from ip to string is done on the data nodes instead of the coordinating nodes. I expect we don't want to do this because strings won't have the same ordering as IPs and composite aggs need reductions to order "at least" as well as we order on the data node.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

We can certainly implement either of the solutions I described above so if the field isn't mapped in some of the indices then everything "just works" which should fix the issues @benwtrent described above.

@nik9000
Copy link
Member

nik9000 commented Jan 6, 2020

I wonder how hard it'd be to add the formatter to the bucket instead of leaving it global to the aggregation. That'd be a chunk of how we'd add support for different field types and it'd fix the errors above at the cost of a couple more bytes across the wire. Which sounds ok with me.

@nik9000
Copy link
Member

nik9000 commented Jan 7, 2020

Another interesting thing - if we wanted to support polymorphic types for the terms in the composite we could use the type information from the formatter and annotate the buckets with that type information. We could also add that type information to the sorting. That means you could get two buckets with the same key but the types would at least be different. Those two buckets likely wouldn't be on the same "page" anyway because we'd do the coarsest sort on type.

@not-napoleon
Copy link
Member

Just as a note, I think Terms does the wrong thing here. Converting from IP to String creates a bunch of edge cases, since a given IP may map to multiple strings (e.g. IPv6 optionally omitting multiple zeros). This can lead to subtle errors where two strings which should be the same IP address end up as different buckets in the aggregation. IMHO, the right thing to do is to enforce that fields must be the same type for all indexes in the aggregation.

@nik9000
Copy link
Member

nik9000 commented Jan 9, 2020

I have a prototype locally that produces the right answer when aggregating across two field IPs and unmapped fields which is quite simple. I think that is probably worth getting in because we do want to support that.

I've played a little with trying to support polymorphic types. It'd be tricky to get it right because of "after_key", I think.

@not-napoleon
Copy link
Member

+1 on fixing it for unmapped fields, that should definitely be something we support. If you want to PR that, I'll be happy to review it.

@nik9000
Copy link
Member

nik9000 commented Jan 10, 2020

+1 on fixing it for unmapped fields, that should definitely be something we support. If you want to PR that, I'll be happy to review it.

That is my plan! I'm trying to reproduce a semi-related bug at the moment but should open a PR today if all goes well.

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 10, 2020
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.

This carefully preserves the formatter from the working bucket.

Closes elastic#50600
nik9000 added a commit that referenced this issue Jan 10, 2020
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.

This carefully preserves the formatter from the working bucket.

Closes #50600
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 10, 2020
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.

This carefully preserves the formatter from the working bucket.

Closes elastic#50600
nik9000 added a commit that referenced this issue Jan 10, 2020
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.

This carefully preserves the formatter from the working bucket.

Closes #50600
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.

This carefully preserves the formatter from the working bucket.

Closes elastic#50600
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants