Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically flatten objects when subobjects:false #97972

Merged

Conversation

piergm
Copy link
Member

@piergm piergm commented Jul 26, 2023

While ingesting documents that contain nested object and the
mapping property subobjects is set to false instead of throwing
a mapping exception and dropping the document(s), we map only
leaf field(s) with their full path as their name.

This has being done by avoiding dynamically mapping of the
intermediate objects and retaining dottedFieldName while parsing.

After this change with the following root level mapping:
{ "subobjects" : false, "properties" : { } }
the ingest both of the following documents will result in the same
mapped fields (time.min and time.max) instead of having the
second rejected:

  1. { "time.min": 1, "time.max": 2 }
  2. { "time": { "max": 1, "max": 2 } }

Subtasks:

  • Skip dynamically mapping objects when parsing a document and an
    object is encountered
  • Fields capable of parsing objects natively, such as geo_point, will now
    be dynamically mapped exclusively through dynamic templates.
  • Provide parsed objects to field mappers only when they are supported

closes #88934

@piergm piergm added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v8.10.0 labels Jul 26, 2023
@piergm piergm self-assigned this Jul 26, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @piergm, I've created a changelog YAML for you.

@piergm piergm requested a review from romseygeek July 26, 2023 13:29
@felixbarny
Copy link
Member

I'm really excited that this is happening. Thanks for working on it ❤️

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another , more in-depth round, and left some comments

@piergm
Copy link
Member Author

piergm commented Aug 22, 2023

@elasticsearchmachine run elasticsearch-ci/packaging-tests-windows-sample

@romseygeek
Copy link
Contributor

I wonder if it's possible to make these changes without touching ContentPath, which is getting considerably more complicated here. I think we can keep all our changes within DocumentParser#parseObjectDynamic, and do something like: if we have a dynamically created object mapper here but subobjects are disallowed, then throw said object mapper away and instead create a new DocumentParserContext with a wrapped XContentParser that prepends the current field name to any of its immediate children, and pass this to parseObjectOrField. I think this would also simplify the handling of the dynamic runtime case: where we currently create a fake object mapper for the context to use, we instead modify the xcontentparser to include the extra field level when it reports field names but stick with the same ObjectMapper parent.

Have a look at XContentSubParser to get an idea of how this might be implemented - keep track of the nesting level, and if we're at level 1 when currentName() is called, return the delegate name with the extra field name prepended, otherwise just delegate.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @piergm - I left a couple of nits and have one question.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all the iterations!

@piergm piergm merged commit 392c497 into elastic:main Aug 24, 2023
bpintea pushed a commit to bpintea/elasticsearch that referenced this pull request Aug 25, 2023
While ingesting documents that contain nested objects and the
mapping property subobjects is set to false instead of throwing
a mapping exception and dropping the document(s), we map only
leaf field(s) with their full path as their name separated by dots.
assertEquals("parent.child2", secondChildName);
assertEquals(XContentParser.Token.START_OBJECT, subParser.nextToken());
assertEquals(XContentParser.Token.FIELD_NAME, subParser.nextToken());
assertEquals("grandChild", subParser.currentName());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you help me understand why we need to prepend the path only if level is 1? shouldn't this be also parent.child2.grandChild ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time that we have a dynamically created object mapper and subobjects:false we create a new context with a FlatteningXContentParser. This is explicit when we create subParser for field parent and is not done (for tests purpouses) for the field child2. In the normal code execution since 'child2' is a dynamically created object mapper itself we would have created a sub-subParser passing as parent name subParser.currentName() that is parent.child2 as was done in this test and therefore the current name here would have being parent.child2.grandChild.
Is for this reason that we only "care" about immediate child, since grandchildren are handled recursively. Otherwise we would have created somewhat of a clone of ContentPath.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for your patient explanation ;)

* @throws IOException If an I/O error occurs during parsing.
*/
@Override
public String currentName() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit surprising that the only method that needs to be overridden here is currentName. If we are flattening docs, shouldn't nextToken get adjusted too in some ways? I find this a little hard to follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we need to distinguish between deeply nested leaf fields, and dynamically created mappers that accept an object, we only flatten a single level at a time. We don't need to adjust anything apart from the lowest-level name, because once we move into a new object we will either pass it to a mapper to parse, or wrap it again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are using the logic of XContentSubParser that I extend. There is keeping track of the current nested level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I think that we should make this parser private and enforce the expected behaviour. If we expect nextToken not to be called on it, have it throw unsupported operation exception? My intention is to reduce the blast radius / unexpected scenarios and making the parser less generic if possible. What do you both think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still expect nextToken to be called, but it delegates to the XContentSubParser implementation (which keeps track of levels, etc). But yes, +1 to making the parser private.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on the private parser, but as @romseygeek said we must not throw when nextToken is called, we expect to call the parent's class nextToken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, thanks.Let's also see if there's other assumptions we are making that could be enforced directly in the parser to prevent it from being misused. Maybe nothing to do, not exactly sure.

* A subclass of XContentSubParser that provides the functionality to flatten
* the field names by prefixing them with the provided parent name.
*/
public class FlatteningXContentParser extends XContentSubParser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for chiming in late, I'd like to better understand how the additional complexity in ContentPath around tracking the full path is replaced with this custom parser. In a previous experience with DotExpandingXContentParser, we have seen that having a custom parser can have quite a few side effects that are difficult to predict (do we forward all the right methods? is the parser only used how we think it is?), because the parser exposes a pretty generic interface that is used in many different ways and we are customizing it for a very specific scenario.

While additional methods to ContentPath increase its surface, I find that they make the contract easier to reason about and easier to track as well. While I agree that ContentPath is not a fantastic abstraction and needs some love, I am not sure that adding a new parser addresses that fairly, in that it introduces all kinds of unknowns, at least in my opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the parser exposes a pretty generic interface that is used in many different ways and we are customizing it for a very specific scenario.

That's a fair point - maybe we should make this a private class within DocumentParserContext?

My reason for preferring it this way is that is makes it much simpler to read the DocumentParser code, which is already fairly gnarly. It also opens the way to further simplifications - I played around a bit with rewriting the Dynamic.RUNTIME logic to use this and it mostly just works, although we'd need to revisit how we handle fields that contain spaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that DocumentParser could be simplified, greatly. But I think that this type of changes just end up moving the complexity elsewhere and possibly cause collateral issues that are difficult to predict. I am being conservative, I know, that's because I've been bitten with DotExpandingXContentParser before.

@@ -486,15 +499,16 @@ private static void parseObjectDynamic(DocumentParserContext context, String cur
);
}
if (dynamicObjectMapper instanceof ObjectMapper) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: here we don't know how the object mapper was created dynamically. In most cases this comes from the usual dynamic mappings (intermediate objects created automatically), but are there cases in which an object may be the result of applying a dynamic template?

The main purpose of this conditional is that we do want to apply dynamic templates that e.g. create ip fields out of objects from docs, and we know implicitly that these can only come from dynamic template because we have no default dynamic mapping for ip fields (or any field that supports parsing objects natively for that matter). I think though that we are also making the assumption that objects are only the result of the default dynamic mappings, while they could be the result of applying a dynamic template. I am not entirely sure what this would cause: such objects could have a different dynamic or enabled property which we are ignoring. I am thinking we may want to consider throwing exception if we get an object mapper as a result of applying a dynamic template. Ideally, we would apply only the dynamic template here so we don't have to guess what happened.

Does this make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out and you are correct there could be the case in which an object may be result of applying dynamic templates.
I do agree on the solution of throwing an exception in this case.
I'll create a PR with all the suggested improvements.
Thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatically flatten objects when subobjects: false
6 participants