-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative approach to logical type variations. #25
Alternative approach to logical type variations. #25
Conversation
enum PhysicalType { | ||
SYSTEM_DEFAULT = 0; | ||
} | ||
message Variation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible more natural way of modeling this in protobuf is to use have different physical type systems define extensions on either the Type
or Pair
message Did you consider that as an alternative? (I don't think this needs to hold up the PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see much benefit to the extensions approach in this case. Generally, it is an interesting question whether to expose some objects with extensions but I generally feel like any kind of pure black box that an extension would provide would weaken the spec. In the pattern here the spec still has information about how the extended functionality works, even if declared beyond the core project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extensions provide the ability to more strongly model variations. There would still be two classes:
- In spec (those checked into this repo) and an extension range would be reserved for these).
- Those not checked into the repo (blackbox) and would live in a separate range.
I was thinking have one extension for each system type. e.g. ArrowVarationExtension
or AvroVariationExtension
. At least for Arrow variations seem like they might stack (largestring and dictionary encoding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really seem like extensions provide a stronger model. For example, when trying to decide whether a particular function can be bound to a particular type:
- With the approach provided in the PR, the rules about binding are clear. If the variation inherits from logical, the functions can be bound to any function declared against the base logical type. It also fully defines what a complete type definition is.
- If using extensions, when trying to bind, can you bind to a function definition for the base logical type against a type that has one or more extensions properties defined?
I also haven't seen extensions successfully in a project in this way (it seems like they are only a bucket for arbitrary as opposed to structured information).
@westonpace, any thoughts on this approach? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree on "function behavior" but I wonder too if this might be a topic that is best to revisit once we've finished later parts of the spec such as function support and write/read operations.
enum PhysicalType { | ||
SYSTEM_DEFAULT = 0; | ||
} | ||
Variation variation = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no protobuf expert but can variation
be made optional? What is the equivalent of SYSTEM_DEFAULT
from before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In proto 3, all items are optional. So system default would be an undefined variation (at least that was my thinking).
| Parent Type | The parent type this variation belongs to. Variations can only be expressed for simple types and wild-carded compound types (e.g. i8 or varchar(*)). | | ||
| Name | The name used to reference this type. Should be unique within type variations for this parent type within an organization. | | ||
| Description | A human description of the purpose of this type variation | | ||
| Function Behavior | **Inherits** or **Independent**: Whether this variation supports functions using the canonical variation or whether functions should be resolved independently. For example if one has the function `add(i8,i8)` defined and then defines an i8 variation, can the i8 variation field be bound to the base `add` operation (inherits) or does a specialized version of `add` need to be defined specifically for this type variation (independent). Defaults to inherits. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "function behavior" is independent
then I would argue it is not a variation but a separate logical type at that point. I may be misunderstanding the definition here though. My thinking of "inherits" means "these two types have the same semantic meaning". So "independent" would mean that the type has a different semantic meaning.
In other words. I see two possible interpretations (neither of which are very appealing)
- Let there be a function F. The term
inherits
means that ifx AS VARIATION v1 == x AS VARIATION v2
andF(x) => y
theny AS VARIATION v1 == y AS VARIATION v2
.- I don't like this interpretation because I think this should be true for every variation no matter what.
- If a consumer supports function F on type
X as variation v1
then it MUST support function F on typeX as variation v2
- I'm not a fan of this one because I think there will always be function signatures that aren't supported by one particular engine or another and the fact is unrelated to variations. Plus, a query planner could always discover this via function introspection and presumably cast to a supported variation (or the consumer itself could support this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine a situation where you have a physical implementation function that only works for dictionary encoded strings (but does not work for the "system default" strings). Both values are logically strings but only a subset of operations may be available for one physical representation. That is the purpose of independent. If you're using inherits, it means that a function defined for the base type would be available for the variation.
To me, part of having physical variations is that there will likely be some operations that only work against one of the variations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I agree those situations may exist. I just don't think Substrait should concern itself with defining which situations are valid or invalid. Here are two concrete examples:
- Arithmetic functions in Arrow work on regular numbers as well as dictionary encoded numbers. This is an example where we support functions for all physical representations.
- The system default timestamp representation may be a 64 bit integer. An alternative is a packed struct with year/month/day/... Field extraction (e.g. get_hour) on the former is expensive. Field extraction on the latter is trivial. Users typically want multiple fields (e.g. it's rare to just ask for the hour, often you want hour, minutes, and seconds). So, an engine may choose to only implement the field extraction kernel for the packed struct representation, forcing a cast.
Is it an error, in Substrait's eyes, if an engine fails to support arithmetic functions on dictionary numbers? Is it an error if an engine chooses to allow field extraction on the system default representation?
I don't think Substrait should care.
site/docs/types/type_variations.md
Outdated
|
||
| Property | Description | | ||
| ----------------- | ------------------------------------------------------------ | | ||
| Parent Type | The parent type this variation belongs to. Variations can only be expressed for simple types and wild-carded compound types (e.g. i8 or varchar(*)). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Logical Type
instead of Parent Type
? When I see Parent
I think tree but there is only one level of nesting here. That being said, I don't feel too strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to avoid using logical since it is overloaded in relational algebra. Maybe "base" instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to base. lmk your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Base works for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I figure we can revisit when we start discussing functions in earnest. @westonpace and @emkornfield you good for this to be merged as is?
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @emkornfield and @westonpace)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's definitely an improvement. I'm fine waiting until we start talking more in depth about functions to worry about the inherits/independent issue.
…t-io#25) * Fix bug causing aggregate function invocation to be dropped * Add visitor to perform deep traversal across child relations
Add new type variation concept with short description
Add sketch example variation definitions
Remove some old unsigned types (cleanup)
This change is