Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DDL for Table Constraints in Presto #20034

Open
ClarenceThreepwood opened this issue Jun 30, 2023 · 0 comments
Open

Support DDL for Table Constraints in Presto #20034

ClarenceThreepwood opened this issue Jun 30, 2023 · 0 comments

Comments

@ClarenceThreepwood
Copy link
Contributor

ClarenceThreepwood commented Jun 30, 2023

1.0 Background
The github issue #16413 introduces a framework where Presto’s optimizer can reason about and infer certain (logical) properties about a query plan and then use them to perform a variety of optimizations. These optimizations can benefit significantly if some of these logical properties are defined as part of the tables’ metadata, as in traditional database systems. In this document I shall lay out a proposal for supporting two well-defined properties/constraints that are supported by a broad range of database systems - Primary Keys and Unique Keys. A related property “Not Null” constraints will also be handled. The purpose of this work is to introduce a systematic approach where Presto can start defining these “informational” constraints on tables and subsequently use them in query optimization. As the query optimizer evolves to handle more sophisticated constraints (e.g. foreign key, check constraint) this framework can also be extended accordingly. This proposal is limited to providing DDL support for constraints. It will not introduce any runtime changes and will only feed into the previously mentioned logical property propagation framework when it is enabled. Constraint validation will also be deferred to a later time.

It is widely accepted that constraints on database tables are a powerful tool for query optimization and enable numerous well documented optimizations that are relevant to industry-standard benchmarks. Since Presto does not possess a metadata catalog of its own, and due to its query federation capabilities, we shall start with end-to-end support for Hive Metastore (Hive connector) for these constraint definitions, but keep them open-ended for extensions/implementations in other connectors.

2.0 Externals

Database vendors have made various extensions to the constraint specification as defined in ANSI SQL. Most of these extensions are to support data validation and transaction semantics. Since the primary goal is to support HMS here, we will follow the constraint specification as laid out by Hive with some modifications that ANSI SQL 2016 mandates. These changes are called out in the grammar described below.
In essence, we shall extend the CREATE TABLE and ALTER TABLE statements to allow the definition of Primary Key, Unique Key and Not Null constraints. For the initial version, this work will add grammar support for all the constraint related keywords and phrases as described by Hive, but will not necessarily support all operations that are semantically entailed (more details below).

The following extensions will be implemented in the Presto grammar.

Create Table: Add support for in-line and out-of-line constraint definitions
In-line constraint definition is when the constraints are defined as part of the column definition
CreateTable : CREATE TABLE <table_name> (<column_specification>, …) …
<column_specification> : <column_name> [<inline_constraint_specification>] …
<inline_constraint_specification> : [NOT NULL| PRIMARY KEY| UNIQUE] [ENABLED|DISABLED] [RELY|NOT RELY] [ENFORCED| NOT ENFORCED]

I.e. the column definition in Presto will be extended to support constraints. This is typically used when the constraints are defined on a single column and does not allow explicit naming of these constraints.

Out-of-line constraint definition is when the constraints are defined after all columns have been declared. It is typically used to declare multi-column constraints and/or user-named constraints.
CreateTable : CREATE TABLE <table_name> (<column_definition>, … <outofline_constraint_specifiction>) … 

<outofline_constraint_specification> : [PRIMARY KEY (<column_name>, …) [ENABLED|DISABLED] [RELY|NOT RELY] [ENFORCED| NOT ENFORCED]]
| [CONSTRAINT <constraint_name> UNIQUE (<column_name>, …) [ENABLED|DISABLED] [RELY|NOT RELY] [ENFORCED| NOT ENFORCED]]

According to the Hive spec the inline constraint definition is
<inline_constraint_specification> : [NOT NULL| PRIMARY KEY| UNIQUE] [ENABLE|DISABLE] [RELY|NORELY] [VALIDATE| NOVALIDATE]
However ANSI SQL requires that the equivalent of "VALIDATE|NOVALIDATE" be "ENFORCED|NOT ENFORCED". Therefore if deference to the sql standard we shall modify all the constraint specifications along these lines in order to be consistent

Alter Table: Table constraints may be added, removed, or have the constraint properties modified via the ALTER TABLE command.

ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> [PRIMARY KEY| UNIQUE] (<column_name>, …) [ENABLED|DISABLED] [RELY|NOT RELY] [ENFORCED| NOT ENFORCED]
ALTER TABLE <table_name> CHANGE COLUMN <column_name> CONSTRAINT <constraint_name> NOT NULL [ENABLED|DISABLED]
ALTER TABLE DROP CONSTRAINT <constraint_name>

Comparison with other database systems/connectors:

In general, while the actual creation/definition of table constraints is well grounded, some of the analytic dbs of more recent years have custom behavior when it comes to validating and relying on these constraints. Traditional database systems have a well-defined grammar and what is described in the previous section would allow for these connectors to easily implement their own version of table constraints in the future.

Oracle - [https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj13590.html]
Postgres - [https://www.postgresql.org/docs/current/ddl-constraints.html#id-1.5.4.6.6]

As seen from the references above, the syntax for table/column constraints in Postgres and Oracle are straightforward and a strict subset of the Hive grammar. I.e. for the create and alter table statements, PKs , unique and not null constraints are defined as follows

CreateTable : CREATE TABLE <table_name> (<column_specification>, …) …
<column_specification> : <column_name> [<inline_constraint_specification>] …
<inline_constraint_specification> : [NOT NULL| PRIMARY KEY| UNIQUE]

CreateTable : CREATE TABLE <table_name> (<column_definition>, … <outofline_constraint_specifiction>) …
<outofline_constraint_specification> : [PRIMARY KEY (<column_name>, …)
| [CONSTRAINT <constraint_name> UNIQUE (<column_name>, …)

ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> [PRIMARY KEY| UNIQUE] (<column_name>, …)
ALTER TABLE <table_name> CHANGE COLUMN <column_name> CONSTRAINT <constraint_name> NOT NULL
ALTER TABLE DROP CONSTRAINT <constraint_name>

These databases do not support “disable” or “validate” for PK, Unique or Not Null constraints since the data is validated during the DDL execution and these constraints are maintained across subsequent DML statements. Therefore these constraints may always be relied upon. The grammar for creating constraints in these systems is also ANSI-SQL compatible and therefore identical to the proposed grammar.

In contrast, consider BigQuery’s constraint specification:

CreateTable : CREATE TABLE <table_name> (<column_specification>, …) …
<column_specification> : <column_name> [<inline_constraint_specification>] …
<inline_constraint_specification> : [NOT NULL| PRIMARY KEY] NOT ENFORCED

CreateTable : CREATE TABLE <table_name> (<column_definition>, … <outofline_constraint_specifiction>) …<outofline_constraint_specification> : [PRIMARY KEY (<column_name>, …) NOT ENFORCED

Since BigQuery does not perform data validation, all constraint definitions are required to be annotated with “NOT ENFORCED”. However the constraints themselves may be used by the query optimizer to produce more optimal plans and therefore it is the user’s responsibility to ensure that the data conforms to the declared constraints. ENFORCED is not a keyword that is supported by the Hive grammar. Therefore if we want to support creating tables and constraints in the BigQuery connector, this word would either need to be added to the grammar, or we could map an existing word (say NOVALIDATE) to “NOT ENFORCED”.

The underlying philosophy here is that each connector that wants to support table constraints will need to implement custom logic within to process the statement. If that custom logic does not exist, we will still want to error out gracefully. This is similar to most functionality within the Presto connector framework. In summary, the Hive extensions to support table constraints appear to form a superset of what other query engines and databases support and it should be a safe bet to start with this set. Note that this is not meant to be an exhaustive treatise on all the supported connectors in Presto.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 🏗 In progress
Development

No branches or pull requests

2 participants