This document aims to familiarize the reader with SQLGlot's codebase & architecture.
Generally, some background knowledge about programming languages / compilers is required. A good starting point is Crafting Interpreters by Robert Nystrom, which served as the foundation when SQLGlot was initially created.
At a high level, SQLGlot transpilation involves three modules:
- Tokenizer: Converts raw code into a sequence of “tokens”, one for each word/symbol
- Parser: Converts sequence of tokens into an abstract syntax tree representing the semantics of the raw code
- Generator: Converts an abstract syntax tree into SQL code
SQLGlot can transpile SQL between different SQL dialects, which are usually associated with different database systems.
Each dialect has unique syntax that is implemented by overriding the base versions of the three modules. A dialect definition may override any or all of the three base modules.
The base versions of the modules implement the sqlglot
dialect (sometimes referred to as "base dialect"), which is designed to accommodate as many common syntax elements as possible across the other supported dialects, thus preventing code duplication.
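For a quick end-to-end picture before diving into each module, the top-level sqlglot.transpile helper runs all three in sequence. The snippet below is adapted from the project README; the exact output string may vary slightly between versions.
import sqlglot
# Tokenize DuckDB SQL, parse it into an AST, then generate Hive SQL from that AST
sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0]
# 'SELECT FROM_UNIXTIME(1618088028295 / 1000)'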
SQLGlot includes other modules which are not required for basic transpilation, such as the Optimizer and Executor. The Optimizer modifies an abstract syntax tree for other uses (e.g., inferring column-level lineage). The Executor runs SQL code in Python, but its engine is still in an experimental stage, so some functionality may be missing.
The rest of this document describes the three base modules, dialect-specific overrides, and the Optimizer module.
The tokenizer module (tokens.py
) is responsible for breaking down SQL code into tokens (like keywords, identifiers, etc.), which are the smallest units of meaningful information. This process is also known as lexing/lexical analysis.
Python is not designed for maximum processing speed/performance, so SQLGlot maintains an equivalent Rust version of the tokenizer in sqlglotrs/tokenizer.rs
for improved performance.
Important
Changes in the tokenization logic must be reflected in both the Python and the Rust tokenizer. The goal is for the two implementations to be similar so that there is less cognitive effort to port code between them.
This example demonstrates the results of using the Tokenizer to tokenize the SQL query SELECT b FROM table WHERE c = 1
:
from sqlglot import tokenize
tokens = tokenize("SELECT b FROM table WHERE c = 1")
for token in tokens:
print(token)
# <Token token_type: <TokenType.SELECT: 'SELECT'>, text: 'SELECT', line: 1, col: 6, start: 0, end: 5, comments: []>
# <Token token_type: <TokenType.VAR: 'VAR'>, text: 'b', line: 1, col: 8, start: 7, end: 7, comments: []>
# <Token token_type: <TokenType.FROM: 'FROM'>, text: 'FROM', line: 1, col: 13, start: 9, end: 12, comments: []>
# <Token token_type: <TokenType.TABLE: 'TABLE'>, text: 'table', line: 1, col: 19, start: 14, end: 18, comments: []>
# <Token token_type: <TokenType.WHERE: 'WHERE'>, text: 'WHERE', line: 1, col: 25, start: 20, end: 24, comments: []>
# <Token token_type: <TokenType.VAR: 'VAR'>, text: 'c', line: 1, col: 27, start: 26, end: 26, comments: []>
# <Token token_type: <TokenType.EQ: 'EQ'>, text: '=', line: 1, col: 29, start: 28, end: 28, comments: []>
# <Token token_type: <TokenType.NUMBER: 'NUMBER'>, text: '1', line: 1, col: 31, start: 30, end: 30, comments: []>
The Tokenizer scans the query and converts groups of symbols (or lexemes), such as words (e.g., b
) and operators (e.g., =
), into tokens. For example, the VAR
token is used to represent the identifier b
, whereas the EQ
token is used to represent the "equals" operator.
Each token contains information such as its type (token_type
), the lexeme (text
) it encapsulates and other metadata such as the lexeme's line (line
) and column (col
), which are used to accurately report errors.
SQLGlot’s TokenType
enum provides an indirection layer between lexemes and their types. For example, !=
and <>
are often used interchangeably to represent "not equals", so SQLGlot groups them together by mapping them both to TokenType.NEQ
.
The Tokenizer.KEYWORDS
and Tokenizer.SINGLE_TOKENS
are two important dictionaries that map lexemes to the corresponding TokenType
enum values.
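As a small illustration of these two dictionaries (assuming the default sqlglot tokenizer; the exact enum members could differ between versions):
from sqlglot.tokens import Tokenizer
# Multi-character lexemes are looked up in KEYWORDS, single characters in SINGLE_TOKENS
Tokenizer.KEYWORDS["!="]      # expected: TokenType.NEQ
Tokenizer.KEYWORDS["<>"]      # expected: TokenType.NEQ
Tokenizer.SINGLE_TOKENS["("]  # expected: TokenType.L_PAREN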
The parser module takes the list of tokens generated by the tokenizer and builds an abstract syntax tree (AST) from them.
Generally, an AST stores the meaningful constituent components of a SQL statement. For example, to represent a SELECT
query we only need to store its projections, filters, aggregation expressions and so on, but the SELECT
, WHERE
and GROUP BY
keywords don't need to be preserved, since they are implied.
In SQLGlot's case, the AST usually stores information that captures the semantics of an expression, i.e. its meaning, which is what allows it to transpile SQL code between different dialects. A blog post that explains this idea at a high-level can be found here.
The parser processes the tokens corresponding to a SQL statement sequentially and produces an AST to represent it. For example, if a statement’s first token is CREATE
, the parser will determine what is being created by examining the next token (which could be SCHEMA
, TABLE
, VIEW
, or something else).
Each semantic concept is represented by an expression in the AST. Converting tokens to expressions is the primary task of the parser.
As the AST is a core concept of SQLGlot, a more detailed tutorial on its representation, traversal, and transformation can be found here.
The next sections describe how the parser processes a SQL statement’s list of tokens, followed by an explanation of how dialect-specific parsing works.
The parser processes a token list sequentially, and maintains an index/cursor that points to the next token to be consumed.
Once the current token has been successfully processed, the parser moves to the next token in the sequence by calling _advance()
to increment the index.
If a token has been consumed but could not be parsed, the parser can move backward to the previous token by calling _retreat()
to decrement the index.
In some scenarios, the parser’s behavior depends on both the current and surrounding tokens. The parser can “peek” at the previous and next tokens without consuming them by accessing the _prev
and _next
properties, respectively.
When parsing a SQL statement, the parser must identify specific sets of keywords to correctly build the AST. It does so by “matching” sets of tokens or keywords to infer the structure of the statement.
For example, these matches occur when parsing the window specification SUM(x) OVER (PARTITION BY y)
:
- The SUM and ( tokens are matched and x is parsed as an identifier
- The ) token is matched
- The OVER and ( tokens are matched
- The PARTITION and BY tokens are matched
- The partition clause y is parsed
- The ) token is matched
The family of _match
methods perform different varieties of matching. They return True
and advance the index if there is a match, or return None
if the tokens do not match.
The most commonly used _match
methods are:
Method | Use |
---|---|
_match(type) | Attempt to match a single token with a specific TokenType |
_match_set(types) | Attempt to match a single token in a set of TokenTypes |
_match_text_seq(texts) | Attempt to match a continuous sequence of strings/texts |
_match_texts(texts) | Attempt to match a keyword in a set of strings/texts |
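The following hypothetical method shows how these matchers might drive the window-specification walkthrough above; the real _parse_window logic in parser.py is considerably more involved.
from sqlglot import exp
from sqlglot.parser import Parser
from sqlglot.tokens import TokenType

class MatchSketchParser(Parser):
    # Hypothetical: recognize "OVER (PARTITION BY y [, ...])" after a function call
    def _parse_simple_over(self):
        if not self._match(TokenType.OVER):                  # OVER
            return None
        self._match(TokenType.L_PAREN)                       # (
        partition = []
        if self._match_text_seq("PARTITION", "BY"):          # PARTITION BY
            partition = self._parse_csv(self._parse_id_var)  # y [, ...]
        self._match(TokenType.R_PAREN)                       # )
        return self.expression(exp.Window, partition_by=partition)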
SQLGlot’s parser follows the “recursive descent” approach, meaning that it relies on mutually recursive methods in order to understand SQL syntax and the various precedence levels between combined SQL expressions.
For example, the SQL statement CREATE TABLE a (col1 INT)
will be parsed into an exp.Create
expression that includes the kind of object being created (table) and its schema (table name, column names and types). The schema will be parsed into an exp.Schema
node that contains one column definition exp.ColumnDef
node for each column.
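One way to see this structure is to parse the statement and inspect the resulting nodes (the shape shown in the comments is abbreviated and may differ slightly between versions):
import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one("CREATE TABLE a (col1 INT)")
isinstance(ast, exp.Create)       # True; the "kind" arg should be 'TABLE'
isinstance(ast.this, exp.Schema)  # True; the schema wraps the table name...
ast.find(exp.ColumnDef)           # ...and one ColumnDef per column, e.g. col1 INT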
The relationships between SQL constructs are encoded in methods whose names begin with “parse”, such as _parse_create()
. We refer to these as “parsing methods”.
For instance, to parse the statement above the entrypoint method _parse_create()
is called. Then:
- In _parse_create(), the parser determines that a TABLE is being created
- Because a table is being created, the table name is expected next and thus _parse_table_parts() is invoked; this is because the table name may have multiple parts, such as catalog.schema.table
- Having processed the table name, the parser continues by parsing the schema definition using _parse_schema()
A key aspect of the SQLGlot parser is that the parsing methods are composable and can be passed as arguments. We describe how that works in the next section.
SQLGlot provides helper methods for common parsing tasks. They should be used whenever possible to reduce code duplication and manual token/text matching.
Many helper methods take a parsing method argument, which is invoked to parse part(s) of the clause they're supposed to parse.
For example, these helper methods parse collections of SQL objects like col1, col2
, (PARTITION BY y)
, and (colA TEXT, colB INT)
, respectively:
Method | Use |
---|---|
_parse_csv(parse_method) | Attempt to parse comma-separated values using the appropriate parse_method callable |
_parse_wrapped(parse_method) | Attempt to parse a specific construct that is (optionally) wrapped in parentheses |
_parse_wrapped_csv(parse_method) | Attempt to parse comma-separated values that are (optionally) wrapped in parentheses |
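For example, a dialect method that needs to accept a column list written either as colA, colB or as (colA, colB) could lean on these helpers as in this hypothetical sketch. Only _parse_wrapped_csv and _parse_id_var are real base Parser methods; the optional=True flag is assumed to allow the parentheses to be omitted, per the table above.
from sqlglot.parser import Parser

class CsvSketchParser(Parser):
    # Hypothetical: parse an optionally parenthesized, comma-separated identifier list
    def _parse_column_name_list(self):
        return self._parse_wrapped_csv(self._parse_id_var, optional=True)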
The SQL language specification and dialect variations are vast, and SQLGlot cannot parse every possible statement.
Instead of erroring on statements it cannot parse, SQLGlot falls back to storing the code that cannot be parsed in an exp.Command
expression.
This allows SQLGlot to return the code unmodified even though it cannot parse it. Because the code is unmodified, any dialect-specific components will not be transpiled.
The base parser’s goal is to represent as many common constructs from different SQL dialects as possible. This makes the parser more lenient and keeps its code concise and free of repetition.
Warning
SQLGlot does not aim to be a SQL validator, so it may not detect certain syntax errors.
Dialect-specific parser behavior is implemented in two ways: feature flags and parser overrides.
If two different parsing behaviors are common across dialects, the base parser may implement both and use feature flags to determine which should be used for a specific dialect. In contrast, parser overrides directly replace specific base parser methods.
Therefore, each dialect’s Parser class may specify:
- Feature flags such as SUPPORTS_IMPLICIT_UNNEST or SUPPORTS_PARTITION_SELECTION, which are used to control the behavior of base parser methods.
- Sets of tokens that play a similar role, such as RESERVED_TOKENS, which cannot be used as object names.
- Sets of token -> Callable mappings, implemented with Python lambda functions:
  - When the parser comes across the token (string or TokenType) on the left-hand side, it invokes the mapping function on the right to create the corresponding Expression / AST node.
  - The lambda function may return an expression directly or a _parse_* method that determines what expression should be returned.
  - The mappings are grouped by general semantic type (e.g., functions, parsers for ALTER statements, etc.), with each type having its own dictionary.
An example token -> Callable
mapping is the FUNCTIONS
dictionary that builds the appropriate exp.Func
node based on the string key:
FUNCTIONS: t.Dict[str, t.Callable] = {
"LOG2": lambda args: exp.Log(this=exp.Literal.number(2), expression=seq_get(args, 0)),
"LOG10": lambda args: exp.Log(this=exp.Literal.number(10), expression=seq_get(args, 0)),
"MOD": build_mod,
…,
}
In this dialect, the LOG2()
function calculates a logarithm of base 2, which is represented in SQLGlot by the general exp.Log
expression.
The logarithm base and the user-specified value to log
are stored in the node's arguments. The former is stored as an integer literal in the this
argument, and the latter is stored in the expression
argument.
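The mapping can be reproduced by hand to see the resulting node. The generated SQL in the comment is what the default sqlglot dialect would be expected to emit; other dialects may render it differently.
from sqlglot import exp
from sqlglot.helper import seq_get

# LOG2(x) -> exp.Log with the base in `this` and the logged value in `expression`
args = [exp.column("x")]
node = exp.Log(this=exp.Literal.number(2), expression=seq_get(args, 0))
node.sql()  # expected: 'LOG(2, x)'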
After the parser has created an AST, the generator module is responsible for converting it back into SQL code.
Each dialect’s Generator class may specify:
- A set of flags such as ALTER_TABLE_INCLUDE_COLUMN_KEYWORD. As with the parser, these flags control the behavior of base Generator methods.
- Sets of Expression -> str mappings:
  - The generator traverses the AST recursively and produces the equivalent string for each expression it visits. This can be thought of as the reverse operation of the parser, where expressions / AST nodes are converted back to strings.
  - The mappings are grouped by general semantic type (e.g., data type expressions), with each type having its own dictionary.
The Expression -> str
mappings can be implemented as either:
- Entries in the TRANSFORMS dictionary, which are best suited for single-line generations.
- Functions with names of the form <expr_name>_sql(...). For example, to generate SQL for an exp.SessionParameter node we can override the base generator method by defining the method sessionparameter_sql(...) in a dialect’s Generator class.
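A sketch of the second form, assuming we want to override how exp.SessionParameter is generated in some dialect (the "@@" output format chosen here is invented purely for illustration):
from sqlglot import exp
from sqlglot.generator import Generator

class SketchGenerator(Generator):
    # Hypothetical override: render a session parameter as "@@<name>"
    def sessionparameter_sql(self, expression: exp.SessionParameter) -> str:
        return f"@@{self.sql(expression, 'this')}"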
The key method in the Generator module is the sql()
function, which is responsible for generating the string representation of each expression.
First, it locates the mapping of expression to string or generator Callable
by examining the keys of the TRANSFORMS
dictionary and searching the Generator
's methods for an <expr_name>_sql()
method.
It then calls the appropriate callable to produce the corresponding SQL code.
Important
If there is both a TRANSFORMS
entry and an <expr_name>_sql(...)
method for a given expression, the generator will use the former to convert the expression into a string.
The Generator defines helper methods containing quality-of-life abstractions over the sql()
method, such as:
Method | Use |
---|---|
expressions() | Generates the string representation for a list of Expressions, employing a series of arguments to help with their formatting |
func(), rename_func() | Given a function name and an Expression, returns the string representation of a function call by generating the expression's args as the function's args |
These methods should be used whenever possible to reduce code duplication.
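For instance, rename_func is commonly used inside TRANSFORMS when an expression maps one-to-one onto a differently named function in the target dialect. The APPROX_COUNT_DISTINCT mapping below is illustrative rather than taken from a specific dialect.
from sqlglot import exp
from sqlglot.dialects.dialect import rename_func
from sqlglot.generator import Generator

class RenameSketchGenerator(Generator):
    TRANSFORMS = {
        **Generator.TRANSFORMS,
        # Generate exp.ApproxDistinct as APPROX_COUNT_DISTINCT(<args>)
        exp.ApproxDistinct: rename_func("APPROX_COUNT_DISTINCT"),
    }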
Pretty printing refers to the process of generating formatted and consistently styled SQL, which improves readability, debugging, and code reviewing.
SQLGlot generates an AST’s SQL code on a single line by default, but users may specify that it be pretty
and include new lines and indenting.
It is up to the developer to ensure that whitespaces and newlines are embedded in the SQL correctly. To aid in that process, the Generator class provides the sep()
and seg()
helper methods.
We briefly mentioned dialect-specific behaviors while describing the SQLGlot base Tokenizer, Parser, and Generator modules. This section expands on how to specify a new SQL dialect.
As described above, SQLGlot attempts to bridge dialect-specific variations in a "superset" dialect, which we refer to as the sqlglot dialect.
All other SQL dialects have their own Dialect subclass whose Tokenizer, Parser and Generator components extend or override the base modules as needed.
The base dialect components are defined in tokens.py
, parser.py
and generator.py
. Dialect definitions can be found under dialects/<dialect>.py
.
When adding features that will be used by multiple dialects, it's best to place the common/overlapping parts in the base sqlglot dialect to prevent code repetition across other dialects. Any dialect-specific features that cannot (or should not) be repeated can be specified in the dialect’s subclass.
The Dialect class contains flags that are visible to its Tokenizer, Parser and Generator components. Flags are only added in a Dialect when they need to be visible to at least two components, in order to avoid code duplication.
Creating a new SQL dialect may seem complicated at first, but it is actually quite simple in SQLGlot.
This example demonstrates defining a new dialect named Custom
that extends or overrides components of the base Dialect modules:
import typing as t

from sqlglot import exp, generator, parser, tokens
from sqlglot.dialects.dialect import Dialect
from sqlglot.tokens import TokenType

class Custom(Dialect):
    class Tokenizer(tokens.Tokenizer):
        QUOTES = ["'", '"']  # Strings can be delimited by either single or double quotes
        IDENTIFIERS = ["`"]  # Identifiers can be delimited by backticks

        # Associates certain meaningful words with tokens that capture their intent
        KEYWORDS = {
            **tokens.Tokenizer.KEYWORDS,
            "INT64": TokenType.BIGINT,
            "FLOAT64": TokenType.DOUBLE,
        }

    class Parser(parser.Parser):
        # Specifies how certain tokens are mapped to function AST nodes
        FUNCTIONS = {
            **parser.Parser.FUNCTIONS,
            "APPROX_PERCENTILE": exp.ApproxQuantile.from_arg_list,
        }

        # Specifies how a specific construct, e.g. CONSTRAINT, is parsed
        def _parse_constraint(self) -> t.Optional[exp.Expression]:
            # _parse_projection_def stands for a dialect-specific helper defined elsewhere in this class
            return super()._parse_constraint() or self._parse_projection_def()

    class Generator(generator.Generator):
        # Specifies how AST nodes, i.e. subclasses of exp.Expression, should be converted into SQL
        TRANSFORMS = {
            exp.Array: lambda self, e: f"[{self.expressions(e)}]",
        }

        # Specifies how AST nodes representing data types should be converted into SQL
        TYPE_MAPPING = {
            exp.DataType.Type.TINYINT: "INT64",
            exp.DataType.Type.SMALLINT: "INT64",
            ...
        }
Even though this example is a fairly realistic starting point, we strongly encourage you to study existing dialect implementations to understand how their various components can be modified.
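Assuming the Custom dialect above is in scope, it can be exercised like any built-in dialect, since subclassing Dialect registers it automatically. The expected output below follows from the INT64 keyword mapping; the default write dialect is sqlglot.
import sqlglot

# INT64 is tokenized as BIGINT by Custom's Tokenizer, so the default dialect prints BIGINT
sqlglot.transpile("SELECT CAST(x AS INT64)", read=Custom)[0]
# expected: 'SELECT CAST(x AS BIGINT)'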
Previous sections described the SQLGlot components used for transpilation. This section and the next describe components needed to implement other SQLGlot functionality like column-level lineage.
A Schema represents the structure of a database schema, including the tables/views it contains and their column names and data types.
The file schema.py
defines the abstract classes Schema
and AbstractMappingSchema
; the implementing class is MappingSchema
.
A MappingSchema
object is defined with a nested dictionary, as in this example:
from sqlglot.schema import MappingSchema

schema = MappingSchema({"t": {"x": "int", "y": "datetime"}})
The keys of the top-level dictionary are table names (t
), the keys of t
’s dictionary are its column names (x
and y
), and the values are the column data types (int
and datetime
).
A schema provides information required by other SQLGlot modules (most importantly, the optimizer and column level lineage).
Schema information is used to enrich ASTs with actions like replacing stars (SELECT *
) with the names of the columns being selected from the upstream tables.
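Continuing the example above, the schema object can answer the questions other modules ask during optimization. The lookup helpers shown below are part of MappingSchema, though their exact signatures and return types may vary between versions.
from sqlglot.schema import MappingSchema

schema = MappingSchema({"t": {"x": "int", "y": "datetime"}})
schema.column_names("t")          # expected: ['x', 'y']
schema.get_column_type("t", "x")  # expected: a DataType node representing INT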
The optimizer module in SQLGlot (optimizer.py
) is responsible for producing canonicalized and efficient SQL queries by applying a series of optimization rules.
Optimizations include simplifying expressions, removing redundant operations, and rewriting queries for better performance.
The optimizer operates on the abstract syntax tree (AST) returned by the parser, transforming it into a more compact form while preserving the semantics of the original query.
Note
The optimizer performs only logical optimization; the underlying engine is almost always better at optimizing for performance.
The optimizer essentially applies a list of rules in a sequential manner, where each rule takes in an AST and produces a canonicalized and optimized version of it.
Warning
The rule order is important, as some rules depend on others to work properly. We discourage manually applying individual rules, which may result in erroneous or unpredictable behavior.
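Instead, the optimize entry point applies the full rule sequence in the correct order. A minimal example, with the output shown only approximately since it depends on version and dialect:
import sqlglot
from sqlglot.optimizer import optimize

expression = sqlglot.parse_one("SELECT * FROM tbl WHERE 1 = 1 AND col > 5")
optimize(expression, schema={"tbl": {"col": "INT"}}).sql()
# roughly: 'SELECT "tbl"."col" AS "col" FROM "tbl" AS "tbl" WHERE "tbl"."col" > 5'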
The first, foundational optimization task is to standardize an AST, which is required by other optimization steps. The rules described below implement this canonicalization.
The most important rule in the optimizer is qualify
, which is responsible for rewriting the AST such that all tables and columns are normalized and qualified.
The normalization step (normalize_identifiers.py
) transforms some identifiers by converting them into lower or upper case, ensuring the semantics are preserved in each case (e.g. by respecting case-sensitivity).
This transformation reflects how identifiers would be resolved by the engine corresponding to each SQL dialect, and plays a very important role in the standardization of the AST.
Note
Some dialects (e.g. DuckDB) treat identifiers as case-insensitive even when they're quoted, so all identifiers are normalized.
Different dialects may have different normalization strategies. For instance, in Snowflake unquoted identifiers are normalized to uppercase.
This example demonstrates how the identifier Foo
is normalized to lowercase foo
in the default sqlglot dialect and uppercase FOO
in the Snowflake dialect:
import sqlglot
from sqlglot.optimizer.normalize_identifiers import normalize_identifiers
normalize_identifiers("Foo").sql()
# 'foo'
normalize_identifiers("Foo", dialect="snowflake").sql("snowflake")
# 'FOO'
After normalizing the identifiers, all table and column names are qualified so there is no ambiguity about the data sources referenced by the query.
This means that all table references are assigned an alias and all column names are prefixed with their source table’s name.
This example demonstrates qualifying the query SELECT col FROM tbl
, where tbl
contains a single column col
. Note that the table’s schema is passed as an argument to qualify()
:
import sqlglot
from sqlglot.optimizer.qualify import qualify
schema = {"tbl": {"col": "INT"}}
expression = sqlglot.parse_one("SELECT col FROM tbl")
qualify(expression, schema=schema).sql()
# 'SELECT "tbl"."col" AS "col" FROM "tbl" AS "tbl"'
The qualify
rule also offers a set of sub-rules that further canonicalize the query.
For instance, qualify()
can expand stars such that each SELECT *
is replaced by a projection list of the selected columns.
This example modifies the previous example’s query to use SELECT *
. qualify()
expands the star and returns the same output as before:
import sqlglot
from sqlglot.optimizer.qualify import qualify
schema = {"tbl": {"col": "INT"}}
expression = sqlglot.parse_one("SELECT * FROM tbl")
qualify(expression, schema=schema).sql()
# 'SELECT "tbl"."col" AS "col" FROM "tbl" AS "tbl"'
Some SQL operators’ behavior changes based on the data types of their inputs.
For example, in some SQL dialects +
can be used for either adding numbers or concatenating strings. The database uses its knowledge of column data types to determine which operation should be executed in a specific situation.
Like the database, SQLGlot needs to know column data types to perform some optimizations. In many cases, both transpilation and optimization need type information to work properly.
Type annotation begins with user-provided information about column data types for at least one table. SQLGlot then traverses the AST, propagating that type information and attempting to infer the type returned by each AST expression.
Users may need to provide column type information for multiple tables to successfully annotate all expressions in the AST.
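The annotate_types rule implements this propagation. The sketch below assumes an integer column x, so both the addition and its alias should be typed as INT; the exact DataType repr may differ between versions.
import sqlglot
from sqlglot.optimizer.annotate_types import annotate_types

expression = sqlglot.parse_one("SELECT x + 1 AS y FROM t")
annotated = annotate_types(expression, schema={"t": {"x": "INT"}})
annotated.selects[0].type  # expected: a DataType node representing INT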
Column-level lineage (CLL) traces the flow of data from column to column as tables select and modify columns from one another in a SQL codebase.
CLL provides critical information about data lineage that helps in understanding how data is derived, transformed, and used across different parts of a data pipeline or database system.
SQLGlot’s CLL implementation is defined in lineage.py
. It operates on a collection of queries that SELECT
from one another. It assumes that the graph/network of queries forms a directed acyclic graph (DAG) where no two tables SELECT
from each other.
The user must provide:
- A "target" query whose upstream column lineage should be traced
- The queries linking all tables upstream of the target query
- The schemas (column names and types) of the root tables in the DAG
In this example, we trace the lineage of the column traced_col
:
from sqlglot.lineage import lineage
target_query = """
WITH cte AS (
    SELECT
        traced_col
    FROM
        intermediate_table
)
SELECT
    traced_col
FROM
    cte
"""
node = lineage(
column="traced_col",
sql=target_query,
sources={"intermediate_table": "SELECT * FROM root_table"},
schema={"root_table": {"traced_col": "int"}},
)
The lineage()
function’s sql
argument takes the target query, the sources
argument takes a dictionary of source table names and the query that produces each table, and the schema
argument takes the column names/types of the graph’s root tables.
This section describes how SQLGlot traces the lineage in the previous example.
Before lineage can be traced, the following preparatory steps must be carried out:
- Replace the sql query’s table references with their sources’ queries, such that we have a single standalone query.
This example demonstrates how the earlier example’s reference to intermediate_table
inside the CTE is replaced with the query that generated it, SELECT * FROM root_table
, and given the alias intermediate_table
:
WITH cte AS (
SELECT
traced_col
FROM (
SELECT
*
FROM
root_table
) AS intermediate_table
)
SELECT
traced_col
FROM
cte
- Qualify the query produced by the earlier step, expanding * wildcards and qualifying all column references with the corresponding source tables:
WITH cte AS (
SELECT
intermediate_table.traced_col
FROM (
SELECT
root_table.traced_col
FROM
root_table AS root_table
) AS intermediate_table
)
SELECT
cte.traced_col
FROM
cte AS cte
After these operations, the query is canonicalized and the lineage can be traced.
Tracing is a top-down process, meaning that the search starts from cte.traced_col
(outer scope) and traverses inwards to find that its origin is intermediate_table.traced_col
, which in turn is based on root_table.traced_col
.
The search continues until the innermost scope has been visited. At that point all sources
are exhausted, so the schema
is searched to validate that the root table (root_table
in our example) exists and that the target column traced_col
is defined in it.
At each step, the column of interest (i.e., cte.traced_col
, intermediate_table.traced_col
, etc.) is wrapped in a lineage object Node
that is linked to its upstream nodes.
This approach incrementally builds a linked list of Nodes
that traces the lineage for the target column: root_table.traced_col -> intermediate_table.traced_col -> cte.traced_col
.
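A rough way to inspect that chain is to walk the returned node: Node's walk() yields the node itself and every node it links to, although the exact names and ordering depend on the query.
# Continuing from the lineage() call above
for n in node.walk():
    print(n.name)          # the traced column at each step of the chain
    print(n.source.sql())  # the query or table that produced it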
The lineage()
output can be visualized using the node.to_html()
function, which generates an HTML lineage graph using vis.js
.