This document explains how to contribute a new parser to Biome.

Everything starts with writing the grammar for the language. We use a fork of ungrammar, a simple DSL for writing language grammars.

Internally, we have a codegen script that reads the `.ungram` files and generates all the files needed to write a parser.

From the root of this repository, head to the `xtask/codegen` folder. That's where you'll start doing your work.

Let's say you want to create a new language that has the extension `html`. You'll have to follow these instructions:
- Create a new `html.ungram` file in this folder. Add this legend to the `.ungram` file:
```
// This grammar specifies the structure of [LANGUAGE]'s concrete syntax tree.
// It does not specify parsing rules (ambiguities, precedence, etc are out of scope).
// Tokens are processed -- contextual keywords are recognised, compound operators glued.
//
// Legend:
//
//   //          -- comment
//   Name =      -- non-terminal definition
//   'ident'     -- token (terminal)
//   A B         -- sequence
//   A | B       -- alternation
//   A*          -- zero or more repetition
//   (A (',' A)* ','?) -- repetition of node A separated by ',' and allowing a trailing comma
//   (A (',' A)*)      -- repetition of node A separated by ',' without a trailing comma
//   A?          -- zero or one repetition
//   (A)         -- same as A
//   label:A     -- suggested name for field of AST node
```
- Create a new file called `src/html_kinds_src.rs`. This file must define a static `KindsSrc` (see the sketch after this list).
- Create two new crates: `biome_html_syntax` and `biome_html_factory`. Use `cargo new --lib crates/biome_html_syntax`.
- Create a `generated/` folder inside the `src/` folder of the newly created crates.
- Add a new variant to `LanguageKind`, inside the `language_kind.rs` file. The new variant will be `Html`. You'll have to implement all methods and cover the new variant.
- Add a new prefix `html_` to `LANGUAGE_PREFIXES` inside `language_kind.rs`.
- Once you've covered all variants, run the command `cargo codegen grammar`.
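For orientation, here's a rough sketch of what `html_kinds_src.rs` might contain. The exact shape of the `KindsSrc` struct, its field names, and the import path are assumptions based on the existing `*_kinds_src.rs` files in `xtask/codegen`, so mirror one of those files rather than copying this verbatim:

```rust
// A minimal, hypothetical sketch of src/html_kinds_src.rs.
// The import path and field names are assumptions; consult an
// existing *_kinds_src.rs file in xtask/codegen for the real layout.
use crate::kind_src::KindsSrc;

pub const HTML_KINDS_SRC: KindsSrc = KindsSrc {
    // punctuation tokens: (text in the source, name of the generated kind)
    punct: &[("<", "L_ANGLE"), (">", "R_ANGLE"), ("/", "SLASH"), ("=", "EQ")],
    // keywords recognised by the lexer
    keywords: &["html"],
    // literal kinds, e.g. string literals inside attributes
    literals: &["HTML_STRING_LITERAL"],
    // other token kinds
    tokens: &["ERROR_TOKEN", "NEWLINE", "WHITESPACE", "IDENT", "COMMENT"],
    // node kinds, generated from the grammar
    nodes: &["HTML_ROOT", "HTML_ELEMENT", "HTML_ATTRIBUTE", "HTML_BOGUS"],
};
```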
When writing the grammar, follow these conventions:

- All nodes must start with the prefix of the language, e.g. `HtmlSimpleAttribute`.
- Unions of nodes must start with `Any*`, e.g. `AnyHtmlAttribute`.
- Nodes for enclosing syntax errors must contain the word `Bogus`, e.g. `HtmlBogusAttribute`.
- Bogus nodes must be part of a union, e.g. `AnyHtmlAttribute = HtmlSimpleAttribute | HtmlBogusAttribute`.
- Nodes that represent a list must end with the suffix `List`, e.g. `HtmlAttributeList`.
- Lists are never optional. They are mandatory and empty by default, e.g. `HtmlTag = attributes: HtmlAttributeList`.
Once you're finished, you'll have to generate the files from the grammar. Use the command `just gen-grammar`. This command accepts a list of known languages. If you don't pass anything, the command will generate all the available grammars.

```shell
# generate the grammar of the HTML language
just gen-grammar html
# generate the grammars of the HTML and CSS languages
just gen-grammar html css
# generate the grammars of all languages
just gen-grammar
```
There are generally three actors you'll need in order to create a parser in Biome: the lexer, the token source, and the parser.
The lexer is the entity in charge of consuming each character coming from the source code and emitting "tokens". The lexer is the lowest-level primitive of a Biome parser. It usually consumes characters as bytes, but it can also consume characters with different encodings, e.g. UTF-8, UTF-16, etc.

Create a `lexer/mod.rs` inside the parser crate:
```rust
use biome_beta_syntax::BetaSyntaxKind;
use biome_parser::ParseDiagnostic;

pub(crate) struct BetaLexer<'source> {
    /// Source text
    source: &'source str,
    /// The start byte position in the source text of the next token
    position: usize,
    /// The current token
    current_kind: BetaSyntaxKind,
    /// Diagnostics emitted during the parsing phase
    diagnostics: Vec<ParseDiagnostic>,
}
```
Next, import and implement the `Lexer` trait from the `biome_parser` crate:

```rust
impl<'source> Lexer<'source> for BetaLexer<'source> {}
```

The `Lexer` trait requires a number of methods to be implemented.
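Whatever the exact set of required methods is (consult the `Lexer` trait definition in `biome_parser` for the current list), the core job is always the same: look at the bytes at the current position and turn them into a token kind. Here is a hedged sketch of that inner machinery; the `BetaSyntaxKind` variants used here are made up for the example:

```rust
impl<'source> BetaLexer<'source> {
    /// Peeks the byte at the current position without advancing.
    fn current_byte(&self) -> Option<u8> {
        self.source.as_bytes().get(self.position).copied()
    }

    /// Consumes one token and returns its kind.
    /// A minimal sketch: a real lexer also tracks token ranges,
    /// trivia (whitespace/comments), and emits diagnostics.
    fn consume_token(&mut self) -> BetaSyntaxKind {
        match self.current_byte() {
            // end of file: the variant name is an assumption
            None => BetaSyntaxKind::EOF,
            Some(b'<') => {
                self.position += 1;
                BetaSyntaxKind::L_ANGLE // assumed variant
            }
            Some(byte) if byte.is_ascii_whitespace() => {
                while self.current_byte().is_some_and(|b| b.is_ascii_whitespace()) {
                    self.position += 1;
                }
                BetaSyntaxKind::WHITESPACE // assumed variant
            }
            Some(_) => {
                self.position += 1;
                BetaSyntaxKind::ERROR_TOKEN // assumed variant
            }
        }
    }
}
```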
The token source is the second brick needed to create a parser. The token source is a thin layer that wraps the lexer and implements a few functionalities:

- Lookahead: check for possible next tokens/characters without advancing the parsing.
- Re-lexing: a feature that allows the consumption of tokens with a different context.
- Checkpoint: save the current status of the lexing phase, and restore it if a certain parsing logic turns out to be incorrect.
If you want to have lookahead, you need to:

- Wrap your lexer with a `BufferedLexer`.
- Implement `TokenSourceWithBufferedLexer` for your token source.
- Implement `LexerWithCheckpoint` for your lexer.
```rust
use biome_beta_syntax::BetaSyntaxKind;
use biome_parser::lexer::BufferedLexer;
// the exact path of this trait may differ; see biome_parser's token_source module
use biome_parser::token_source::TokenSourceWithBufferedLexer;
use crate::lexer::BetaLexer;

pub(crate) struct BetaTokenSource<'src> {
    lexer: BufferedLexer<BetaSyntaxKind, BetaLexer<'src>>,
}

impl<'source> TokenSourceWithBufferedLexer<BetaLexer<'source>> for BetaTokenSource<'source> {
    fn lexer(&mut self) -> &mut BufferedLexer<BetaSyntaxKind, BetaLexer<'source>> {
        &mut self.lexer
    }
}
```
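To wire things up, the token source typically exposes a constructor that builds the lexer and wraps it. A sketch, assuming `BufferedLexer::new` wraps an inner lexer and where `BetaLexer::from_str` is a hypothetical constructor you'd define on your lexer:

```rust
impl<'src> BetaTokenSource<'src> {
    /// Creates a token source over the given source text.
    pub fn from_str(source: &'src str) -> Self {
        Self {
            // BetaLexer::from_str is an assumed constructor on your lexer
            lexer: BufferedLexer::new(BetaLexer::from_str(source)),
        }
    }
}
```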
A parser is a `struct` that must implement the `Parser` trait. Notably, the struct stores the token source, the parser context, and possible options:
```rust
pub(crate) struct BetaParser<'source> {
    context: ParserContext<BetaSyntaxKind>,
    source: BetaTokenSource<'source>,
    // optional, only if the parser is meant to have some options
    options: BetaParserOptions,
}
```
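The `Parser` trait mostly asks for accessors to the pieces stored in the struct. The associated types and method names below are assumptions based on this pattern; check the trait definition in `biome_parser` for the authoritative signatures:

```rust
// A hedged sketch of the Parser implementation; method and type names
// are assumptions, consult biome_parser's Parser trait for the real ones.
impl<'source> Parser for BetaParser<'source> {
    type Kind = BetaSyntaxKind;
    type Source = BetaTokenSource<'source>;

    fn context(&self) -> &ParserContext<Self::Kind> {
        &self.context
    }

    fn context_mut(&mut self) -> &mut ParserContext<Self::Kind> {
        &mut self.context
    }

    fn source(&self) -> &Self::Source {
        &self.source
    }

    fn source_mut(&mut self) -> &mut Self::Source {
        &mut self.source
    }
}
```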
This is a short, or not so short, guide to implementing parse rules using the Biome parser infrastructure.

The convention is to prefix your parse rule with `parse_` and then use the name defined in the grammar file. For example, `parse_for_statement` or `parse_expression`.
Most parse rules take a `&mut` reference to the parser as their only parameter and return a `ParsedSyntax`:

```rust
fn parse_rule_name(p: &mut Parser) -> ParsedSyntax {}
```
You're free to add additional parameters to your function if needed. There are rare cases where you want to consider returning a `ConditionalParsedSyntax`, as explained in the conditional syntax section below.
Let's assume you want to parse the JS `if` statement:

```
JsIfStatement =
    'if'
    '('
    test: JsAnyExpression
    ')'
    consequent: JsBlockStatement
    else_clause: JsElseClause?
```
Now, the parsing function must first test if the parser is positioned at an `if` statement and return `Absent` if that's not the case:

```rust
if !p.at(T![if]) {
    return ParsedSyntax::Absent;
}
```
Why return `ParsedSyntax::Absent`? The function must return `ParsedSyntax::Absent` if the rule can't predict by the next token(s) whether they form the expected node or not. Doing so allows the calling rule to decide if this is an error and perform error recovery if necessary. The second reason is to ensure that the rule doesn't return a node where all children are missing.
Your rule implementation may want to consider more than just the first child to determine if it can parse at least some of the expected children. For example, the if statement rule could test if the parser is located at an `else` clause and then create an `if` statement where all children are missing except the `else` clause:

```rust
if !p.at(T![if]) && !p.at(T![else]) {
    return Absent;
}
```
Your implementation can also call into another parsing rule if the first child is a node and not a token:

```rust
let assignment_target = parse_assignment_target(p);

if assignment_target.is_absent() {
    return Absent;
}

let my_node = assignment_target.precede_or_missing();
```

But be careful with calling other rules. Your rule mustn't progress the parser, meaning it can't advance in the parsing process and consume tokens, if it returns `Absent`.
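If a rule has to consume tokens before it can tell whether the node is actually present, it can take a checkpoint first and rewind on failure. A sketch, assuming the checkpoint/rewind API that the parser infrastructure exposes:

```rust
// A sketch: speculative parsing that restores the parser state on failure.
// `parse_possible_prefix` is a hypothetical helper for the example.
let checkpoint = p.checkpoint();

if parse_possible_prefix(p).is_absent() {
    // undo any tokens consumed while speculating, then signal Absent
    p.rewind(checkpoint);
    return Absent;
}
```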
The parse rules will guide you in how to write your implementation, and the parser infrastructure provides the following convenience APIs:

- Optional token `'ident'?`: use `p.eat(token)`. It eats the next token if it matches the passed-in token.
- Required token `'ident'`: use `p.expect(token)`. It eats the next token if it matches the passed-in token. It adds an `Expected 'x' but found 'y' instead` error and a missing marker if the token isn't present in the source code.
- Optional node `body: JsBlockStatement?`: use `parse_block_statement(p).ok()`. It parses the block if it is present in the source code and adds a missing marker if it isn't.
- Required node `body: JsBlockStatement`: use `parse_block_statement(p).or_add_diagnostic(p, error_builder)`. It parses the block statement if it is present in the source code and adds a missing marker and an error if not.
Using the above-described rules results in the following implementation for the `if` statement rule:

```rust
fn parse_if_statement(p: &mut Parser) -> ParsedSyntax {
    if !p.at(T![if]) {
        return Absent;
    }

    let m = p.start();
    p.expect(T![if]);
    p.expect(T!['(']);
    parse_any_expression(p).or_add_diagnostic(p, js_parse_error::expected_if_statement);
    p.expect(T![')']);
    parse_block_statement(p).or_add_diagnostic(p, js_parse_error::expected_block_statement);
    // the else clause is optional, handle the marker by using `ok`
    parse_else_clause(p).ok();

    Present(m.complete(p, JS_IF_STATEMENT))
}
```
Hold on, what are these missing markers? Biome's AST facade uses fixed offsets to retrieve a particular child from a node. For example, the third child of the if statement is the condition. However, the condition would become the second element if the opening parenthesis `(` isn't present in the source text. That's where missing elements come into play.
When implementing error recovery, keep two principles in mind:

- Performance-neutral error recovery: implement an error recovery mechanism that does not degrade the parsing performance of valid code. This ensures that the parser remains efficient while being more forgiving of errors. We can try to check if the next token is a valid item for a list (e.g. we can use `is_at_item` to check whether we are dealing with a missing end-of-list token); however, doing that instead of merely checking for an end-of-list token does introduce a performance consideration, especially in well-formed documents where syntax errors are rare.
- Preservation of valid tree structure: modify the parser to retain as much information from the valid parts of the tree as possible. Even when encountering invalid parts, the parser should mark them as bogus rather than invalidating the parent node. This approach minimizes the loss of useful information due to isolated syntax errors.
Parsing lists is different from parsing single elements with a fixed set of children because it requires looping until the parser reaches a terminal token (or the end of the file).

You may remember that `parse_*` methods shouldn't progress parsing if they return `Absent`. Not progressing the parser is problematic inside `while` loops because it inevitably results in an infinite loop.

That's why you must do error recovery when parsing lists. Luckily, the parser comes with the infrastructure to make error recovery a piece of cake: the general structure for parsing a list is something the parser infrastructure provides for you.

Let's try to parse an array, `[ 1, 3, 6 ]`. We will use `ParseSeparatedList` in order to achieve that:
```rust
struct ArrayElementsList;

impl ParseSeparatedList for ArrayElementsList {
    type ParsedElement = CompletedMarker;

    fn parse_element(&mut self, p: &mut Parser) -> ParsedSyntax<Self::ParsedElement> {
        parse_array_element(p)
    }

    fn is_at_list_end(&self, p: &mut Parser) -> bool {
        p.at(T![']'])
    }

    fn recover(
        &mut self,
        p: &mut Parser,
        parsed_element: ParsedSyntax<Self::ParsedElement>,
    ) -> parser::RecoveryResult {
        parsed_element.or_recover(
            p,
            &ParseRecoveryTokenSet::new(JS_BOGUS_EXPRESSION, EXPR_RECOVERY_SET),
            js_parse_error::expected_expression,
        )
    }
}
```
Let's run through this step by step:

```rust
parsed_element.or_recover(
    p,
    &ParseRecoveryTokenSet::new(JS_BOGUS_EXPRESSION, EXPR_RECOVERY_SET),
    js_parse_error::expected_expression,
)
```
The `or_recover` performs an error recovery if the `parse_array_element` method returns `Absent`; there's no array element in the source text. The recovery eats all tokens until it finds one of the tokens specified in the `token_set`, a line break (if you called `enable_recovery_on_line_break`), or the end of the file.

The recovery doesn't throw the tokens away but instead wraps them inside a `JS_BOGUS_EXPRESSION` node (first parameter). There exist multiple `BOGUS_*` nodes. You must consult the grammar to understand which `BOGUS_*` node is supported in your case.

You usually want to include the terminal token ending your list, the element separator token, and the token terminating a statement in your recovery set.
Now, the problem with recovery is that it can fail, and there are two reasons:

- the parser reached the end of the file;
- the next token is one of the tokens specified in the recovery set, meaning there is nothing to recover from.

In these cases, the `ParseSeparatedList` and `ParseNodeList` will recover the parser for you.
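With the list struct in place, the parse rule for the list itself usually just instantiates it and hands over control. A sketch, assuming the `parse_list` entry point provided by the list traits:

```rust
// A sketch: invoking the separated-list machinery from a parse rule.
// `parse_list` is the assumed entry point provided by ParseSeparatedList.
fn parse_array_elements(p: &mut Parser) -> CompletedMarker {
    ArrayElementsList.parse_list(p)
}
```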
The conditional syntax allows you to express that some syntax may not be valid in all source files. Some use cases are:

- syntax that is only supported in strict or sloppy mode: for example, `with` statements are not valid when a JavaScript file uses `"use strict"` or is a module;
- syntax that is only supported in certain file types: TypeScript, JSX, modules;
- syntax that is only available in specific language versions: experimental features, different versions of the language (e.g. ECMA versions for JavaScript).

The idea is that the parser always parses the syntax, regardless of whether it is supported in this specific file or context. The main motivation behind doing so is that this gives us perfect error recovery and allows us to use the same code regardless of whether the syntax is supported. However, conditional syntax must be handled because we want to add a diagnostic if the syntax isn't supported for the current file, and the parsed tokens must be attached somewhere.
Let's have a look at the `with` statement that is only allowed in loose mode/sloppy mode:

```rust
fn parse_with_statement(p: &mut Parser) -> ParsedSyntax {
    if !p.at(T![with]) {
        return Absent;
    }

    let m = p.start();
    p.bump(T![with]); // with
    parenthesized_expression(p).or_add_diagnostic(p, js_parse_error::expected_parenthesized_expression);
    parse_statement(p).or_add_diagnostic(p, js_parse_error::expected_statement);
    let with_stmt = m.complete(p, JS_WITH_STATEMENT);

    let conditional = StrictMode.excluding_syntax(p, with_stmt, |p, marker| {
        p.err_builder("`with` statements are not allowed in strict mode", marker.range(p))
    });
}
```
The start of the rule is the same as for any other rule. The exciting bits start with:

```rust
let conditional = StrictMode.excluding_syntax(p, with_stmt, |p, marker| {
    p.err_builder("`with` statements are not allowed in strict mode", marker.range(p))
});
```
`StrictMode.excluding_syntax` converts the parsed syntax to a bogus node and uses the diagnostic builder to create a diagnostic if the feature is not supported.

You can convert the `ConditionalParsedSyntax` to a regular `ParsedSyntax` by calling `or_invalid_to_bogus`, which wraps the whole parsed `with` statement in a `BOGUS` node if the parser is in strict mode and otherwise returns the unchanged `with` statement.
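Continuing the `with` example, the rule can then finish like this (a sketch, assuming the grammar defines a matching bogus node for statements):

```rust
// A sketch: convert the conditional syntax back into a regular ParsedSyntax,
// wrapping the statement in a bogus node when it's invalid in the current mode.
conditional.or_invalid_to_bogus(p)
```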
What if there's no `BOGUS` node matching the node of your parse rule? You must then return the `ConditionalParsedSyntax` without making the `or_invalid_to_bogus` recovery. It's then up to the caller to recover the potentially invalid syntax.
In summary:

- Parse rules are named `parse_rule_name`.
- The parse rules should return a `ParsedSyntax`.
- The rule must return `Present` if it consumes any token and, therefore, can parse the node with at least some of its children.
- It returns `Absent` otherwise and must not progress parsing nor add any errors.
- Lists must perform error recovery to avoid infinite loops.
- Consult the grammar to identify the `BOGUS` node that is valid in the context of your rule.