Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Dialect level precedence, update Postgres Dialect to match Postgres #1360

Merged
merged 5 commits into from
Aug 6, 2024

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Aug 1, 2024

This is ugly but working, I wanted to get feedback before proceeding. I think this is now in a good state, definitely ready to review.

This solves #814 (comment), e.g. it gives the right AST for SELECT foo->'bar'='spam' etc.

But before I go further, I want to confirm a few things?

Can we update precedence globally?

The claim at

https://github.com/sqlparser-rs/sqlparser-rs/blob/cc13841a370190df00cfc623593453054ac187e9/src/parser/mod.rs#L2972

Isn't really true at all, sqlparser-rs doesn't use the lexical precedence from PostgreSQL 7, even if it did, that's extremely old.

Could we just update those values to reference a more recent version?

I assume not since that order is presumably a not-too-bad approximation for lots of databases, and changing it would be a big breaking change

Can we even update PostgresSQL dialect

Presumably people are relying on the current logic, will you even accept a PR to update PostgreSqlDialect even if it's wrong now?

Or should a create a RecentPostgreSqlDialect as a new dialect?

What about operators that don't work in some dialects?

Postgres doesn't actually support XOR, but there are tests asserting that it works, do you want to keep supporting it with an invented precedence, or cause an error?

Can I move precedence numbers to the Dialect trait?

The Precedence enum shown here is pretty ugly, would you mind me moving the precedence stuff, including the default implementation of get_next_precedence to the Dialect trait?

@samuelcolvin samuelcolvin marked this pull request as ready for review August 1, 2024 15:50
@samuelcolvin
Copy link
Contributor Author

The above questions stand, but I think this is in a state where it could at least be reviewed. @alamb could you kick off tests.

I'm going to use it in Logfire via a git ref now.

@@ -300,13 +304,172 @@ pub trait Dialect: Debug + Any {
// return None to fall back to the default behavior
None
}

/// Get the precedence of the next token. This "full" method means all precedence logic and remain
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to Dialect so dialects can change the precedence without all the numeric constants being shared between modules.

it's implemented as a separate function from get_next_precedence:

  1. To avoid breaking the API
  2. To allow dialects to customise precedence but optionally fall back to the default behaviour, as demonstrated Snowflake.


/// The following precedence values are used directly by `Parse` or in dialects,
/// so have to be made public by the dialect.
fn prec_double_colon(&self) -> u8 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are the precedence values that either Parser needs to know about, or are used by implementations of the trait, e.g. prec_double_colon is used by Snowflake.

}

// Define the lexical Precedence of operators.
//
// Uses (APPROXIMATELY) <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement really isn't true, hence I added APPROXIMATELY.

We could rewrite to "was originally inspired by" or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine

Comment on lines +31 to +32
// there's no XOR operator in PostgreSQL, but support it here to avoid breaking tests
const XOR_PREC: u8 = 75;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to do about this, if we remove XOR logic from Postgres, one tests fails, but I guess people might well be using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think we should just leave it in unless there is a compelling reason to take it out

@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

I started the tests @samuelcolvin

I haven't had a chance to review this PR yet, but the basic idea of chaning to follow postgres more closely, but giving people the ability to use the old behavior if they want (via a custom Dialect trait), makes sense to me.

Maybe @iffyio @jmhain or @lovasoa have an opinion they would like to share

@coveralls
Copy link

coveralls commented Aug 1, 2024

Pull Request Test Coverage Report for Build 10205949279

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 213 of 302 (70.53%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 88.992%

Changes Missing Coverage Covered Lines Changed/Added Lines %
tests/sqlparser_postgres.rs 89 96 92.71%
src/dialect/mod.rs 51 91 56.04%
src/dialect/postgresql.rs 48 90 53.33%
Totals Coverage Status
Change from base Build 10169333092: -0.2%
Covered Lines: 27478
Relevant Lines: 30877

💛 - Coveralls

@samuelcolvin
Copy link
Contributor Author

Currently it's the other way around: default behaviour is unchanged, PostgreSQL dialect us changed to match modern postgres, happy to change.

@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

Currently it's the other way around: default behaviour is unchanged, PostgreSQL dialect us changed to match modern postgres, happy to change.

Yes, sorry I meant that the default behavior of PostgresDialect was changed. I think this PR's structure makes sense -- there is no need to change in my opinion

@samuelcolvin
Copy link
Contributor Author

Hi all, I'd love to get this merged - having better compatibility with postgreSQL is pretty important to use at Pydantic Logfire.

We've been trying this for a few days on Logfire via a branch based of the released version of sqlparser-rs (v0.49.0) and haven't discovered any issues with it.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

Thanks for the ping @samuelcolvin -- reviewing now

@alamb alamb changed the title Postgres lexical precedence Support Dialect level precedence, update Postgres Dialect to match Postgres Aug 6, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin -- I think this looks great to me. I have some ideas on how to improve the documentation, but will make those in a follow on PR

I also updated the PR description to try and highlight the movement of precedence to the Dialect

}

// Define the lexical Precedence of operators.
//
// Uses (APPROXIMATELY) <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine

Comment on lines +31 to +32
// there's no XOR operator in PostgreSQL, but support it here to avoid breaking tests
const XOR_PREC: u8 = 75;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think we should just leave it in unless there is a compelling reason to take it out

@alamb alamb merged commit a5480ae into apache:main Aug 6, 2024
10 checks passed
Copy link
Contributor

@lovasoa lovasoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late review

Comment on lines +4488 to +4498
projection: vec![SelectItem::UnnamedExpr(Expr::BinaryOp {
left: Box::new(Expr::BinaryOp {
left: Box::new(Expr::Identifier(Ident {
value: "foo".to_string(),
quote_style: None,
})),
op: arrow_operator,
right: Box::new(Expr::Value(Value::SingleQuotedString("bar".to_string()))),
}),
op: BinaryOperator::Eq,
right: Box::new(Expr::Value(Value::SingleQuotedString("spam".to_string()))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could avoid the verbosity by using verified_expr instead of verified_stmt here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a great idea -- I made a PR with this change: #1367

Comment on lines +88 to +181
fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> {
let token = parser.peek_token();
debug!("get_next_precedence() {:?}", token);

let precedence = match token.token {
Token::Word(w) if w.keyword == Keyword::OR => OR_PREC,
Token::Word(w) if w.keyword == Keyword::XOR => XOR_PREC,
Token::Word(w) if w.keyword == Keyword::AND => AND_PREC,
Token::Word(w) if w.keyword == Keyword::AT => {
match (
parser.peek_nth_token(1).token,
parser.peek_nth_token(2).token,
) {
(Token::Word(w), Token::Word(w2))
if w.keyword == Keyword::TIME && w2.keyword == Keyword::ZONE =>
{
AT_TZ_PREC
}
_ => self.prec_unknown(),
}
}

Token::Word(w) if w.keyword == Keyword::NOT => match parser.peek_nth_token(1).token {
// The precedence of NOT varies depending on keyword that
// follows it. If it is followed by IN, BETWEEN, or LIKE,
// it takes on the precedence of those tokens. Otherwise, it
// is not an infix operator, and therefore has zero
// precedence.
Token::Word(w) if w.keyword == Keyword::IN => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::BETWEEN => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::LIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::ILIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::RLIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::REGEXP => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::SIMILAR => BETWEEN_LIKE_PREC,
_ => self.prec_unknown(),
},
Token::Word(w) if w.keyword == Keyword::IS => IS_PREC,
Token::Word(w) if w.keyword == Keyword::IN => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::BETWEEN => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::LIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::ILIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::RLIKE => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::REGEXP => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::SIMILAR => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::OPERATOR => BETWEEN_LIKE_PREC,
Token::Word(w) if w.keyword == Keyword::DIV => MUL_DIV_MOD_OP_PREC,
Token::Word(w) if w.keyword == Keyword::COLLATE => COLLATE_PREC,
Token::Eq
| Token::Lt
| Token::LtEq
| Token::Neq
| Token::Gt
| Token::GtEq
| Token::DoubleEq
| Token::Tilde
| Token::TildeAsterisk
| Token::ExclamationMarkTilde
| Token::ExclamationMarkTildeAsterisk
| Token::DoubleTilde
| Token::DoubleTildeAsterisk
| Token::ExclamationMarkDoubleTilde
| Token::ExclamationMarkDoubleTildeAsterisk
| Token::Spaceship => EQ_PREC,
Token::Caret => CARET_PREC,
Token::Plus | Token::Minus => PLUS_MINUS_PREC,
Token::Mul | Token::Div | Token::Mod => MUL_DIV_MOD_OP_PREC,
Token::DoubleColon => DOUBLE_COLON_PREC,
Token::LBracket => BRACKET_PREC,
Token::Arrow
| Token::LongArrow
| Token::HashArrow
| Token::HashLongArrow
| Token::AtArrow
| Token::ArrowAt
| Token::HashMinus
| Token::AtQuestion
| Token::AtAt
| Token::Question
| Token::QuestionAnd
| Token::QuestionPipe
| Token::ExclamationMark
| Token::Overlap
| Token::CaretAt
| Token::StringConcat
| Token::Sharp
| Token::ShiftRight
| Token::ShiftLeft
| Token::Pipe
| Token::Ampersand
| Token::CustomBinaryOperator(_) => PG_OTHER_PREC,
_ => self.prec_unknown(),
};
Some(Ok(precedence))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear a little bit for the added maintenance burden here. What do you think @alamb ?

There are still many SQL tokens to add to sqlparser, and there is a chance people will add them only in the generic dialect, leaving postgres silently broken... Isn't there ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think dialect divergence is a possibly and likely made more so by this PR

In fact, in some ways the point of this PR is to start to diverge the dialects with more accurate precedence rules 🤔

The only way I know to guard against this divergence is with test coverage for all new features (which we are pretty good about in general).

Do you have any other thoughts about how to improve the situation @lovasoa ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could change postgres' get_next_precedence to just call the generic dialect's get_next_precedence and "fixing up" the result. This way, we remove the duplication, and when we change the generic dialect, the postgres version reflects the changes automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> {
    if some_particular_case { return some_particular_precedence }
    GenericDialect.get_next_precedence(parser)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lovasoa I agree it's unfortunate, there's so much duplication in the the postgres dialect.

However your GenericDialect.get_next_precedence(parser) solution don't help, we can already return None from get_next_precedence to fallback to the default behaviour.


The reason the Postgres dialect is completely reimplementing the logic is that it has to change the precedence values for virtually every token to match Postgres.

We can either:

  • implement ~15 more methods on the trait to get every single precedence value
  • or, add another abstraction on top of Token and replace all the prec_* methods with a single get_prec(&self, precedence_level: PrecedenceLevel)

These would avoid the need to duplicate the logic in Dialect::get_next_precedence_default, but would introduce (albeit less) duplication elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in #1379.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2024

Follow on PR to update comments #1366

@@ -67,6 +85,102 @@ impl Dialect for PostgreSqlDialect {
)
}

fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> {
Copy link
Contributor

@lovasoa lovasoa Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @samuelcolvin : wouldn't it be better to have a Result<Option<u8>> instead of an Option<Result<u8>> ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would:

  • differ from the signature of Iterator::next()
  • be a breaking change to the API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think the transpose() method can be used to convert Option<Result<..>> to Result<Option<..>>

https://doc.rust-lang.org/std/option/enum.Option.html#method.transpose

I don't have any particular preference for one over the other

ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants