Support `Dialect` level precedence, update Postgres `Dialect` to match Postgres #1360

samuelcolvin · 2024-08-01T00:20:00Z

~~This is ugly but working, I wanted to get feedback before proceeding.~~ I think this is now in a good state, definitely ready to review.

This solves #814 (comment), e.g. it gives the right AST for SELECT foo->'bar'='spam' etc.

But before I go further, I want to confirm a few things?

Can we update precedence globally?

The claim at

https://github.com/sqlparser-rs/sqlparser-rs/blob/cc13841a370190df00cfc623593453054ac187e9/src/parser/mod.rs#L2972

Isn't really true at all, sqlparser-rs doesn't use the lexical precedence from PostgreSQL 7, even if it did, that's extremely old.

Could we just update those values to reference a more recent version?

I assume not since that order is presumably a not-too-bad approximation for lots of databases, and changing it would be a big breaking change

Can we even update `PostgresSQL` dialect

Presumably people are relying on the current logic, will you even accept a PR to update PostgreSqlDialect even if it's wrong now?

Or should a create a RecentPostgreSqlDialect as a new dialect?

What about operators that don't work in some dialects?

Postgres doesn't actually support XOR, but there are tests asserting that it works, do you want to keep supporting it with an invented precedence, or cause an error?

Can I move precedence numbers to the `Dialect` trait?

The Precedence enum shown here is pretty ugly, would you mind me moving the precedence stuff, including the default implementation of get_next_precedence to the Dialect trait?

samuelcolvin · 2024-08-01T15:51:31Z

The above questions stand, but I think this is in a state where it could at least be reviewed. @alamb could you kick off tests.

I'm going to use it in Logfire via a git ref now.

samuelcolvin · 2024-08-01T15:53:53Z

src/dialect/mod.rs

@@ -300,13 +304,172 @@ pub trait Dialect: Debug + Any {
        // return None to fall back to the default behavior
        None
    }
+
+    /// Get the precedence of the next token. This "full" method means all precedence logic and remain


This is moved to Dialect so dialects can change the precedence without all the numeric constants being shared between modules.

it's implemented as a separate function from get_next_precedence:

To avoid breaking the API

To allow dialects to customise precedence but optionally fall back to the default behaviour, as demonstrated Snowflake.

samuelcolvin · 2024-08-01T15:54:58Z

src/dialect/mod.rs

+
+    /// The following precedence values are used directly by `Parse` or in dialects,
+    /// so have to be made public by the dialect.
+    fn prec_double_colon(&self) -> u8 {


these are the precedence values that either Parser needs to know about, or are used by implementations of the trait, e.g. prec_double_colon is used by Snowflake.

samuelcolvin · 2024-08-01T15:55:27Z

src/dialect/mod.rs

 }

+// Define the lexical Precedence of operators.
+//
+// Uses (APPROXIMATELY) <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference


This statement really isn't true, hence I added APPROXIMATELY.

We could rewrite to "was originally inspired by" or something?

I think this is fine

samuelcolvin · 2024-08-01T15:56:32Z

src/dialect/postgresql.rs

+// there's no XOR operator in PostgreSQL, but support it here to avoid breaking tests
+const XOR_PREC: u8 = 75;


I'm not sure what to do about this, if we remove XOR logic from Postgres, one tests fails, but I guess people might well be using it.

yeah I think we should just leave it in unless there is a compelling reason to take it out

alamb · 2024-08-01T20:11:15Z

I started the tests @samuelcolvin

I haven't had a chance to review this PR yet, but the basic idea of chaning to follow postgres more closely, but giving people the ability to use the old behavior if they want (via a custom Dialect trait), makes sense to me.

Maybe @iffyio @jmhain or @lovasoa have an opinion they would like to share

coveralls · 2024-08-01T20:14:27Z

Pull Request Test Coverage Report for Build 10205949279

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

213 of 302 (70.53%) changed or added relevant lines in 5 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.2%) to 88.992%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
tests/sqlparser_postgres.rs	89	96	92.71%
src/dialect/mod.rs	51	91	56.04%
src/dialect/postgresql.rs	48	90	53.33%

Totals
Change from base Build 10169333092:	-0.2%
Covered Lines:	27478
Relevant Lines:	30877

💛 - Coveralls

samuelcolvin · 2024-08-01T20:15:55Z

Currently it's the other way around: default behaviour is unchanged, PostgreSQL dialect us changed to match modern postgres, happy to change.

alamb · 2024-08-01T21:19:18Z

Currently it's the other way around: default behaviour is unchanged, PostgreSQL dialect us changed to match modern postgres, happy to change.

Yes, sorry I meant that the default behavior of PostgresDialect was changed. I think this PR's structure makes sense -- there is no need to change in my opinion

samuelcolvin · 2024-08-06T11:37:43Z

Hi all, I'd love to get this merged - having better compatibility with postgreSQL is pretty important to use at Pydantic Logfire.

We've been trying this for a few days on Logfire via a branch based of the released version of sqlparser-rs (v0.49.0) and haven't discovered any issues with it.

alamb · 2024-08-06T11:43:22Z

Thanks for the ping @samuelcolvin -- reviewing now

alamb

Thank you @samuelcolvin -- I think this looks great to me. I have some ideas on how to improve the documentation, but will make those in a follow on PR

I also updated the PR description to try and highlight the movement of precedence to the Dialect

alamb · 2024-08-06T11:45:24Z

src/dialect/mod.rs

 }

+// Define the lexical Precedence of operators.
+//
+// Uses (APPROXIMATELY) <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference


I think this is fine

alamb · 2024-08-06T11:46:04Z

src/dialect/postgresql.rs

+// there's no XOR operator in PostgreSQL, but support it here to avoid breaking tests
+const XOR_PREC: u8 = 75;


yeah I think we should just leave it in unless there is a compelling reason to take it out

lovasoa

sorry for the late review

lovasoa · 2024-08-06T11:50:14Z

tests/sqlparser_postgres.rs

+                projection: vec![SelectItem::UnnamedExpr(Expr::BinaryOp {
+                    left: Box::new(Expr::BinaryOp {
+                        left: Box::new(Expr::Identifier(Ident {
+                            value: "foo".to_string(),
+                            quote_style: None,
+                        })),
+                        op: arrow_operator,
+                        right: Box::new(Expr::Value(Value::SingleQuotedString("bar".to_string()))),
+                    }),
+                    op: BinaryOperator::Eq,
+                    right: Box::new(Expr::Value(Value::SingleQuotedString("spam".to_string()))),


You could avoid the verbosity by using verified_expr instead of verified_stmt here

It is a great idea -- I made a PR with this change: #1367

lovasoa · 2024-08-06T11:53:41Z

src/dialect/postgresql.rs

+    fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> {
+        let token = parser.peek_token();
+        debug!("get_next_precedence() {:?}", token);
+
+        let precedence = match token.token {
+            Token::Word(w) if w.keyword == Keyword::OR => OR_PREC,
+            Token::Word(w) if w.keyword == Keyword::XOR => XOR_PREC,
+            Token::Word(w) if w.keyword == Keyword::AND => AND_PREC,
+            Token::Word(w) if w.keyword == Keyword::AT => {
+                match (
+                    parser.peek_nth_token(1).token,
+                    parser.peek_nth_token(2).token,
+                ) {
+                    (Token::Word(w), Token::Word(w2))
+                        if w.keyword == Keyword::TIME && w2.keyword == Keyword::ZONE =>
+                    {
+                        AT_TZ_PREC
+                    }
+                    _ => self.prec_unknown(),
+                }
+            }
+
+            Token::Word(w) if w.keyword == Keyword::NOT => match parser.peek_nth_token(1).token {
+                // The precedence of NOT varies depending on keyword that
+                // follows it. If it is followed by IN, BETWEEN, or LIKE,
+                // it takes on the precedence of those tokens. Otherwise, it
+                // is not an infix operator, and therefore has zero
+                // precedence.
+                Token::Word(w) if w.keyword == Keyword::IN => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::BETWEEN => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::LIKE => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::ILIKE => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::RLIKE => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::REGEXP => BETWEEN_LIKE_PREC,
+                Token::Word(w) if w.keyword == Keyword::SIMILAR => BETWEEN_LIKE_PREC,
+                _ => self.prec_unknown(),
+            },
+            Token::Word(w) if w.keyword == Keyword::IS => IS_PREC,
+            Token::Word(w) if w.keyword == Keyword::IN => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::BETWEEN => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::LIKE => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::ILIKE => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::RLIKE => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::REGEXP => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::SIMILAR => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::OPERATOR => BETWEEN_LIKE_PREC,
+            Token::Word(w) if w.keyword == Keyword::DIV => MUL_DIV_MOD_OP_PREC,
+            Token::Word(w) if w.keyword == Keyword::COLLATE => COLLATE_PREC,
+            Token::Eq
+            | Token::Lt
+            | Token::LtEq
+            | Token::Neq
+            | Token::Gt
+            | Token::GtEq
+            | Token::DoubleEq
+            | Token::Tilde
+            | Token::TildeAsterisk
+            | Token::ExclamationMarkTilde
+            | Token::ExclamationMarkTildeAsterisk
+            | Token::DoubleTilde
+            | Token::DoubleTildeAsterisk
+            | Token::ExclamationMarkDoubleTilde
+            | Token::ExclamationMarkDoubleTildeAsterisk
+            | Token::Spaceship => EQ_PREC,
+            Token::Caret => CARET_PREC,
+            Token::Plus | Token::Minus => PLUS_MINUS_PREC,
+            Token::Mul | Token::Div | Token::Mod => MUL_DIV_MOD_OP_PREC,
+            Token::DoubleColon => DOUBLE_COLON_PREC,
+            Token::LBracket => BRACKET_PREC,
+            Token::Arrow
+            | Token::LongArrow
+            | Token::HashArrow
+            | Token::HashLongArrow
+            | Token::AtArrow
+            | Token::ArrowAt
+            | Token::HashMinus
+            | Token::AtQuestion
+            | Token::AtAt
+            | Token::Question
+            | Token::QuestionAnd
+            | Token::QuestionPipe
+            | Token::ExclamationMark
+            | Token::Overlap
+            | Token::CaretAt
+            | Token::StringConcat
+            | Token::Sharp
+            | Token::ShiftRight
+            | Token::ShiftLeft
+            | Token::Pipe
+            | Token::Ampersand
+            | Token::CustomBinaryOperator(_) => PG_OTHER_PREC,
+            _ => self.prec_unknown(),
+        };
+        Some(Ok(precedence))


I fear a little bit for the added maintenance burden here. What do you think @alamb ?

There are still many SQL tokens to add to sqlparser, and there is a chance people will add them only in the generic dialect, leaving postgres silently broken... Isn't there ?

I do think dialect divergence is a possibly and likely made more so by this PR

In fact, in some ways the point of this PR is to start to diverge the dialects with more accurate precedence rules 🤔

The only way I know to guard against this divergence is with test coverage for all new features (which we are pretty good about in general).

Do you have any other thoughts about how to improve the situation @lovasoa ?

I think we could change postgres' get_next_precedence to just call the generic dialect's get_next_precedence and "fixing up" the result. This way, we remove the duplication, and when we change the generic dialect, the postgres version reflects the changes automatically.

Something like

fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> { if some_particular_case { return some_particular_precedence } GenericDialect.get_next_precedence(parser) }

@lovasoa I agree it's unfortunate, there's so much duplication in the the postgres dialect.

However your GenericDialect.get_next_precedence(parser) solution don't help, we can already return None from get_next_precedence to fallback to the default behaviour.

The reason the Postgres dialect is completely reimplementing the logic is that it has to change the precedence values for virtually every token to match Postgres.

We can either:

implement ~15 more methods on the trait to get every single precedence value

or, add another abstraction on top of Token and replace all the prec_* methods with a single get_prec(&self, precedence_level: PrecedenceLevel)

These would avoid the need to duplicate the logic in Dialect::get_next_precedence_default, but would introduce (albeit less) duplication elsewhere.

Implemented in #1379.

alamb · 2024-08-06T12:04:32Z

Follow on PR to update comments #1366

lovasoa · 2024-08-14T12:54:28Z

src/dialect/postgresql.rs

@@ -67,6 +85,102 @@ impl Dialect for PostgreSqlDialect {
        )
    }

+    fn get_next_precedence(&self, parser: &Parser) -> Option<Result<u8, ParserError>> {


@alamb @samuelcolvin : wouldn't it be better to have a Result<Option<u8>> instead of an Option<Result<u8>> ?

that would:

differ from the signature of Iterator::next()

be a breaking change to the API

I also think the transpose() method can be used to convert Option<Result<..>> to Result<Option<..>>

https://doc.rust-lang.org/std/option/enum.Option.html#method.transpose

I don't have any particular preference for one over the other

…h Postgres (apache#1360)

samuelcolvin added 4 commits July 31, 2024 20:27

custom postgreSQL precedence

d904a1b

correct postgres lexical precedence

278571b

simplify precedence logic, add tests

ac53f97

cleanup

f485bca

samuelcolvin marked this pull request as ready for review August 1, 2024 15:50

samuelcolvin commented Aug 1, 2024

View reviewed changes

alamb mentioned this pull request Aug 1, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 apache/datafusion#11710

Closed

8 tasks

fmt

96f1245

alamb mentioned this pull request Aug 5, 2024

DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 apache/datafusion#11826

Closed

6 tasks

alamb changed the title ~~Postgres lexical precedence~~ Support Dialect level precedence, update Postgres Dialect to match Postgres Aug 6, 2024

alamb approved these changes Aug 6, 2024

View reviewed changes

alamb merged commit a5480ae into apache:main Aug 6, 2024
10 checks passed

lovasoa suggested changes Aug 6, 2024

View reviewed changes

alamb mentioned this pull request Aug 6, 2024

Improve comments on Dialect #1366

Merged

This was referenced Aug 14, 2024

rename get_next_precedence_full to get_next_precedence_default #1378

Merged

Simple custom lexical precedence in PostgreSQL dialect #1379

Merged

lovasoa reviewed Aug 14, 2024

View reviewed changes

samuelcolvin mentioned this pull request Aug 29, 2024

We're changing database pydantic/logfire#408

Closed

ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Nov 19, 2024

Support Dialect level precedence, update Postgres Dialect to matc…

46bc5ea

…h Postgres (apache#1360)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `Dialect` level precedence, update Postgres `Dialect` to match Postgres #1360

Support `Dialect` level precedence, update Postgres `Dialect` to match Postgres #1360

samuelcolvin commented Aug 1, 2024 •

edited

Loading

samuelcolvin commented Aug 1, 2024

samuelcolvin Aug 1, 2024

samuelcolvin Aug 1, 2024

samuelcolvin Aug 1, 2024

alamb Aug 6, 2024

samuelcolvin Aug 1, 2024

alamb Aug 6, 2024

alamb commented Aug 1, 2024

coveralls commented Aug 1, 2024 •

edited

Loading

samuelcolvin commented Aug 1, 2024

alamb commented Aug 1, 2024

samuelcolvin commented Aug 6, 2024

alamb commented Aug 6, 2024

alamb left a comment

alamb Aug 6, 2024

alamb Aug 6, 2024

lovasoa left a comment

lovasoa Aug 6, 2024

alamb Aug 6, 2024

lovasoa Aug 6, 2024

alamb Aug 6, 2024

lovasoa Aug 6, 2024

lovasoa Aug 6, 2024

samuelcolvin Aug 14, 2024

samuelcolvin Aug 14, 2024

alamb commented Aug 6, 2024 •

edited

Loading

lovasoa Aug 14, 2024 •

edited

Loading

samuelcolvin Aug 14, 2024

alamb Aug 14, 2024

		// there's no XOR operator in PostgreSQL, but support it here to avoid breaking tests
		const XOR_PREC: u8 = 75;

Support Dialect level precedence, update Postgres Dialect to match Postgres #1360

Support Dialect level precedence, update Postgres Dialect to match Postgres #1360

Conversation

samuelcolvin commented Aug 1, 2024 • edited Loading

Can we update precedence globally?

Can we even update PostgresSQL dialect

What about operators that don't work in some dialects?

Can I move precedence numbers to the Dialect trait?

samuelcolvin commented Aug 1, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Aug 1, 2024

coveralls commented Aug 1, 2024 • edited Loading

Pull Request Test Coverage Report for Build 10205949279

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

samuelcolvin commented Aug 1, 2024

alamb commented Aug 1, 2024

samuelcolvin commented Aug 6, 2024

alamb commented Aug 6, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lovasoa left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Aug 6, 2024 • edited Loading

lovasoa Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Support `Dialect` level precedence, update Postgres `Dialect` to match Postgres #1360

Support `Dialect` level precedence, update Postgres `Dialect` to match Postgres #1360

samuelcolvin commented Aug 1, 2024 •

edited

Loading

Can we even update `PostgresSQL` dialect

Can I move precedence numbers to the `Dialect` trait?

coveralls commented Aug 1, 2024 •

edited

Loading

alamb commented Aug 6, 2024 •

edited

Loading

lovasoa Aug 14, 2024 •

edited

Loading