Make Literals foldable, ensure Parquet predicates pushdown #721
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #721      +/- ##
==========================================
+ Coverage   95.52%   95.56%   +0.03%
==========================================
  Files          67       67
  Lines        1184     1172      -12
  Branches       39       41       +2
==========================================
- Hits         1131     1120      -11
+ Misses         53       52       -1
```
nb the patch check is due to previously untested code being seen as new code because of the "," added :)
Thanks for an awesome PR, I added a couple of comments, all test-case related. 🎉

Usually those optimization rules are injected via config, so I think it also makes sense to add an injector, letting users specify it in the Spark config (i.e. in the Databricks cluster config):

```scala
config.set("spark.sql.extensions", classOf[FramelessPushdownOptimizations].getName)
```

And we need docs!

I can help you with some of those aspects if needed, and I definitely don't want to step on your toes.
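For illustration, a minimal sketch of what such an extensions entry point could look like. The class name `FramelessPushdownOptimizations` matches the config line above, but the no-op rule body and the rest are assumptions, not the actual frameless code:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: a real implementation would rewrite frameless literal
// wrappers so the Parquet predicate pushdown can see foldable values.
object NoOpPushdownRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Entry point for spark.sql.extensions: Spark instantiates this class by name
// while building the session and calls apply(), letting us inject the rule.
class FramelessPushdownOptimizations extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectOptimizerRule(_ => NoOpPushdownRule)
}
```

This is a registration fragment and needs a Spark runtime on the classpath to do anything.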
re Databricks cluster config: it requires an uber (and probably shaded) jar, effectively making frameless part of Databricks. I do exactly this in Quality, and it isn't so straightforward. Frameless could offer a shaded/uber jar for this purpose, but for the number of users I think it's better left in their hands, documenting that they must provide the build dependencies in that case. The experimental (for like 4 years now, I think) optimiser route is definitely easier to integrate. If more optimisations get added it could be worthwhile. I did think of one whilst doing tests - struct pushdowns: you'd have to unpack each struct field for an entire tree to do comparisons, and maps wouldn't work either. That's definitely fun code as well.
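For contrast, the experimental route mentioned above can be sketched as follows; the rule object is a placeholder, not frameless code. Rules added this way run in a late optimizer batch, which is part of why they can fire too late for the cast interplay discussed next:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder optimizer rule standing in for the frameless pushdown rule.
object NoOpPushdownRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// ExperimentalMethods.extraOptimizations is a plain mutable Seq on the
// session; appending here needs no shaded jar or spark.sql.extensions config.
spark.experimental.extraOptimizations =
  spark.experimental.extraOptimizations :+ NoOpPushdownRule
```

This is a config-style fragment; it requires a running Spark session to take effect.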
So whilst most types will work with the experimental approach, structs don't, and I assume others will also not: the experimental rules don't occur early enough - a painful lesson learnt on Quality. (I added verification tests for this; the reason is that the Literal swap happens after the cast from X1[X4.. to struct would occur, and the cast simplification rules aren't yet hit.) So, docs-wise, would you prefer a YMMV caveat on the use of experimental and more time on the extension as the preferred route? Similarly, are you OK with linking to the Quality docs on how to register on Databricks, or would you prefer a copy+paste?
I think we should document our optimization rules, with usage examples -- via config and injector, and by manually appending rules into the context.
I think it should work; something else is off. If you replace:

```scala
case class Inner(a: Int, b: Int, c: Int, d: Int)
case class Test(a: Inner)
```

The output will be:

```
== Parsed Logical Plan ==
Filter (a#9.a > 0)
+- Relation [a#9] parquet

== Analyzed Logical Plan ==
a: struct<a:int,b:int,c:int,d:int>
Filter (a#9.a > 0)
+- Relation [a#9] parquet

== Optimized Logical Plan ==
Filter (isnotnull(a#9.a) AND (a#9.a > 0))
+- Relation [a#9] parquet

== Physical Plan ==
*(1) Filter (isnotnull(a#9.a) AND (a#9.a > 0))
+- *(1) ColumnarToRow
   +- FileScan parquet [a#9] Batched: true, DataFilters: [isnotnull(a#9.a), (a#9.a > 0)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/.../subversions/git/github/.../framele..., PartitionFilters: [], PushedFilters: [IsNotNull(a.a), GreaterThan(a.a,0)], ReadSchema: struct<a:struct<a:int,b:int,c:int,d:int>>
```
That's not the same test though; that's a single nested field being compared.
@chris-twiner oh, yea that's the difference! I don't believe pushdown works for non-primitives. |
perhaps not generally, but they are possible to push down: in the test, the source.EqualTo is there when using the extension - not with experimental.
@chris-twiner yes, EqualTo pushdown is supported, but not for all of its argument types in the case of the Parquet format. So if there is some other format that supports non-primitive pushdown, it can be pushed. In the optimized-plan step the optimizer applies custom rules, and if those expanded/rewritten predicates are supported for the pushdown, they will be pushed. TBH (that's about naming), I think those two features (manual rules appending and injection via an injector) have been considered experimental since Spark 2.x (I don't know the specific version) :D But I could be wrong^
it's because the values aren't foldable:

```scala
override val foldable: Boolean = true // catalystExpr.foldable
```

In 3.3 InvokeLike introduced:

```scala
override def foldable: Boolean =
  children.forall(_.foldable) && deterministic && trustedSerializable(dataType)
```

so it's missing from 3.2. That stops timestamp and instant. isnull is foldable, and null is foldable, but named_struct is not because of the invokes. So the rule itself isn't needed; it's equivalent to foldable = true.
@chris-twiner yea, let's make it just true in this case 👍 good to know. Let's add a link to your comments into the code comments so we don't forget the decision made!
done. I guess GitHub has had enough of this for a bit though; the CI isn't running.
🔥
🔥
(upd: idk why there are two reviews)
Per the text in SPARK-40380, this is probably not correct, as it'd affect sparksql-scalapb, a user of frameless. Rather than break out another source-compatibility layer, I'll look at a compile-time elide or similar and sub in a different solution for 3.2. The question is: should the solution just accept failure on 3.2 (and document that predicate pushdown won't work on 3.2), or actually provide a backport of SPARK-40380?
@chris-twiner could you tell me more about how that issue is related to this PR, and to sparksql-scalapb?
sorry, I was trying to create an example. With foldable always true, Invokes could create unserializable expressions (ObjectType results). These would be fine when not folded, but folding will cause the query to fail, as the expression in the Literal would be non-serializable. 3.3.1 and above "fixes" that with this code in InvokeLike:

```scala
override def foldable: Boolean =
  children.forall(_.foldable) && deterministic && trustedSerializable(dataType)

// Returns true if we can trust all values of the given DataType can be serialized.
private def trustedSerializable(dt: DataType): Boolean = {
  // Right now we conservatively block all ObjectType (Java objects) regardless of
  // serializability, because the type-level info with java.io.Serializable and
  // java.io.Externalizable marker interfaces are not strong guarantees.
  // This restriction can be relaxed in the future to expose more optimizations.
  !dt.existsRecursively(_.isInstanceOf[ObjectType])
}
```

so by not calling catalystExpr.foldable we risk introducing an unserializable ObjectType and stopping the query from running at all. The example code in that JIRA will "work" in 3.2.4 but fail with a serialization error in more recent Spark versions. It's that exact failure that is stopped in folding by the above code snippet.
@chris-twiner we could backport this function:

```scala
// https://github.com/typelevel/frameless/pull/721
// TODO: remove with the Spark 3.2 support drop
def isFoldableExpressionCompat(expr: Expression): Boolean = {
  // Returns true if we can trust all values of the given DataType can be serialized.
  def trustedSerializable(dt: DataType): Boolean = {
    // Right now we conservatively block all ObjectType (Java objects) regardless of
    // serializability, because the type-level info with java.io.Serializable and
    // java.io.Externalizable marker interfaces are not strong guarantees.
    // This restriction can be relaxed in the future to expose more optimizations.
    !dt.existsRecursively(_.isInstanceOf[ObjectType])
  }
  expr.children.forall(_.foldable) && expr.deterministic && trustedSerializable(expr.dataType)
}
```

and then on the user side it is something like:

```scala
// in Lit.scala
// https://github.com/typelevel/frameless/pull/721
// TODO: replace with catalystExpr.foldable with the Spark 3.2 drop
override val foldable: Boolean = FramelessInternals.isFoldableExpressionCompat(catalystExpr)
```
That will fail on 3.2, though. It would need to be a bottom-up check with that logic called for InvokeLike.
Oh you're right. |
Let's still add this backport for the 3.2/3.3+ compat (we need that foldable check), but we'll leave a note about this issue; imo that's a Spark problem.
agreed, I've updated the comment on that and disabled the tests for 3.2. |
LGTM! Left a couple of suggestions around the dir names!
🔥 Quite a PR
indeed, but a fun journey
The rule actually works at this plan level, so no extension is required. At the time of writing, 3.x should support the getPushDowns approach; 3.5 snapshots currently do, at least.
I've also added a dependency on naked fs, to allow Windows dev without winutils. The StreamingFS class should go away with #5 on naked fs.
Closes #343