[SPARK-49960][SQL] Custom ExpressionEncoder support and TransformingEncoder fixes #50023

chris-twiner · 2025-02-20T15:39:00Z

What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 (PR #47785), changes which drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:

@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}

and hooks in the SerializerBuildHelper and DeserializerBuildHelper matches to allow seamless extension of non-connect custom encoders (such as frameless or sparksql-scalapb).

SPARK-49960 provides the same information.

Additionally, this PR provides the fixes necessary to use TransformingEncoder as a root encoder with an OptionEncoder, and as an ArrayType or MapType entry/key.

Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with 4.0.0-preview2 derived encoders, something which has worked, and which devs have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder (as the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, as only the preview2 AgnosticEncoders are supported in SerializerBuildHelper and DeserializerBuildHelper, triggering a MatchError.
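The failure mode and the fix can be sketched with a toy model. This is only an illustration under stated assumptions: `Expr`, `ToyEncoder`, `ToyPathEncoder`, `UpperCaseEncoder` and the two `serializer*` functions are hypothetical stand-ins invented here, not Spark's real classes or the code in this PR. The sketch shows why a closed pattern match over built-in encoders throws a MatchError for a custom encoder, and how one extra case that delegates to the encoder's own `toCatalyst` avoids it.

```scala
// Toy stand-ins only: none of these types are Spark's real classes.
object EncoderHookSketch {
  // Stand-in for a Catalyst Expression: just a rendered string.
  final case class Expr(render: String)

  // Stand-in for AgnosticEncoder; open (not sealed) so third parties can extend it.
  trait ToyEncoder
  case object ToyStringEncoder extends ToyEncoder
  case object ToyIntEncoder extends ToyEncoder

  // Stand-in for AgnosticExpressionPathEncoder: the encoder builds its own expression.
  trait ToyPathEncoder extends ToyEncoder {
    def toCatalyst(input: Expr): Expr
  }

  // Before: only built-in encoders are matched; anything else throws scala.MatchError.
  def serializerBefore(enc: ToyEncoder, input: Expr): Expr = enc match {
    case ToyStringEncoder => Expr(s"utf8(${input.render})")
    case ToyIntEncoder => input
  }

  // After: one extra case hands control to the custom encoder.
  def serializerAfter(enc: ToyEncoder, input: Expr): Expr = enc match {
    case custom: ToyPathEncoder => custom.toCatalyst(input)
    case builtIn => serializerBefore(builtIn, input)
  }

  // Hypothetical third-party encoder.
  object UpperCaseEncoder extends ToyPathEncoder {
    def toCatalyst(input: Expr): Expr = Expr(s"upper(${input.render})")
  }
}
```

In this toy model, removing the `custom: ToyPathEncoder` case brings the MatchError back for `UpperCaseEncoder`, which mirrors how the added test is described as failing when the case statements are removed.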

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the case statements in either BuildHelper will trigger the MatchError.

Was this patch authored or co-authored using generative AI tooling?

No

chris-twiner · 2025-02-20T15:46:52Z

replaces #48477 with TransformingEncoder fixes.

This allows all of the Frameless tests to pass when using the backwards-compatible AgnosticExpressionPathEncoder root, and all tests also pass with the Frameless AgnosticEncoder-based encoder derivation branch.

One ExpressionEncoderSuite test "transforming encoders as value class - Frameless value class as parameter use case" does not work when using a TransformingEncoder over the string field. I'll raise another issue should this be a valid use case and an actual bug.

chris-twiner · 2025-02-20T19:25:47Z

@hvanhovell - per our convo

sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/AgnosticEncoder.scala

hvanhovell · 2025-02-25T16:25:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala

@@ -228,7 +236,8 @@ case class ExpressionEncoder[T](
   * returns true if `T` is serialized as struct and is not `Option` type.
   */
  def isSerializedAsStructForTopLevel: Boolean = {
-    isSerializedAsStruct && !classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass)
+    isSerializedAsStruct && !classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass) &&


isSerializedAsStruct && !transformerOfOption(encoder)?

I am wondering if we should make these checks part of the AgnosticEncoder api.

I think it'd make sense. For the path encoder backwards compat logic, I can embed / document that in the shim. The Builders could embed that. I can take a stab at that post rc2.
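As a rough illustration of the check discussed in this thread, the sketch below models a top-level struct test that looks through transforming wrappers to find an Option. Everything here is an assumption for illustration: the `Enc` hierarchy is a hypothetical stand-in for Spark's encoders, and the bodies of `transformerOfOption` and `isSerializedAsStruct` are guesses at the intent of the names in the review comment, not the code in the PR.

```scala
import scala.annotation.tailrec

// Toy stand-ins only: a hypothetical encoder hierarchy, not Spark's classes.
object TopLevelStructSketch {
  sealed trait Enc
  final case class StructEnc(fields: List[String]) extends Enc
  case object StringEnc extends Enc
  final case class OptionEnc(element: Enc) extends Enc
  // Stand-in for TransformingEncoder: wraps another encoder.
  final case class TransformingEnc(transformed: Enc) extends Enc

  // True when a chain of transforming wrappers ends in an Option encoder.
  @tailrec
  def transformerOfOption(enc: Enc): Boolean = enc match {
    case TransformingEnc(_: OptionEnc) => true
    case TransformingEnc(inner) => transformerOfOption(inner)
    case _ => false
  }

  // Assumption: wrappers do not change whether the payload is a struct.
  @tailrec
  def isSerializedAsStruct(enc: Enc): Boolean = enc match {
    case _: StructEnc => true
    case OptionEnc(element) => isSerializedAsStruct(element)
    case TransformingEnc(inner) => isSerializedAsStruct(inner)
    case StringEnc => false
  }

  // A struct at top level, unless it is an Option or a transformer over an Option.
  def isSerializedAsStructForTopLevel(enc: Enc): Boolean =
    isSerializedAsStruct(enc) && !enc.isInstanceOf[OptionEnc] && !transformerOfOption(enc)
}
```

The recursion in `transformerOfOption` is the part that matters: a transformer wrapped around another transformer around an Option is still treated as an Option at top level.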

hvanhovell · 2025-02-25T16:29:59Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala

@@ -142,6 +142,19 @@ case class OptionNestedGeneric[T](list: Option[T])
 case class MapNestedGenericKey[T](list: Map[T, Int])
 case class MapNestedGenericValue[T](list: Map[Int, T])

+// ADT encoding for TransformingEncoder test
+trait Base {


For this case I really want to add some sort of a UnionEncoder. That either nests the individual implementations, or flattens them (+ a discriminator field).

It would save an intermediary representation. The problem is the scope of the field: locking it down to one field doesn't seem right, but if the field could optionally be dropped, users could supply it as a generated Column. I'll take a stab at it post rc2.
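The two layouts mentioned for a possible UnionEncoder (nesting the implementations, or flattening them with a discriminator field) can be sketched in plain Scala. This is purely illustrative: no UnionEncoder exists in this PR, and `Shape`, `NestedRow`, `FlatRow` and the `kind` discriminator are hypothetical names chosen for the example.

```scala
// Toy model of the two layouts; all names here are hypothetical.
object UnionLayoutSketch {
  sealed trait Shape
  final case class Circle(radius: Double) extends Shape
  final case class Rect(width: Double, height: Double) extends Shape

  // Layout 1, nested: one nullable struct column per implementation.
  final case class NestedRow(circle: Option[Circle], rect: Option[Rect])

  // Layout 2, flattened: the union of all fields plus a discriminator column.
  final case class FlatRow(
      kind: String,
      radius: Option[Double],
      width: Option[Double],
      height: Option[Double])

  def toNested(s: Shape): NestedRow = s match {
    case c: Circle => NestedRow(Some(c), None)
    case r: Rect => NestedRow(None, Some(r))
  }

  def fromNested(row: NestedRow): Option[Shape] = row.circle.orElse(row.rect)

  def toFlat(s: Shape): FlatRow = s match {
    case Circle(r) => FlatRow("circle", Some(r), None, None)
    case Rect(w, h) => FlatRow("rect", None, Some(w), Some(h))
  }

  // The discriminator decides which fields are read back.
  def fromFlat(row: FlatRow): Option[Shape] = row match {
    case FlatRow("circle", Some(r), _, _) => Some(Circle(r))
    case FlatRow("rect", _, Some(w), Some(h)) => Some(Rect(w, h))
    case _ => None
  }
}
```

The flattened layout is where the scope question in the reply shows up: the `kind` column is an extra field that the original ADT does not carry.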

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SerializerBuildHelper.scala

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

hvanhovell · 2025-02-26T02:25:03Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+    assert(ds.collect().toVector === data.toVector)
+  }
+
+  test("""Encoder derivation with TransformingEncoder of OptionEncoder""".stripMargin) {


Is this illustrating a different issue than above?

not really, it's only testing the recursion works in the ExpressionEncoder() detection. happy to remove

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala

hvanhovell

Looks good overall.

I am fine with merging it as is. Or I can wait a bit so you can address the comments. I would like to get this in by RC2 (this will be cut end of this week). Please let me know what works for you.

…ullable

…te, the types are proved by round tripping anyway

chris-twiner · 2025-02-26T14:42:45Z

Looks good overall.

I am fine with merging it as is. Or I can wait a bit so you can address the comments. I would like to get this in by RC2 (this will be cut end of this week). Please let me know what works for you.

FYI, as you are coming online now: I've got every point of the feedback implemented (excluding the two post-rc2 notes I mentioned above: the nullable top-level field and the union). The test failures are in long-running tests and are not a real issue, and scalafmt works locally at least. I'm retrying the failures...

…ses (Spark or frameless)

hvanhovell

LGTM

hvanhovell · 2025-02-28T19:23:41Z

@chris-twiner can you fix the style issue?

…on - fun

chris-twiner · 2025-02-28T20:38:11Z

@chris-twiner can you fix the style issue?

@hvanhovell - yeah done

hvanhovell · 2025-03-01T01:01:52Z

Merging to master/4.0. Thanks!

Closes #50023 from chris-twiner/temp/expressionEncoder_compat_TransformingEncoder_fixes.

Authored-by: Chris Twiner <chris.twiner@gmail.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 50a328b)
Signed-off-by: Herman van Hovell <herman@databricks.com>

[SPARK-49960] Custom ExpressionEncoder support and TransformingEncoder fixes (db54298)

github-actions bot added the SQL label Feb 20, 2025

chris-twiner mentioned this pull request Feb 20, 2025

[SPARK-49960][SQL] Provide extension point for custom AgnosticEncoder serde #48477

Closed

[SPARK-49960] Custom ExpressionEncoder support and TransformingEncoder fixes - add deprecation note (1a20169)

[SPARK-49960] Test case is invalid - value classes should not have TransformingEncoder anyway (cff4eeb)

hvanhovell reviewed Feb 25, 2025

sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/AgnosticEncoder.scala (outdated)

hvanhovell reviewed Feb 25, 2025
