[SPARK-51290][SQL] Enable filling default values in DSv2 writes #50044

aokolnychyi · 2025-02-21T18:53:49Z

What changes were proposed in this pull request?

This PR enables filling default values in DSv2 writes.

Why are the changes needed?

These changes are needed for proper support of default values for DSv2 connectors.

Does this PR introduce any user-facing change?

Users will be able to omit columns with default values. There is no impact to existing jobs.

How was this patch tested?

This patch comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

aokolnychyi · 2025-02-21T19:01:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -3534,7 +3534,8 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
        TableOutputResolver.suitableForByNameCheck(v2Write.isByName,
          expected = v2Write.table.output, queryOutput = v2Write.query.output)
        val projection = TableOutputResolver.resolveOutputColumns(
-          v2Write.table.name, v2Write.table.output, v2Write.query, v2Write.isByName, conf)
+          v2Write.table.name, v2Write.table.output, v2Write.query, v2Write.isByName, conf,
+          supportColDefaultValue = true)


I don't think there is value in validating whether the catalog defines SUPPORT_COLUMN_DEFAULT_VALUE in its capabilities during writes. If a connector includes default value metadata in its schema, that should be enough to fill default values. The flag exists for ALTER and CREATE/REPLACE statements.

Yeah, true. Spark fills the default values during table writes, and this works for all catalogs.

Do you mean it doesn't matter whether supportColDefaultValue is true or false for v2 here?

Oh, I see. You mean checking the catalog for the SUPPORT_COLUMN_DEFAULT_VALUE flag here.

Correct, I don't see value in checking SUPPORT_COLUMN_DEFAULT_VALUE here.
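The behavior discussed in this thread can be sketched in plain Scala: a by-name write fills a missing column from the default recorded in the table schema, without consulting any catalog capability. This is a simplified model, not Spark's TableOutputResolver; `Column`, `DefaultFillSketch`, and `resolveByName` are hypothetical names used only for illustration.

```scala
// Simplified model of by-name write resolution; not Spark's TableOutputResolver.
// All names here are hypothetical and exist only for this illustration.
final case class Column(name: String, default: Option[Any] = None)

object DefaultFillSketch {
  /** Resolves one row against the table schema by column name.
    * A column missing from the row takes its declared default;
    * a column with neither a value nor a default is an error.
    */
  def resolveByName(
      table: Seq[Column],
      row: Map[String, Any]): Either[String, Seq[Any]] = {
    // Columns in the row that the table does not define are rejected up front.
    val extra = row.keySet -- table.map(_.name)
    if (extra.nonEmpty) {
      Left(s"extra columns: ${extra.toSeq.sorted.mkString(", ")}")
    } else {
      val resolved = table.map { col =>
        row.get(col.name)
          .orElse(col.default)
          .toRight(s"cannot find data for column ${col.name}")
      }
      // Report the first failure, otherwise return values in table column order.
      resolved.collectFirst { case Left(err) => err } match {
        case Some(err) => Left(err)
        case None => Right(resolved.collect { case Right(v) => v })
      }
    }
  }
}
```

Note that the only input is the schema metadata itself, which mirrors the point above that default value metadata in the schema should be enough to fill defaults on write.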

aokolnychyi · 2025-02-21T19:05:37Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala

@@ -718,6 +724,11 @@ private class BufferedRowsReader(
      schema: StructType,
      row: InternalRow): Any = {
    val index = schema.fieldIndex(field.name)
+
+    if (index >= row.numFields) {


This is needed to support adding columns with default values at the end of the schema.

This is the extractFieldValue method, which looks like it is only used by get. Why is this needed for adding columns?

This is needed to read data inserted prior to adding columns to the schema. In that case the schema has more columns than the stored row, and we have to fill the new columns using their existence default values.
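As a rough illustration of the read path described here, the sketch below models a row as a plain sequence and returns the existence default when the requested field lies beyond the end of the stored row. It does not use Spark's StructType or InternalRow; `Field`, `existenceDefault`, and `ExistenceDefaultSketch` are hypothetical stand-ins, and falling back to null when no default is recorded is an assumption.

```scala
// Simplified model of reading a row written before a column was added.
// Not the actual InMemoryBaseTable code; all names are hypothetical.
final case class Field(name: String, existenceDefault: Option[Any] = None)

object ExistenceDefaultSketch {
  def extractFieldValue(
      schema: Seq[Field],
      row: IndexedSeq[Any],
      fieldName: String): Any = {
    val index = schema.indexWhere(_.name == fieldName)
    require(index >= 0, s"unknown field: $fieldName")
    if (index >= row.length) {
      // The row predates this column, so it holds no value for it.
      // Fall back to the existence default (assumed null when none is set).
      schema(index).existenceDefault.getOrElse(null)
    } else {
      row(index)
    }
  }
}
```

For example, with a schema of `id` and a later-added `dep` whose existence default is `"unknown"`, a stored row containing only `id` yields `"unknown"` for `dep`.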

aokolnychyi · 2025-02-21T23:16:11Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/V2WriteAnalysisSuite.scala

@@ -423,8 +423,8 @@ abstract class V2WriteAnalysisSuiteBase extends AnalysisTest {
    assertNotResolved(parsedPlan)
    assertAnalysisErrorCondition(
      parsedPlan,
-      expectedErrorCondition = "INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA",
-      expectedMessageParameters = Map("tableName" -> "`table-name`", "colName" -> "`x`")
+      expectedErrorCondition = "INCOMPATIBLE_DATA_FOR_TABLE.EXTRA_COLUMNS",


This is because of spark.sql.defaultColumn.useNullsForMissingDefaultValues and is aligned with V1 writes.

aokolnychyi · 2025-02-21T23:21:07Z

cc @cloud-fan @szehon-ho @amaliujia @gengliangwang @dongjoon-hyun @viirya @huaxingao

cloud-fan · 2025-02-24T13:17:33Z

sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala

@@ -328,7 +336,7 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase {
  }

  test("SPARK-39383 DEFAULT columns on V2 data sources with ALTER TABLE ADD/ALTER COLUMN") {
-    withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") {
+    withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format,$catalog") {


How does this conf affect the v2 in-memory testing catalog? I thought it's only for the v1 file source.

I think that it is because https://github.com/apache/spark/pull/50044/files#r1968081855

@viirya is correct. We previously passed "" as the provider in the in-memory connector, which required workarounds like this. That is no longer needed now that we pass the catalog name as the provider, which simplifies testing.
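To make the old workaround concrete, here is a minimal sketch of an allow-list check. It assumes the conf value is a comma-separated list whose entries are trimmed before comparison; that parsing rule is an assumption for illustration, not a statement of Spark's exact implementation, and `AllowedProvidersSketch` is a hypothetical name.

```scala
// Hypothetical allow-list check, assuming comma-separated, trimmed entries.
object AllowedProvidersSketch {
  def isAllowed(confValue: String, provider: String): Boolean =
    confValue.split(",").map(_.trim).contains(provider)
}
```

Under that assumption, a value such as `"foo, "` contains an entry that trims to the empty string, so an empty provider passes the check. Once the catalog name is passed as the provider, listing that name explicitly is sufficient.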

viirya · 2025-02-24T17:15:12Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala

@@ -122,7 +122,7 @@ class BasicInMemoryTableCatalog extends TableCatalog {
  override def alterTable(ident: Identifier, changes: TableChange*): Table = {
    val table = loadTable(ident).asInstanceOf[InMemoryTable]
    val properties = CatalogV2Util.applyPropertiesChanges(table.properties, changes)
-    val schema = CatalogV2Util.applySchemaChanges(table.schema, changes, None, "ALTER TABLE")
+    val schema = CatalogV2Util.applySchemaChanges(table.schema, changes, Some(name), "ALTER TABLE")


Do we need to add the in-memory table provider to DEFAULT_COLUMN_ALLOWED_PROVIDERS?

Oh, nvm, the name is given when initializing the catalog.

[SPARK-51290][SQL] Enable filling default values in DSv2 writes

62c0840

github-actions bot added the SQL label Feb 21, 2025

aokolnychyi commented Feb 21, 2025

Adapt tests

6ced18e

aokolnychyi commented Feb 21, 2025

cloud-fan reviewed Feb 24, 2025

viirya reviewed Feb 24, 2025

aokolnychyi closed this Mar 1, 2025

aokolnychyi reopened this Mar 1, 2025
