[Spark] Add tests for implicit cast in INSERT to delta table #3605
Conversation
 * Each takes a unique path through analysis. The abstractions below capture these different
 * inserts to allow more easily running tests with all or a subset of them.
 */
trait DeltaInsertIntoTest extends QueryTest with DeltaDMLTestUtils with DeltaSQLCommandTest {
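To give an idea of the shape of these abstractions, here is a minimal sketch, assuming a SparkSession is available; the names `SQLAppend`, `DataFrameAppend`, and `runInsert` are illustrative rather than the actual test API:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: each insert flavor implements a common trait so the test runner can
// iterate over all of them or a subset. Names here are hypothetical.
trait Insert {
  def name: String
  def runInsert(spark: SparkSession, target: String, source: String): Unit
}

object SQLAppend extends Insert {
  override val name = "INSERT INTO (SQL)"
  override def runInsert(spark: SparkSession, target: String, source: String): Unit =
    spark.sql(s"INSERT INTO $target SELECT * FROM $source")
}

object DataFrameAppend extends Insert {
  override val name = "df.write.insertInto"
  // insertInto resolves columns by position, unlike saveAsTable which is by name.
  override def runInsert(spark: SparkSession, target: String, source: String): Unit =
    spark.table(source).write.insertInto(target)
}

// A runner can then take all insert types or a subset:
// val allInsertTypes: Seq[Insert] = Seq(SQLAppend, DataFrameAppend /* , ... */)
```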
This is moved from TypeWideningInsertSchemaEvolutionSuite. StreamingInsert was added in the process to cover streaming writes, and the test runner method testInserts was updated to make it slightly more expressive.
Tests covering streaming inserts assume #3443 is merged; they will fail until that change actually lands and will succeed afterwards.
}

/** df.writeStream.toTable() */
object StreamingInsert extends Insert { self: Insert =>
This is Structured Streaming. Not sure whether we also want to test DStreams (Spark Streaming), and whether it makes a difference, since Structured Streaming should be built on top of DStreams.
This is using the default trigger. Do we want to test the behaviour with multiple batches in which only one leads to type widening?
I didn't know about DStreams; I'm not sure whether they can be used directly to write to a Delta table, or whether that's something we think should be supported. I would leave it out of this PR.
For multiple batches: this PR isn't about type widening but implicit casting.
I have a PR open for type widening in the Delta sink which includes a test where we write multiple batches manually: https://github.com/delta-io/delta/pull/3626/files#diff-82a2720fce9d77504dc9f8e0365dc106e2113b4933d18fec17293b0244504460R197
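For reference, a rough sketch of how a Structured Streaming insert into a Delta table can be exercised with the default trigger, assuming a running SparkSession; `MemoryStream` is a Spark test utility, and the table and checkpoint names are placeholders:

```scala
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.execution.streaming.MemoryStream

def streamingInsert(spark: SparkSession, target: String, checkpoint: String): Unit = {
  import spark.implicits._
  implicit val sqlContext: SQLContext = spark.sqlContext

  val source = MemoryStream[(Int, Long)]
  source.addData((1, 1L), (2, 2L))

  // Default trigger: pending data is processed as micro-batches.
  val query = source.toDF().toDF("a", "b")
    .writeStream
    .option("checkpointLocation", checkpoint)
    .toTable(target)

  query.processAllAvailable() // drain the pending batch
  query.stop()
}
```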
initialSchemaDDL: String,
initialJsonData: Seq[String],
partitionBy: Seq[String] = Seq.empty,
insertSchemaDDL: String,
insertJsonData: Seq[String],
overwriteWhere: (String, Int),
expectedSchema: StructType = null,
checkError: SparkThrowable => Unit = null,
includeInserts: Seq[Insert] = allInsertTypes,
excludeInserts: Seq[Insert] = Seq.empty,
confs: Seq[(String, String)] = Seq.empty): Unit = {
This list is getting long, and it also seems like some of the data goes together. E.g. (initialSchemaDDL, initialJsonData) and (insertSchemaDDL, insertJsonData) should probably have been case classes, and (includeInserts, excludeInserts) can be one variable passed in with a default value of allInsertTypes. This PR is already quite big, so if you choose to do this additional refactor it would be better to do it in another PR.
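For instance, the grouping could look something like this sketch (the names `TableSetup`, `initial`, and `insert` are hypothetical, not part of the PR):

```scala
// Hypothetical grouping of the paired parameters suggested above.
case class TableSetup(schemaDDL: String, jsonData: Seq[String])

// testInserts could then take:
//   initial: TableSetup, insert: TableSetup,
//   inserts: Seq[Insert] = allInsertTypes
```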
Good points. I've merged expectedSchema and checkError into a single argument since I'm touching this arg here already, but left the other ones as-is to avoid having to update all tests and get a massive diff. I'll keep that in mind to refactor and simplify later on.
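One possible shape for that merged argument is a small sealed type, sketched below; the name `ExpectedResult` and its cases are illustrative, not necessarily the exact shape used in the PR:

```scala
import org.apache.spark.SparkThrowable
import org.apache.spark.sql.types.StructType

// Hypothetical single `expectedResult` argument replacing the
// `expectedSchema` / `checkError` pair: exactly one outcome per test.
sealed trait ExpectedResult
object ExpectedResult {
  /** The insert is expected to succeed, leaving the table with this schema. */
  case class Success(expectedSchema: StructType) extends ExpectedResult
  /** The insert is expected to fail; the callback validates the error. */
  case class Failure(checkError: SparkThrowable => Unit) extends ExpectedResult
}
```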
lgtm!
class DeltaInsertIntoImplicitCastSuite extends DeltaInsertIntoTest {

  for (schemaEvolution <- BOOLEAN_DOMAIN) {
    testInserts("insert with implicit up and down cast on top-level fields, " +
nit: there's no implicit up or down cast here since the schema remains the same, is there?
The schema of the inserted data isn't the same as the schema of the table:

table: a long, b int
data: a int, b long

so there is an upcast for a and a downcast for b.
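Concretely, the scenario looks roughly like this, assuming a SparkSession named `spark` and placeholder table names:

```scala
// Table and inserted data use int/long in opposite positions, so during the
// insert `a` is implicitly upcast (int -> long) and `b` downcast (long -> int).
spark.sql("CREATE TABLE target (a long, b int) USING delta")
spark.sql("CREATE TABLE source (a int, b long) USING delta")
spark.sql("INSERT INTO source VALUES (1, 2)")
spark.sql("INSERT INTO target SELECT a, b FROM source")
```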
testInserts("insert with implicit up and down cast on fields nested in map, " + | ||
s"schemaEvolution=$schemaEvolution")( | ||
initialSchemaDDL = "key int, m map<string, struct<x: long, y: int>>", |
Should we test the map's key as well?
This is going to be hard to test here: these tests rely on parsing JSON to set up the table and ingest data, and I don't believe that supports using anything other than a string for map keys. That would be a pretty significant rewrite of all the tests.
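A short sketch of why: JSON object keys are always strings, so a schema like `map<int, ...>` has no direct representation in JSON test data. The snippet below is illustrative, assuming an active SparkSession:

```scala
import org.apache.spark.sql.SparkSession

// JSON object keys are strings, so the map key type is effectively pinned to
// string when test data is expressed as JSON documents like this one.
val spark: SparkSession = SparkSession.active
import spark.implicits._

val df = spark.read
  .schema("key int, m map<string, struct<x: long, y: int>>")
  .json(Seq("""{"key": 1, "m": {"k": {"x": 1, "y": 2}}}""").toDS())
```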
Follow-up in #3762:

## Description

Follow on #3605. Adds more tests covering behavior for all ways of running insert with:

- an extra column or struct field in the input, in `DeltaInsertIntoSchemaEvolutionSuite`
- a missing column or struct field in the input, in `DeltaInsertIntoImplicitCastSuite`
- a different column or field ordering than the table schema, in `DeltaInsertIntoColumnOrderSuite`

Note: tests are spread across multiple suites as each test case covers 20 different ways to run inserts, quickly leading to large test suites.

This change includes improvements to `DeltaInsertIntoTest`:

- Group all types of inserts into categories that are easier to reference in tests:
  - SQL vs. Dataframe inserts
  - Position-based vs. name-based inserts
  - Append vs. overwrite
- Provide a mechanism to ensure that each test covers all existing insert types.

## How was this patch tested?

N/A: test only
## Description

Add tests covering implicit casting when inserting into a Delta table. Covers the various insert APIs.

Changes: the insert test framework is moved out of `TypeWideningInsertSchemaEvolutionSuite` and into its own trait to allow reusability.

## How was this patch tested?

Test-only

## Does this PR introduce any user-facing changes?

No