forked from apache/spark
Spark 49297 #6
Merged
Conversation
### What changes were proposed in this pull request? Trim collation is currently in the implementation phase. This change blocks all paths from using it; afterwards, as trim collation gets enabled for different expressions, it will be gradually whitelisted. ### Why are the changes needed? Trim collation is currently in the implementation phase. This change blocks all paths from using it; afterwards, as trim collation gets enabled for different expressions, it will be gradually whitelisted. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No additional tests, just added a field that's not used. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48336 from jovanpavl-db/block-collation-trim. Lead-authored-by: Jovan Pavlovic <jovan.pavlovic@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…partition columns ### What changes were proposed in this pull request? Provide a more user-facing error when a partition column name can't be found in the table schema. ### Why are the changes needed? There's an issue where a partition column sometimes doesn't match any column from the table schema. When that happens we throw an assertion error, which is not user-friendly. Because of that, we introduced a new `QueryExecutionError` in order to make it more user-facing. ### Does this PR introduce _any_ user-facing change? Yes, users will get a more user-friendly error message. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48338 from mihailoale-db/mihailoale-db/fixdescribepartitioningmessage. Authored-by: Mihailo Aleksic <mihailo.aleksic@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? This patch moves `iterator.hasNext` into the try block of `tryWithSafeFinallyAndFailureCallbacks` in `FileFormatWriter.executeTask`. ### Why are the changes needed? Not only can `dataWriter.writeWithIterator(iterator)` cause an error; `iterator.hasNext` can also fail with an error like: ``` org.apache.spark.shuffle.FetchFailedException: Block shuffle_1_106_21 is corrupted but checksum verification passed ``` Since it is not wrapped in the try block, `abort` won't be called on the committer. But because `setupTask` has already been called, it is safer to call `abort` in any case where an error happens after it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48360 from viirya/try_block. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: huaxingao <huaxin.gao11@gmail.com>
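A self-contained Scala sketch of the pattern described above; the names (`writeAll`, the failing iterator) are illustrative stand-ins, not Spark's internal `FileFormatWriter` API, but they show why the `hasNext` probe has to sit inside the same guarded block as the writes so the abort path still runs.

```scala
// Illustrative sketch only: stand-ins for the data writer and committer, not Spark internals.
object WrapHasNextExample {
  def writeAll(iterator: Iterator[Int]): Unit = {
    try {
      // The hasNext call itself can throw (e.g. a corrupted shuffle fetch),
      // so it must live inside the try block alongside the writes.
      while (iterator.hasNext) {
        val row = iterator.next()
        // ... write row, then commit after the loop ...
      }
    } catch {
      case t: Throwable =>
        // With hasNext inside the try block, any failure after setupTask
        // reaches this point, where the committer/data writer can be aborted.
        println("abort task attempt")
        throw t
    }
  }

  def main(args: Array[String]): Unit = {
    val failing = new Iterator[Int] {
      def hasNext: Boolean = throw new RuntimeException("Block shuffle_1_106_21 is corrupted")
      def next(): Int = 0
    }
    try writeAll(failing)
    catch { case _: RuntimeException => println("failure surfaced after abort") }
  }
}
```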
…sformWithStateInPandas ### What changes were proposed in this pull request? Implement TTL support for ListState in TransformWithStateInPandas. ### Why are the changes needed? Allow users to add TTL to specific list state. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48253 from bogao007/ttl-list-state. Authored-by: bogao007 <bo.gao@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…th bad timezone ### What changes were proposed in this pull request? This PR proposes to fix the uncaught Java exception from `make_timestamp()` with bad timezone ### Why are the changes needed? To improve the error message ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing error message is changed: **Before** ``` spark-sql (default)> select make_timestamp(1, 2, 28, 23, 1, 1, -100); Invalid ID for ZoneOffset, invalid format: -100 ``` **After** ``` spark-sql (default)> select make_timestamp(1, 2, 28, 23, 1, 1, -100); [INVALID_TIMEZONE] The timezone: -100 is invalid. The timezone must be either a region-based zone ID or a zone offset. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH’, ‘(+|-)HH:mm’ or ‘(+|-)HH:mm:ss’, e.g ‘-08’, ‘+01:00’ or ‘-13:33:33’., and must be in the range from -18:00 to +18:00. 'Z' and 'UTC' are accepted as synonyms for '+00:00'. SQLSTATE: 22009 ``` ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48260 from itholic/SPARK-49773. Lead-authored-by: Haejoon Lee <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <haejoon@apache.org> Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com>
… lazy vals" This reverts commit d7abddc. We had offline discussion, and JoshRosen pointed out that: > The use of LazyTry vs. a non-try Lazy wrappers needs discussion: LazyTry caches failures. As a result, there is a potential risk of cancellation exceptions being cached here. This is possibly an issue in case AQE interrupts subquery execution threads during planning, or otherwise can interrupt a subset of a query plan. Closes apache#48362 from zhengruifeng/revert_fix_deadlock. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…class ### What changes were proposed in this pull request? Extract the preparation of df.sample to the parent class ### Why are the changes needed? deduplicate code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48365 from zhengruifeng/py_sql_sample. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…sRuntime argument ### What changes were proposed in this pull request? The proposal is to update the classifyException function so that it can return either `AnalysisException` or `SparkRuntimeException`. This is achieved by adding a new parameter, `isRuntime`, and modifying the return type to be `Throwable with SparkThrowable` for compatibility with both types. ### Why are the changes needed? The changes are needed to allow the classifyException function to be used in the execution part of the code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Not needed. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48351 from ivanjevtic-db/Change-classify-exception-function-signature. Lead-authored-by: ivanjevtic-db <ivan.jevtic@databricks.com> Co-authored-by: Ivan Jevtic <ivan.jevtic@databricks.com> Co-authored-by: milastdbx <milan.stefanovic@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
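A minimal, self-contained sketch of the `isRuntime` switch described above; the exception classes and trait below are stand-ins so the snippet compiles on its own, not Spark's actual `SparkThrowable` hierarchy or the real `classifyException` signature.

```scala
// Stand-in types; Spark's real hierarchy and dialect API differ.
trait ErrorWithCondition { def condition: String }
class FakeAnalysisException(val condition: String, msg: String, cause: Throwable)
  extends Exception(msg, cause) with ErrorWithCondition
class FakeSparkRuntimeException(val condition: String, msg: String, cause: Throwable)
  extends RuntimeException(msg, cause) with ErrorWithCondition

// The isRuntime flag decides which concrete exception is produced, while the
// return type `Throwable with ErrorWithCondition` stays compatible with both.
def classifyException(e: Throwable, condition: String, isRuntime: Boolean): Throwable with ErrorWithCondition =
  if (isRuntime) new FakeSparkRuntimeException(condition, e.getMessage, e)
  else new FakeAnalysisException(condition, e.getMessage, e)
```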
### What changes were proposed in this pull request? Apache Spark 4.0.0 RC1 vote will start on February 15th 2025 - https://spark.apache.org/versioning-policy.html This PR aims to update K8s docs to recommend K8s v1.29+ for Apache Spark 4.0.0. ### Why are the changes needed? **1. K8s community will release v1.32.0 on 2024-12-11** - https://github.com/kubernetes/sig-release/tree/master/releases/release-1.32#kubernetes-132 **2. Default K8s Versions in Public Cloud environments** The default K8s versions of public cloud providers are already moving to K8s 1.30, like the following. - EKS: v1.30 (Default) - GKE: v1.30 (Stable), v1.30 (Regular), v1.31 (Rapid) - AKS: v1.29 (Default), v1.30 (Support) **3. End Of Support** In addition, K8s 1.28 reached or will reach a standard support EOL around the Apache Spark 4.0.0 release.

| K8s | EKS | GKE | AKS |
| ---- | ------- | ------- | ------- |
| 1.27 | 2024-11 | 2025-02-04 | 2025-03 |

- [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar) - [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) - [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule) ### Does this PR introduce _any_ user-facing change? - No, this is a documentation-only change about K8s versions. - Apache Spark K8s Integration Test is currently using K8s **v1.31.0** on Minikube already. ``` * Preparing Kubernetes v1.31.0 on Docker 27.2.0 ... ``` ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48371 from dongjoon-hyun/SPARK-49896. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Replace `TreeNode.allChildren` with an identity map (it was previously a `Set`) to avoid lazy hashcode evaluation. ### Why are the changes needed? We hit a deadlock between shuffle dependency initialization and explain string generation, in which both code paths trigger lazy variable instantiation while visiting some tree nodes in different orders, and thus acquire object locks in reversed order. The hashcode of a plan node is implemented as a lazy val. So the fix is to remove hash code computation from explain string generation to break the chain of lazy variable instantiation. `TreeNode.allChildren` is only used in explain string generation and only requires identity equality. This should also be a small performance improvement, BTW. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Current UTs ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#48375 from liuzqt/SPARK-49852. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
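A small Scala illustration of the idea (not the `TreeNode` source): an identity-based set never calls the elements' `hashCode`, so collecting nodes for explain-string generation cannot force the lazy hash-code evaluation that participated in the deadlock.

```scala
import java.util.{Collections, IdentityHashMap}

// Stand-in for a plan node whose hashCode is lazily computed in real Spark.
class PlanNode(val name: String) {
  override def hashCode(): Int = { PlanNode.hashCodeCalls += 1; name.hashCode }
}
object PlanNode { var hashCodeCalls = 0 }

object IdentitySetExample extends App {
  // Identity-based set: membership uses reference equality, never hashCode/equals.
  val seen: java.util.Set[PlanNode] =
    Collections.newSetFromMap(new IdentityHashMap[PlanNode, java.lang.Boolean]())

  val node = new PlanNode("Project")
  seen.add(node)
  println(seen.contains(node))    // true
  println(PlanNode.hashCodeCalls) // 0 -- the node's hashCode was never forced
}
```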
…lumn field operations ### What changes were proposed in this pull request? Refine the string representation of column field operations: `GetField`, `WithField`, and `DropFields` ### Why are the changes needed? make the string representations consistent between pyspark classic and connect ### Does this PR introduce _any_ user-facing change? yes before ``` In [1]: from pyspark.sql import functions as sf In [2]: c = sf.col("c") In [3]: c.x Out[3]: Column<'UnresolvedExtractValue(c, x)'> ``` after ``` In [1]: from pyspark.sql import functions as sf In [2]: c = sf.col("c") In [3]: c.x Out[3]: Column<'c['x']'> ``` ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48369 from zhengruifeng/py_connect_col_str. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This PR adds SQL pipe syntax support for the JOIN operator. For example: ``` CREATE TEMPORARY VIEW join_test_t1 AS SELECT * FROM VALUES (1) AS grouping(a); CREATE TEMPORARY VIEW join_test_empty_table AS SELECT a FROM join_test_t1 WHERE FALSE; TABLE join_test_t1 |> FULL OUTER JOIN join_test_empty_table ON (join_test_t1.a = join_test_empty_table.a); 1 NULL ``` ### Why are the changes needed? The SQL pipe operator syntax will let users compose queries in a more flexible fashion. ### Does this PR introduce _any_ user-facing change? Yes, see above. ### How was this patch tested? This PR adds a few unit test cases, but mostly relies on golden file test coverage. I did this to make sure the answers are correct as this feature is implemented and also so we can look at the analyzer output plans to ensure they look right as well. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48270 from dtenedor/pipe-join. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rim` ### What changes were proposed in this pull request? Add the argument `trim` for the functions `trim/ltrim/rtrim` ### Why are the changes needed? this argument is missing in PySpark: we can specify it on the Scala side but cannot do it in Python. ### Does this PR introduce _any_ user-facing change? yes, new argument supported ### How was this patch tested? added doctests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48363 from zhengruifeng/func_trim_str. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
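For reference, a minimal sketch of the existing Scala-side overloads that the PR mirrors in PySpark; the local SparkSession setup is only for illustration, and the Python parameter name is the one documented by the PR itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}

object TrimStringExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("trim-example").getOrCreate()
    import spark.implicits._

    val df = Seq("***hello***").toDF("s")
    df.select(
      trim(col("s"), "*"),  // strip '*' from both ends  -> "hello"
      ltrim(col("s"), "*"), // strip '*' from the left   -> "hello***"
      rtrim(col("s"), "*")  // strip '*' from the right  -> "***hello"
    ).show(truncate = false)

    spark.stop()
  }
}
```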
…metrics accumulator logging flag from SPARK-42204 ### What changes were proposed in this pull request? This PR corrects an unintentional default behavior change from apache#39763 That PR introduced a new configuration, `spark.eventLog.includeTaskMetricsAccumulators`, to provide an ability for users to disable the redundant logging of task metrics information via the Accumulables field in the Spark event log task end logs. I made a mistake in updating that PR description and code from the original version: the description says that the intent is to not change out of the box behavior, but the actual flag default was the opposite. This new PR corrects both the flag default and the flag description to reflect the original intent of not changing default behavior. ### Why are the changes needed? Roll back an unintentional behavior change. ### Does this PR introduce _any_ user-facing change? Yes, it rolls back an unintentional default behavior change. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48372 from JoshRosen/fix-event-log-accumulable-defaults. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
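A short sketch of how the flag named above is set; choosing to exclude the accumulators is a per-application decision, and after this fix leaving the setting out keeps the historical (pre-SPARK-42204) behavior of including them.

```scala
import org.apache.spark.SparkConf

object EventLogConfExample {
  // Event logging with the task-metric accumulators explicitly excluded;
  // omitting the second setting keeps the default, which this PR restores
  // to the long-standing behavior of including them in task-end events.
  val conf: SparkConf = new SparkConf()
    .setAppName("event-log-example")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.includeTaskMetricsAccumulators", "false")
}
```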
### What changes were proposed in this pull request? The pr aims to upgrade `Parquet` from `1.14.2` to `1.14.3`. ### Why are the changes needed? The full release notes: https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.14.3 apache/parquet-java#3007: Ensure version specific Jackson classes are shaded apache/parquet-java#3013: Fix potential ClassCastException at reading DELTA_BYTE_ARRAY encoding ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48378 from panbingkun/SPARK-49903. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR aims to upgrade `dropwizard metrics` from `4.2.27` to `4.2.28`. ### Why are the changes needed? v4.2.27 vs v4.2.28: dropwizard/metrics@v4.2.27...v4.2.28 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48377 from panbingkun/SPARK-49901. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… error for PartitioningUtils ### What changes were proposed in this pull request? Improve Spark user experience by introducing a new error type, `CONFLICTING_DIRECTORY_STRUCTURES`, for `PartitioningUtils`. ### Why are the changes needed? `PartitioningUtils.parsePartitions(...)` uses an assertion to check if partitions are misconfigured. We should use a proper error type for this case. ### Does this PR introduce _any_ user-facing change? Yes, the error will be nicer. ### How was this patch tested? Updated the existing tests. ### Was this patch authored or co-authored using generative AI tooling? `copilot.vim`. Closes apache#48383 from vladimirg-db/vladimirg-db/introduce-conflicting-directory-structures-error. Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This PR does two things: - It adds shims for SparkContext and RDD. These are in a separate module. This module is a compile time dependency for sql/api, and a regular dependency for connector/connect/client/jvm. We remove this dependency in catalyst and connect-server because those should use the actual implementation. - It adds RDD (and the one SparkContext) based method to the shared Scala API. For connect these methods throw an unsupported operation exception. ### Why are the changes needed? We are creating a shared Scala interface for Classic and Connect. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. I will add a couple on the connect side. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48065 from hvanhovell/SPARK-49569. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
…SELECT clause ### What changes were proposed in this pull request? Introduced a specific error message for cases where a trailing comma appears at the end of the SELECT clause. ### Why are the changes needed? The previous error message was unclear and often pointed to an incorrect location in the query, leading to confusion. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48370 from stefankandic/fixTrailingComma. Lead-authored-by: Stefan Kandic <stefan.kandic@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… keep the behavior same ### What changes were proposed in this pull request? This PR is a followup of apache#47688 that keeps `Column.toString` as the same before. ### Why are the changes needed? To keep the same behaviour with Spark Classic and Connect. ### Does this PR introduce _any_ user-facing change? No, the main change has not been released out yet. ### How was this patch tested? Will be added separately. I manually tested: ```scala import org.apache.spark.sql.functions.col val name = "with`!#$%dot".replace("`", "``") col(s"`${name}`").toString.equals("with`!#$%dot") ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48376 from HyukjinKwon/SPARK-49022-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ty of nested structs ### What changes were proposed in this pull request? - Fixes a bug in `NormalizeFloatingNumbers` to respect the `nullable` attribute of nested expressions when normalizing. ### Why are the changes needed? - Without the fix, there would be a degradation in the nullability of the expression post normalization. - For example, for an expression like: `namedStruct("struct", namedStruct("double", <DoubleType-field>)) ` with the following data type: ``` StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), false, {})) ``` after normalizing we would have ended up with the dataType: ``` StructType(StructField("struct", StructType(StructField("double", DoubleType, true, {})), true, {})) ``` Note, the change in the `nullable` attribute of the "double" StructField from `false` to `true`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Added unit test. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48331 from nikhilsheoran-db/SPARK-49863-fix. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This PR adds SQL pipe syntax support for the set operations: UNION, INTERSECT, EXCEPT, DISTINCT. For example: ``` CREATE TABLE t(x INT, y STRING) USING CSV; INSERT INTO t VALUES (0, 'abc'), (1, 'def'); TABLE t |> UNION ALL (SELECT * FROM t); 0 abc 0 abc 1 def 1 def 1 NULL ``` ### Why are the changes needed? The SQL pipe operator syntax will let users compose queries in a more flexible fashion. ### Does this PR introduce _any_ user-facing change? Yes, see above. ### How was this patch tested? This PR adds a few unit test cases, but mostly relies on golden file test coverage. I did this to make sure the answers are correct as this feature is implemented and also so we can look at the analyzer output plans to ensure they look right as well. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48359 from dtenedor/pipe-union. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? The pr aims to fix the `pretty name` of some `expressions`, includes: `random`, `to_varchar`, `current_database`, `curdate`, `dateadd` and `array_agg`. ### Why are the changes needed? The actual function name used does not match the displayed name, as shown below: - Before: <img width="573" alt="image" src="https://github.com/user-attachments/assets/f5785c80-f6cb-494f-a15e-9258eca688a7"> - After: <img width="570" alt="image" src="https://github.com/user-attachments/assets/792a7092-ccbf-49f4-a616-19110e5c2361"> ### Does this PR introduce _any_ user-facing change? Yes, Make the header of the data seen by the end-user from `Spark SQL` consistent with the `actual function name` used. ### How was this patch tested? - Pass GA. - Update existed UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48385 from panbingkun/SPARK-49909. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? This PR adds interfaces for SparkSession Thread Locals. ### Why are the changes needed? We are creating a unified Spark SQL Scala interface. This is part of that effort. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48374 from hvanhovell/SPARK-49418. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? Currently, when running `Dataset.localCheckpoint(eager = true)`, it is impossible to specify a non-default StorageLevel for the checkpoint. On the other hand, it is possible with Dataset cache by using `Dataset.persist(newLevel: StorageLevel)`. If one wants to specify a non-default StorageLevel for localCheckpoint, it currently requires accessing the plan, changing the level, and then triggering an action to materialize the checkpoint: ``` // start lazy val checkpointDf = df.localCheckpoint(eager = false) // fish out the RDD val checkpointPlan = checkpointDf.queryExecution.analyzed val rdd = checkpointPlan.asInstanceOf[LogicalRDD].rdd // change the StorageLevel rdd.persist(StorageLevel.DISK_ONLY) // force materialization checkpointDf .mapPartitions(_ => Iterator.empty.asInstanceOf[Iterator[Row]]) .foreach((_: Row) => ()) ``` There are several issues with this: 1. It won't work with Connect, as we don't have access to RDD internals 2. Lazy checkpoint is not in fact lazy when AQE is involved. In order to get the RDD of a lazy checkpoint, AQE will actually trigger execution of all the query stages except the result stage in order to get the final plan. So the `start lazy` phase will already execute everything except the final stage, and then `force materialization` will only execute the result stage. This is "unexpected" and makes it more difficult to debug, first showing a query with missing metrics for the final stage, and then another query that skipped everything and only ran the final stage. Having an API to specify a storageLevel for localCheckpoint will help avoid such hacks. As a precedent, it is already possible to specify a StorageLevel for Dataset cache by using `Dataset.persist(newLevel: StorageLevel)`. In this PR, I implement this API for Scala and Python, and for Classic and Connect. ### Why are the changes needed? https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/merge/MergeIntoMaterializeSource.scala in `prepareMergeSource` has to do hacks as described above to use localCheckpoint with a non-default StorageLevel. It is hacky, and confusing that it then records two separate executions as described above. ### Does this PR introduce _any_ user-facing change? Yes. Adds an API to pass `storageLevel` to Dataset `localCheckpoint`. ### How was this patch tested? Tests added. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Github Copilot (trivial code completions) Closes apache#48324 from juliuszsompolski/SPARK-49857. Authored-by: Julek Sompolski <Juliusz Sompolski> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
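With the API added here, the hack above collapses to a single call; a sketch assuming the overload takes the storage level directly (the exact parameter shape is defined in the PR), with the local SparkSession shown only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LocalCheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("local-checkpoint").getOrCreate()

    val df = spark.range(0, 1000).toDF("id")
    // Eager local checkpoint materialized at the requested storage level,
    // replacing the plan/RDD fishing shown above.
    val checkpointed = df.localCheckpoint(eager = true, storageLevel = StorageLevel.DISK_ONLY)
    println(checkpointed.count())

    spark.stop()
  }
}
```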
### What changes were proposed in this pull request? This PR proposes to assign proper error class for _LEGACY_ERROR_TEMP_1325 ### Why are the changes needed? To improve user facing error message by providing proper error condition and sql state ### Does this PR introduce _any_ user-facing change? Improve user-facing error message ### How was this patch tested? Updated the existing UT ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48346 from itholic/legacy_1325. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
…to the `spark-connect-shims` module to fix Maven build errors ### What changes were proposed in this pull request? This PR adds `scala-library` maven dependency to the `spark-connect-shims` module to fix Maven build errors. ### Why are the changes needed? Maven daily test pipeline build failed: - https://github.com/apache/spark/actions/runs/11255598249 - https://github.com/apache/spark/actions/runs/11256610976 ``` scaladoc error: fatal error: object scala in compiler mirror not found. Error: Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.9.1:doc-jar (attach-scaladocs) on project spark-connect-shims_2.13: MavenReportException: Error while creating archive: wrap: Process exited with an error: 1 (Exit value: 1) -> [Help 1] Error: Error: To see the full stack trace of the errors, re-run Maven with the -e switch. Error: Re-run Maven using the -X switch to enable full debug logging. Error: Error: For more information about the errors and possible solutions, please read the following articles: Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException Error: Error: After correcting the problems, you can resume the build with the command Error: mvn <args> -rf :spark-connect-shims_2.13 Error: Process completed with exit code 1. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - locally test: ``` build/mvn clean install -DskipTests -Phive ``` **Before** ``` [INFO] --- scala:4.9.1:doc-jar (attach-scaladocs) spark-connect-shims_2.13 --- scaladoc error: fatal error: object scala in compiler mirror not found. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary for Spark Project Parent POM 4.0.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ........................... SUCCESS [ 2.833 s] [INFO] Spark Project Tags ................................. SUCCESS [ 5.292 s] [INFO] Spark Project Sketch ............................... SUCCESS [ 5.675 s] [INFO] Spark Project Common Utils ......................... SUCCESS [ 16.762 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 7.735 s] [INFO] Spark Project Networking ........................... SUCCESS [ 11.389 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 9.159 s] [INFO] Spark Project Variant .............................. SUCCESS [ 3.618 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 9.692 s] [INFO] Spark Project Connect Shims ........................ FAILURE [ 2.478 s] [INFO] Spark Project Launcher ............................. SKIPPED [INFO] Spark Project Core ................................. SKIPPED [INFO] Spark Project ML Local Library ..................... SKIPPED [INFO] Spark Project GraphX ............................... SKIPPED [INFO] Spark Project Streaming ............................ SKIPPED [INFO] Spark Project SQL API .............................. SKIPPED [INFO] Spark Project Catalyst ............................. SKIPPED [INFO] Spark Project SQL .................................. SKIPPED [INFO] Spark Project ML Library ........................... SKIPPED [INFO] Spark Project Tools ................................ SKIPPED [INFO] Spark Project Hive ................................. SKIPPED [INFO] Spark Project Connect Common ....................... SKIPPED [INFO] Spark Avro ......................................... SKIPPED [INFO] Spark Protobuf ..................................... 
SKIPPED [INFO] Spark Project REPL ................................. SKIPPED [INFO] Spark Project Connect Server ....................... SKIPPED [INFO] Spark Project Connect Client ....................... SKIPPED [INFO] Spark Project Assembly ............................. SKIPPED [INFO] Kafka 0.10+ Token Provider for Streaming ........... SKIPPED [INFO] Spark Integration for Kafka 0.10 ................... SKIPPED [INFO] Kafka 0.10+ Source for Structured Streaming ........ SKIPPED [INFO] Spark Project Examples ............................. SKIPPED [INFO] Spark Integration for Kafka 0.10 Assembly .......... SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 01:15 min [INFO] Finished at: 2024-10-09T23:43:58+08:00 [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.9.1:doc-jar (attach-scaladocs) on project spark-connect-shims_2.13: MavenReportException: Error while creating archive: wrap: Process exited with an error: 1 (Exit value: 1) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <args> -rf :spark-connect-shims_2.13 ``` **After** ``` [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary for Spark Project Parent POM 4.0.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ........................... SUCCESS [ 2.766 s] [INFO] Spark Project Tags ................................. SUCCESS [ 5.398 s] [INFO] Spark Project Sketch ............................... SUCCESS [ 6.361 s] [INFO] Spark Project Common Utils ......................... SUCCESS [ 16.919 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 8.083 s] [INFO] Spark Project Networking ........................... SUCCESS [ 11.240 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 9.438 s] [INFO] Spark Project Variant .............................. SUCCESS [ 3.697 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 9.939 s] [INFO] Spark Project Connect Shims ........................ SUCCESS [ 2.938 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 6.502 s] [INFO] Spark Project Core ................................. SUCCESS [01:33 min] [INFO] Spark Project ML Local Library ..................... SUCCESS [ 18.220 s] [INFO] Spark Project GraphX ............................... SUCCESS [ 20.923 s] [INFO] Spark Project Streaming ............................ SUCCESS [ 29.949 s] [INFO] Spark Project SQL API .............................. SUCCESS [ 25.842 s] [INFO] Spark Project Catalyst ............................. SUCCESS [02:02 min] [INFO] Spark Project SQL .................................. SUCCESS [02:18 min] [INFO] Spark Project ML Library ........................... SUCCESS [01:38 min] [INFO] Spark Project Tools ................................ SUCCESS [ 3.365 s] [INFO] Spark Project Hive ................................. 
SUCCESS [ 45.357 s] [INFO] Spark Project Connect Common ....................... SUCCESS [ 33.636 s] [INFO] Spark Avro ......................................... SUCCESS [ 22.040 s] [INFO] Spark Protobuf ..................................... SUCCESS [ 24.557 s] [INFO] Spark Project REPL ................................. SUCCESS [ 13.843 s] [INFO] Spark Project Connect Server ....................... SUCCESS [ 35.587 s] [INFO] Spark Project Connect Client ....................... SUCCESS [ 33.929 s] [INFO] Spark Project Assembly ............................. SUCCESS [ 5.121 s] [INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 12.623 s] [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 16.908 s] [INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 23.664 s] [INFO] Spark Project Examples ............................. SUCCESS [ 30.777 s] [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 6.997 s] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 15:40 min [INFO] Finished at: 2024-10-09T23:27:20+08:00 [INFO] ------------------------------------------------------------------------ ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48399 from LuciferYang/SPARK-49569-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com>
…ove performance of `DeduplicateRelations` ### What changes were proposed in this pull request? This PR replaces the `HashSet` that is currently used with a `HashMap` to improve `DeduplicateRelations` performance. Additionally, this PR reverts apache#48053, as that change is no longer needed ### Why are the changes needed? The current implementation doesn't utilize the `HashSet` properly, but instead performs multiple linear searches on the set, creating O(n^2) complexity ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? Closes apache#48392 from mihailotim-db/mihailotim-db/master. Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
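A generic, self-contained contrast of the two approaches (not the `DeduplicateRelations` source): scanning a set for a matching key on every element is quadratic overall, while keying a map by the compared value keeps each lookup constant-time.

```scala
import scala.collection.mutable

object DedupLookupExample extends App {
  final case class RelationRef(id: Long, name: String)
  val rels = Seq(RelationRef(1, "t1"), RelationRef(2, "t2"), RelationRef(1, "t1_dup"))

  // O(n^2): each element triggers a linear scan (exists) over everything seen so far,
  // even though the collection is a HashSet.
  def duplicatesLinear(rs: Seq[RelationRef]): Seq[RelationRef] = {
    val seen = mutable.HashSet.empty[RelationRef]
    rs.filter { r => val dup = seen.exists(_.id == r.id); seen += r; dup }
  }

  // ~O(n): key the map by the id actually being compared, so each check is a hash lookup.
  def duplicatesKeyed(rs: Seq[RelationRef]): Seq[RelationRef] = {
    val seen = mutable.HashMap.empty[Long, RelationRef]
    rs.filter { r => val dup = seen.contains(r.id); if (!dup) seen.update(r.id, r); dup }
  }

  println(duplicatesLinear(rels) == duplicatesKeyed(rels)) // true: same result, different cost
}
```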
…DF_with_schema_string` ### What changes were proposed in this pull request? Reduce the python worker error log of `test_toDF_with_schema_string` ### Why are the changes needed? When I run the test locally ```python python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.tests.test_dataframe' ``` Two assertions in `test_toDF_with_schema_string` generate too many python worker error logs (~1k lines), which easily exceed the terminal's buffer and make it hard to debug. So I want to reduce the number of python workers in the two assertions. ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? manually tested; the logs are reduced to ~200 lines ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48388 from zhengruifeng/test_to_df_error. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…quired from stateful operators ### What changes were proposed in this pull request? This PR proposes to use a different ShuffleOrigin for the shuffle required from stateful operators. Spark has been using ENSURE_REQUIREMENTS as the ShuffleOrigin, which is open for optimization, e.g. AQE can adjust the shuffle spec. Quoting the code of ENSURE_REQUIREMENTS: ``` // Indicates that the shuffle operator was added by the internal `EnsureRequirements` rule. It // means that the shuffle operator is used to ensure internal data partitioning requirements and // Spark is free to optimize it as long as the requirements are still ensured. case object ENSURE_REQUIREMENTS extends ShuffleOrigin ``` But the distribution requirement for stateful operators is a lot more strict - it has to use all the expressions to calculate the hash (for partitioning), and the number of shuffle partitions must be the same as in the spec. This is because a stateful operator assumes that there is a 1:1 mapping between the partition for the operator and the "physical" partition for checkpointed state. That said, it is fragile if we allow any optimization to be made against the shuffle for a stateful operator. To prevent this, this PR introduces a new ShuffleOrigin with a note that the shuffle is not expected to be "modified". ### Why are the changes needed? The current behavior exposes a possibility of broken state based on the contract. We introduced StatefulOpClusteredDistribution for a similar reason. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48382 from HeartSaVioR/SPARK-49905. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request? Support for interval types was added to the variant spec. This PR removes this support and removes the ability to cast from interval types to variant and vice versa. ### Why are the changes needed? I implemented interval support for Variant before, but because the Variant spec type is supposed to be open and compatible with other engines which may not support all the ANSI Interval types, more thought needs to be put into the design of these intervals in Variant. ### Does this PR introduce _any_ user-facing change? Yes, after this change, users would no longer be able to cast between variants and intervals. ### How was this patch tested? Unit tests making sure that 1. It is not possible to construct variants containing intervals. 2. It is not possible to cast variants to intervals. 3. Interval IDs in variants are treated just like other unknown type IDs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48215 from harshmotw-db/harshmotw-db/disable_interval_2. Authored-by: Harsh Motwani <harsh.motwani@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… without optional dependencies ### What changes were proposed in this pull request? This PR is a followup of apache#48587 that proposes to make pyspark-ml tests pass without optional dependencies ### Why are the changes needed? To make the tests pass without optional dependencies. See https://github.com/apache/spark/actions/runs/11447673972/job/31849621508 ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran it locally ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48606 from HyukjinKwon/SPARK-50064-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…oom filter in BloomFilterBenchmark ### What changes were proposed in this pull request? Parquet's AdaptiveBlockSplitBloomFilter is a technique for generating a bloom filter with the optimal bit size according to the number of distinct real data values. It does not come for free, because it uses multiple BloomFilter candidates at runtime, which could increase CPU usage or time. This pull request adds benchmark cases to compare with those that use the default BloomFilter size. ### Why are the changes needed? Improve benchmark coverage for common user-oriented features of the Parquet data source ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? benchmarking golden files attached ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48609 from yaooqinn/SPARK-50080. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…eReplaceable) ### What changes were proposed in this pull request? The pr aims to add `Codegen` Support for `schema_of_csv`. ### Why are the changes needed? - improve codegen coverage. - simplified code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA & Existed UT (eg: CsvFunctionsSuite#`*schema_of_csv*`) ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48595 from panbingkun/SPARK-50067. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
…ROR_TEMP_0038`: `DUPLICATED_CTE_NAMES` ### What changes were proposed in this pull request? This PR proposes to assign proper error condition & sqlstate for `_LEGACY_ERROR_TEMP_0038`: `DUPLICATED_CTE_NAMES` ### Why are the changes needed? To improve the error message by assigning proper error condition and SQLSTATE ### Does this PR introduce _any_ user-facing change? No, only user-facing error message improved ### How was this patch tested? Updated the existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48607 from itholic/LEGACY_0038. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? Earlier, runtime errors in underlying libraries were not caught during runtime in the RegExpReplace expression. The underlying errors were thrown directly to the user. For example, it wouldn't be uncommon to see issues like `java.lang.IndexOutOfBoundsException: No group 3`. This PR introduces a change to catch these underlying issues and throw a SparkException instead which details the input on which the exception failed. The new Spark Exception looks something like `org.apache.spark.SparkException: Could not perform regexp_replace for source = <source>, pattern = <pattern>, replacement = <replacement> and position = <position>`. ### Why are the changes needed? Two reasons. First, the new exception details which row the given error occurred on, which makes it easier for the user to debug the query or Spark developers to identify bugs. Second, a Spark Exception is generally considered expected behavior indicating that there were no unintended issues in the query's execution. ### Does this PR introduce _any_ user-facing change? Yes, a better exception is thrown when RegExpReplace fails. ### How was this patch tested? Unit test in both codegen as well as interpreted mode. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48379 from harshmotw-db/harshmotw-db/regexp_replace_fix. Lead-authored-by: Harsh Motwani <harsh.motwani@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
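A minimal, self-contained illustration of the failure mode and the wrap-and-rethrow idea described above (simplified; the real change lives inside the `RegExpReplace` expression and raises a `SparkException` with an error class rather than a plain `RuntimeException`).

```scala
object RegexpReplaceWrapExample extends App {
  def regexpReplaceOrExplain(source: String, pattern: String, replacement: String): String =
    try {
      source.replaceAll(pattern, replacement)
    } catch {
      // e.g. java.lang.IndexOutOfBoundsException: No group 3
      case e: RuntimeException =>
        throw new RuntimeException(
          s"Could not perform regexp_replace for source = $source, pattern = $pattern " +
            s"and replacement = $replacement", e)
    }

  try regexpReplaceOrExplain("abc", "(a)(b)", "$3")
  catch { case e: RuntimeException => println(e.getMessage) } // error now carries the failing input
}
```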
…d `datepart` ### What changes were proposed in this pull request? Fix the type hint for `extract`, `date_part` and `datepart` ### Why are the changes needed? argument `field` never supports column name: ```python In [6]: df = spark.createDataFrame([(datetime.datetime(2015, 4, 8, 13, 8, 15), "YEAR",)], ['ts', 'field']) In [7]: df.select(sf.extract("field", "ts")) ... AnalysisException: [NON_FOLDABLE_ARGUMENT] The function `extract` requires the parameter `field` to be a foldable expression of the type "STRING", but the actual argument is a non-foldable. SQLSTATE: 42K08 ``` ### Does this PR introduce _any_ user-facing change? yes, doc only change ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48613 from zhengruifeng/fix_extract_hint. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request? This PR updates the suggested fix of the `INVALID_URL` error to use the `try_parse_url` function added in [this](apache#48500) PR instead of turning off ANSI mode. ### Why are the changes needed? INVALID_URL contains a suggested fix for turning off ANSI mode. Now that in Spark 4.0.0 we have moved to ANSI mode on by default, we want to keep suggestions of this kind to a minimum. There exist implementations of try_* functions which provide a safe way to get the behavior of ANSI mode off, and suggestions of this kind should be sufficient. In this case, try expressions were missing, so new expressions were added to patch up the missing implementations. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? There are tests that check error messages. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48616 from jovanm-db/improvedError. Authored-by: Jovan Markovic <jovan.markovic@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
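A brief sketch of the suggested fix in action, assuming the `try_parse_url` SQL function from apache#48500 is available in the session (local SparkSession shown only for illustration): the try variant yields NULL for a malformed URL instead of raising INVALID_URL under ANSI mode.

```scala
import org.apache.spark.sql.SparkSession

object TryParseUrlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("try-parse-url").getOrCreate()

    // parse_url would raise INVALID_URL here with ANSI mode on;
    // try_parse_url returns NULL instead.
    spark.sql("SELECT try_parse_url('not a url', 'HOST') AS host").show()

    spark.stop()
  }
}
```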
…ROR_TEMP_3168`: `MISSING_TIMEOUT_CONFIGURATION` ### What changes were proposed in this pull request? This PR proposes to assign proper error condition & sqlstate for `_LEGACY_ERROR_TEMP_3168`: `MISSING_TIMEOUT_CONFIGURATION` ### Why are the changes needed? To improve the error message by assigning proper error condition and SQLSTATE ### Does this PR introduce _any_ user-facing change? No, only user-facing error message improved ### How was this patch tested? Updated the existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48620 from itholic/LEGACY_3168. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request? Refactoring of `UnresolvedStarBase.expand(...)` ### Why are the changes needed? Refactoring is needed for the Single-pass Analyzer project (please check [link](https://issues.apache.org/jira/browse/SPARK-49834)) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? Github Copilot Closes apache#48619 from mihailoale-db/mihailoale-db/refactorstarexpand. Authored-by: Mihailo Aleksic <mihailo.aleksic@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
… functions ### What changes were proposed in this pull request? Refine the docstring of multiple datetime functions ### Why are the changes needed? for better documentation and doctest coverage ### Does this PR introduce _any_ user-facing change? doc change ### How was this patch tested? new doctests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48617 from zhengruifeng/py_date_func_i. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit that referenced this pull request on Feb 28, 2025
…anRelationPushDown ### What changes were proposed in this pull request? Add the timezone information to a cast expression when the destination type requires it. ### Why are the changes needed? When current_timestamp() is materialized as a string, the timezone information is gone (e.g., 2024-12-27 10:26:27.684158) which prohibits further optimization rules from being applied to the affected data source. For example, ``` Project [1735900357973433#10 AS current_timestamp()#6] +- 'Project [cast(2025-01-03 10:32:37.973433#11 as timestamp) AS 1735900357973433#10] +- RelationV2[2025-01-03 10:32:37.973433#11] xxx ``` -> This query fails to execute because the injected cast expression lacks the timezone information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#49549 from changgyoopark-db/SPARK-50870. Authored-by: changgyoopark-db <changgyoo.park@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
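A small sketch of the idea using Catalyst's expression API (simplified and hedged; the actual rule wiring in V2ScanRelationPushDown differs): a cast from the materialized string back to a timestamp only resolves once it carries a time-zone id.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

object CastTimeZoneExample {
  def main(args: Array[String]): Unit = {
    val source = Literal("2025-01-03 10:32:37.973433")

    val withoutTz = Cast(source, TimestampType)              // timeZoneId = None
    val withTz    = Cast(source, TimestampType, Some("UTC")) // time zone attached

    println(withoutTz.resolved) // false -- a timezone-aware cast needs a zone id
    println(withTz.resolved)    // true
  }
}
```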
Labels
DO NOT MERGE