forked from apache/spark
Update upstream #35
Merged

Conversation
…ntext when stopping it

## What changes were proposed in this pull request?

To better understand this problem, let's take a look at an example first:

```
object Main {
  def main(args: Array[String]): Unit = {
    var t = new Test
    new Thread(new Runnable {
      override def run() = {}
    }).start()
    println("first thread finished")

    t.a = null
    t = new Test
    new Thread(new Runnable {
      override def run() = {}
    }).start()
  }
}

class Test {
  var a = new InheritableThreadLocal[String] {
    override protected def childValue(parent: String): String = {
      println("parent value is: " + parent)
      parent
    }
  }
  a.set("hello")
}
```

The result is:

```
parent value is: hello
first thread finished
parent value is: hello
parent value is: hello
```

Once an `InheritableThreadLocal` has had its value set, child threads will inherit that value for as long as the thread-local has not been GCed, so setting the variable that holds the `InheritableThreadLocal` to `null` doesn't work as we expected.

In `SparkContext`, we have an `InheritableThreadLocal` for local properties. We should clear it when stopping `SparkContext`, or all future child threads will still inherit it, copy the properties, and waste memory. This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548, which creates/stops `SparkContext` many times, ends up with a lot of live `InheritableThreadLocal` instances, and hits OOM when starting new threads in the internal thread pools.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17833 from cloud-fan/core.
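For reference, a minimal sketch of the kind of cleanup the commit message argues for: explicitly clearing the `InheritableThreadLocal` on shutdown rather than just dropping the reference to it, so that threads started afterwards no longer inherit its value. The `LocalProperties` class and its `stop()` method below are illustrative, not the actual `SparkContext` code.

```scala
import java.util.Properties

// Illustrative holder for per-thread properties, modelled loosely on the
// behavior described above; the names here are hypothetical.
class LocalProperties {
  private val props = new InheritableThreadLocal[Properties] {
    // Child threads snapshot the parent's properties at thread-creation time.
    override protected def childValue(parent: Properties): Properties =
      if (parent != null) parent.clone().asInstanceOf[Properties] else new Properties()
    override protected def initialValue(): Properties = new Properties()
  }

  def set(key: String, value: String): Unit = props.get().setProperty(key, value)

  // The point of the fix: on shutdown, remove the value from the stopping thread
  // instead of only nulling out the reference, so threads created later by that
  // thread do not inherit (and copy) stale properties.
  def stop(): Unit = props.remove()
}
```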
It is not valid to eagerly bind with the child's output as this causes failures when we attempt to canonicalize the plan (replacing the attribute references with dummies).

Author: Michael Armbrust <michael@databricks.com>

Closes #17838 from marmbrus/fixBindExplode.
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only). Based on #7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <manojkumarsivaraj334@gmail.com>
Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.
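For orientation, a small sketch of the underlying Scala `RowMatrix` API that the new PySpark wrappers expose; the matrix values are made up for illustration and `sc` is assumed to be an existing `SparkContext`.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a small distributed row matrix (values are arbitrary).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

// SVD: keep the top 2 singular values; computeU = true also materializes U.
val svd = mat.computeSVD(2, computeU = true)

// PCA: project the rows onto the top 2 principal components.
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)
```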
## What changes were proposed in this pull request?

Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17803 from srowen/SPARK-20523.
## What changes were proposed in this pull request?

Use midpoints for split values now; this may later be made weighted.

## How was this patch tested?

- [x] add unit test.
- [x] revise Split's unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
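A tiny worked illustration of the midpoint idea (feature values are arbitrary): candidate thresholds are taken halfway between adjacent distinct feature values instead of at the values themselves.

```scala
// Continuous feature values observed for one feature, made distinct and sorted.
val featureValues = Array(8.0, 1.0, 4.0, 2.0).distinct.sorted // -> 1.0, 2.0, 4.0, 8.0

// Candidate split thresholds at the midpoints of adjacent values.
val splits = featureValues.sliding(2).map { case Array(lo, hi) => (lo + hi) / 2.0 }.toArray
// splits: Array(1.5, 3.0, 6.0)
```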
…treamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referenced multiple times – e.g. `df.union(df)` – we should transform it into only one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of sources, source logs, ...).

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17735 from lw-lin/SPARK-20441.
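A sketch of the query shape in question; the `rate` source is used purely for illustration and `spark` is an assumed `SparkSession`.

```scala
// The same streaming DataFrame referenced twice within one query.
val df = spark.readStream.format("rate").load()
val unioned = df.union(df)

// With this patch, both references resolve to a single StreamingExecutionRelation,
// i.e. one source instance and one set of source logs for the query.
```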
## What changes were proposed in this pull request?

We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (`/*+ ... */`), DataFrame doesn't have one, and users are sometimes confused because they can't find out how to apply a broadcast hint. This ticket adds a generic `hint` function on DataFrame that allows using the same hint on DataFrames as well as SQL.

As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new `hint` function:

```
df1.join(df2.hint("broadcast"))
```

## How was this patch tested?

Added a test case in DataFrameJoinSuite.

Author: Reynold Xin <rxin@databricks.com>

Closes #17839 from rxin/SPARK-20576.
… when reading FileStreamSink's output

## The Problem

Right now the DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output:

```
[info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
[info]   ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
[info]   ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
[info]
[info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
[info]   at scala.Predef$.assert(Predef.scala:170)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
[info]   at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
[info]   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
[info]   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
[info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
[info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
[info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
```

## What changes were proposed in this pull request?

This patch alters `InMemoryFileIndex` to filter out those `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following and other similar dirs or files will be filtered out:

- (introduced by globbing `basePath/*`)
  - `basePath/_spark_metadata`
- (introduced by globbing `basePath/*/*`)
  - `basePath/_spark_metadata/0`
  - `basePath/_spark_metadata/1`
  - ...

## How was this patch tested?

Added unit tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17346 from lw-lin/filter-metadata.
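To make the scenario concrete, a hedged sketch of the write-then-batch-read pattern involved; paths, the partition column, and the `rate` source are illustrative, and `spark` is an assumed `SparkSession`.

```scala
// Streaming write: produces partition directories plus a _spark_metadata dir.
val streamingDF = spark.readStream.format("rate").load()
val query = streamingDF.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/stream.checkpoint")
  .partitionBy("value")
  .start("/tmp/stream.output")

// Batch read over the partition directories; before this patch, globbing could
// pull in _spark_metadata and break partition inference, even with basePath set.
val batchDF = spark.read
  .option("basePath", "/tmp/stream.output")
  .parquet("/tmp/stream.output/*")
```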
…test and add a test for =!=

## What changes were proposed in this pull request?

This PR proposes three things as below:

- This test does not appear to test `<=>` and is identical to the `===` test above, so it removes the test.

```diff
- test("<=>") {
-   checkAnswer(
-     testData2.filter($"a" === 1),
-     testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
-
-   checkAnswer(
-     testData2.filter($"a" === $"b"),
-     testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
- }
```

- Rename the test from `=!=` to `<=>`, since it appears to actually test `<=>`.

```diff
+ private lazy val nullData = Seq(
+   (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
  ...
- test("=!=") {
+ test("<=>") {
-   val nullData = spark.createDataFrame(sparkContext.parallelize(
-     Row(1, 1) ::
-     Row(1, 2) ::
-     Row(1, null) ::
-     Row(null, null) :: Nil),
-     StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))))
-
    checkAnswer(
      nullData.filter($"b" <=> 1),
  ...
```

- Add tests for `=!=`, which do not appear to exist.

```diff
+ test("=!=") {
+   checkAnswer(
+     nullData.filter($"b" =!= 1),
+     Row(1, 2) :: Nil)
+
+   checkAnswer(nullData.filter($"b" =!= null), Nil)
+
+   checkAnswer(
+     nullData.filter($"a" =!= $"b"),
+     Row(1, 2) :: Nil)
+ }
```

## How was this patch tested?

Manually running the tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17842 from HyukjinKwon/minor-test-fix.
## What changes were proposed in this pull request?

Adds `hint` method to PySpark `DataFrame`.

## How was this patch tested?

Unit tests, doctests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17850 from zero323/SPARK-20584.
## What changes were proposed in this pull request?

General rule on whether to skip a test or not: skip if

- RDD tests
- tests that could run long or are complicated (streaming, hivecontext)
- tests on error conditions
- tests that won't likely change/break

## How was this patch tested?

unit tests, `R CMD check --as-cran`, `R CMD check`

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17817 from felixcheung/rskiptest.
GulajavaMinistudio pushed a commit that referenced this pull request on Apr 30, 2024:
… spark docker image

### What changes were proposed in this pull request?

The PR aims to update the names of the packages removed while building the Spark docker image.

### Why are the changes needed?

When our default image base was switched from `ubuntu 20.04` to `ubuntu 22.04`, the set of unused packages in the base image changed. To eliminate some warnings when building images and to free disk space more accurately, we need to correct the list.

Before:

```
#35 [29/31] RUN apt-get remove --purge -y '^aspnet.*' '^dotnet-.*' '^llvm-.*' 'php.*' '^mongodb-.*' snapd google-chrome-stable microsoft-edge-stable firefox azure-cli google-cloud-sdk mono-devel powershell libgl1-mesa-dri || true
#35 0.489 Reading package lists...
#35 0.505 Building dependency tree...
#35 0.507 Reading state information...
#35 0.511 E: Unable to locate package ^aspnet.*
#35 0.511 E: Couldn't find any package by glob '^aspnet.*'
#35 0.511 E: Couldn't find any package by regex '^aspnet.*'
#35 0.511 E: Unable to locate package ^dotnet-.*
#35 0.511 E: Couldn't find any package by glob '^dotnet-.*'
#35 0.511 E: Couldn't find any package by regex '^dotnet-.*'
#35 0.511 E: Unable to locate package ^llvm-.*
#35 0.511 E: Couldn't find any package by glob '^llvm-.*'
#35 0.511 E: Couldn't find any package by regex '^llvm-.*'
#35 0.511 E: Unable to locate package ^mongodb-.*
#35 0.511 E: Couldn't find any package by glob '^mongodb-.*'
#35 0.511 EPackage 'php-crypt-gpg' is not installed, so not removed
#35 0.511 Package 'php' is not installed, so not removed
#35 0.511 : Couldn't find any package by regex '^mongodb-.*'
#35 0.511 E: Unable to locate package snapd
#35 0.511 E: Unable to locate package google-chrome-stable
#35 0.511 E: Unable to locate package microsoft-edge-stable
#35 0.511 E: Unable to locate package firefox
#35 0.511 E: Unable to locate package azure-cli
#35 0.511 E: Unable to locate package google-cloud-sdk
#35 0.511 E: Unable to locate package mono-devel
#35 0.511 E: Unable to locate package powershell
#35 DONE 0.5s

#36 [30/31] RUN apt-get autoremove --purge -y
#36 0.063 Reading package lists...
#36 0.079 Building dependency tree...
#36 0.082 Reading state information...
#36 0.088 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
#36 DONE 0.4s
```

After:

```
#38 [32/36] RUN apt-get remove --purge -y 'gfortran-11' 'humanity-icon-theme' 'nodejs-doc' || true
#38 0.066 Reading package lists...
#38 0.087 Building dependency tree...
#38 0.089 Reading state information...
#38 0.094 The following packages were automatically installed and are no longer required:
#38 0.094   at-spi2-core bzip2-doc dbus-user-session dconf-gsettings-backend
#38 0.095   dconf-service gsettings-desktop-schemas gtk-update-icon-cache
#38 0.095   hicolor-icon-theme libatk-bridge2.0-0 libatk1.0-0 libatk1.0-data
#38 0.095   libatspi2.0-0 libbz2-dev libcairo-gobject2 libcolord2 libdconf1 libepoxy0
#38 0.095   libgfortran-11-dev libgtk-3-common libjs-highlight.js libllvm11
#38 0.095   libncurses-dev libncurses5-dev libphobos2-ldc-shared98 libreadline-dev
#38 0.095   librsvg2-2 librsvg2-common libvte-2.91-common libwayland-client0
#38 0.095   libwayland-cursor0 libwayland-egl1 libxdamage1 libxkbcommon0
#38 0.095   session-migration tilix-common xkb-data
#38 0.095 Use 'apt autoremove' to remove them.
#38 0.096 The following packages will be REMOVED:
#38 0.096   adwaita-icon-theme* gfortran* gfortran-11* humanity-icon-theme* libgtk-3-0*
#38 0.096   libgtk-3-bin* libgtkd-3-0* libvte-2.91-0* libvted-3-0* nodejs-doc*
#38 0.096   r-base-dev* tilix* ubuntu-mono*
#38 0.248 0 upgraded, 0 newly installed, 13 to remove and 0 not upgraded.
#38 0.248 After this operation, 99.6 MB disk space will be freed.
...
(Reading database ... 70597 files and directories currently installed.)
#38 0.304 Removing r-base-dev (4.1.2-1ubuntu2) ...
#38 0.319 Removing gfortran (4:11.2.0-1ubuntu1) ...
#38 0.340 Removing gfortran-11 (11.4.0-1ubuntu1~22.04) ...
#38 0.356 Removing tilix (1.9.4-2build1) ...
#38 0.377 Removing libvted-3-0:amd64 (3.10.0-1ubuntu1) ...
#38 0.392 Removing libvte-2.91-0:amd64 (0.68.0-1) ...
#38 0.407 Removing libgtk-3-bin (3.24.33-1ubuntu2) ...
#38 0.422 Removing libgtkd-3-0:amd64 (3.10.0-1ubuntu1) ...
#38 0.436 Removing nodejs-doc (12.22.9~dfsg-1ubuntu3.4) ...
#38 0.457 Removing libgtk-3-0:amd64 (3.24.33-1ubuntu2) ...
#38 0.488 Removing ubuntu-mono (20.10-0ubuntu2) ...
#38 0.754 Removing humanity-icon-theme (0.6.16) ...
#38 1.362 Removing adwaita-icon-theme (41.0-1ubuntu1) ...
#38 1.537 Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
#38 1.566 Processing triggers for mailcap (3.70+nmu1ubuntu1) ...
#38 1.577 Processing triggers for libglib2.0-0:amd64 (2.72.4-0ubuntu2.2) ...
(Reading database ... 56946 files and directories currently installed.)
#38 1.645 Purging configuration files for libgtk-3-0:amd64 (3.24.33-1ubuntu2) ...
#38 1.657 Purging configuration files for ubuntu-mono (20.10-0ubuntu2) ...
#38 1.670 Purging configuration files for humanity-icon-theme (0.6.16) ...
#38 1.682 Purging configuration files for adwaita-icon-theme (41.0-1ubuntu1) ...
#38 DONE 1.7s

#39 [33/36] RUN apt-get autoremove --purge -y
#39 0.061 Reading package lists...
#39 0.075 Building dependency tree...
#39 0.077 Reading state information...
#39 0.083 The following packages will be REMOVED:
#39 0.083   at-spi2-core* bzip2-doc* dbus-user-session* dconf-gsettings-backend*
#39 0.083   dconf-service* gsettings-desktop-schemas* gtk-update-icon-cache*
#39 0.083   hicolor-icon-theme* libatk-bridge2.0-0* libatk1.0-0* libatk1.0-data*
#39 0.083   libatspi2.0-0* libbz2-dev* libcairo-gobject2* libcolord2* libdconf1*
#39 0.083   libepoxy0* libgfortran-11-dev* libgtk-3-common* libjs-highlight.js*
#39 0.083   libllvm11* libncurses-dev* libncurses5-dev* libphobos2-ldc-shared98*
#39 0.083   libreadline-dev* librsvg2-2* librsvg2-common* libvte-2.91-common*
#39 0.083   libwayland-client0* libwayland-cursor0* libwayland-egl1* libxdamage1*
#39 0.083   libxkbcommon0* session-migration* tilix-common* xkb-data*
#39 0.231 0 upgraded, 0 newly installed, 36 to remove and 0 not upgraded.
#39 0.231 After this operation, 124 MB disk space will be freed.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Manually test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46258 from panbingkun/remove_packages_on_ubuntu.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
GulajavaMinistudio pushed a commit that referenced this pull request on May 16, 2024:
…dundant SYSTEM password reset

### What changes were proposed in this pull request?

This pull request improves the Oracle JDBC tests by skipping the redundant SYSTEM password reset.

### Why are the changes needed?

These changes are necessary to clean up the Oracle JDBC tests. This pull request effectively reverts the modifications introduced in [SPARK-46592](https://issues.apache.org/jira/browse/SPARK-46592) and [PR apache#44594](apache#44594), which attempted to work around the sporadic occurrence of ORA-65048 and ORA-04021 errors by setting the Oracle parameter DDL_LOCK_TIMEOUT. As discussed in [issue #35](gvenzl/oci-oracle-free#35), setting DDL_LOCK_TIMEOUT did not resolve the issue. The root cause appears to be an Oracle bug or unwanted behavior related to the use of Pluggable Database (PDB) rather than the expected functionality of Oracle itself.

Additionally, with [SPARK-48141](https://issues.apache.org/jira/browse/SPARK-48141), we have upgraded the Oracle version used in the tests to Oracle Free 23ai, version 23.4. This upgrade should help address some of the issues observed with the previous Oracle version.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This patch was tested using the existing test suite, with a particular focus on the Oracle JDBC tests. The following steps were executed:

```
export ENABLE_DOCKER_INTEGRATION_TESTS=1
./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46598 from LucaCanali/fixOracleIntegrationTests.

Lead-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Kent Yao <yao@apache.org>
## What changes were proposed in this pull request?

(Please fill in the changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.