[SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values backport to 2.3 #22228

huaxingao · 2018-08-24T21:14:46Z

What changes were proposed in this pull request?

In feature.py, `VectorSizeHint.setSize` and `VectorSizeHint.getSize` do not return a value. This PR adds the missing `return` statements.
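To illustrate why the missing `return` matters, here is a minimal sketch that does not use PySpark. `SizeHintSketch` and its method names are hypothetical stand-ins, not the code in feature.py; the sketch only assumes the usual convention that a setter returns `self` so calls can be chained.

```python
class SizeHintSketch:
    """Hypothetical stand-in for a Params-style class; not the PySpark source."""

    def __init__(self):
        self._size = None

    def set_size_buggy(self, value):
        # Bug pattern: the value is stored but nothing is returned,
        # so the caller receives None and cannot chain further calls.
        self._size = value

    def get_size_buggy(self):
        # Bug pattern: the stored value is looked up and then discarded.
        self._size

    def set_size(self, value):
        # Fixed: return self so that setter calls can be chained.
        self._size = value
        return self

    def get_size(self):
        # Fixed: return the stored value.
        return self._size


hint = SizeHintSketch()
buggy_setter_result = hint.set_size_buggy(3)
buggy_getter_result = hint.get_size_buggy()
fixed_size = SizeHintSketch().set_size(3).get_size()
print(buggy_setter_result, buggy_getter_result, fixed_size)
```

With the buggy variants, both calls evaluate to `None`, so a chained call such as `hint.set_size_buggy(3).get_size()` would raise an `AttributeError`. The fixed variants behave as callers expect.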

How was this patch tested?

Added a unit test.

huaxingao · 2018-08-24T21:16:09Z

@jkbradley backport to 2.3.

SparkQA · 2018-08-24T21:40:43Z

Test build #95230 has finished for PR 22228 at commit 35092f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-08-24T22:40:02Z

Awesome, thank you!
LGTM
Merging with branch-2.3

…ues backport to 2.3

## What changes were proposed in this pull request?

In feature.py, VectorSizeHint setSize and getSize don't return value. Add return.

(Please fill in changes proposed in this fix)

## How was this patch tested?

Unit Test added

Closes #22228 from huaxingao/spark-25124-2.3.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

* [PYSPARK] Updates to Accumulators

(cherry picked from commit 15fc237)

* [SPARK-24987][SS] - Fix Kafka consumer leak when no new offsets for TopicPartition

## What changes were proposed in this pull request?

This small fix adds a `consumer.release()` call to `KafkaSourceRDD` in the case where we've retrieved offsets from Kafka, but the `fromOffset` is equal to the `lastOffset`, meaning there is no new data to read for a particular topic partition. Up until now, we'd just return an empty iterator without closing the consumer, which would cause a FD leak.

If accepted, this pull request should be merged into master as well.

## How was this patch tested?

Haven't run any specific tests, would love help on how to test methods running inside `RDD.compute`.

Author: Yuval Itzchakov <yuval.itzchakov@clicktale.com>

Closes apache#21997 from YuvalItzchakov/master.

(cherry picked from commit b7fdf8e)
Signed-off-by: cody koeninger <cody@koeninger.org>

* [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7

## What changes were proposed in this pull request?

Update Hadoop 2.7 to 2.7.7 to pull in bug and security fixes.

## How was this patch tested?

Existing tests.

Author: Sean Owen <srowen@gmail.com>

Closes apache#21987 from srowen/SPARK-25015.

(cherry picked from commit 5f9633d)
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-24948][SHS][BACKPORT-2.3] Delegate check access permissions to the file system

## What changes were proposed in this pull request?

In `SparkHadoopUtil.checkAccessPermission`, we consider only basic permissions in order to check whether a user can access a file or not. This is not a complete check, as it ignores ACLs and other policies a file system may apply internally. So this can result in wrongly reporting that a user cannot access a file (even though they actually can).

The PR proposes to delegate to the filesystem the check whether a file is accessible or not, in order to return the right result. A caching layer is added for performance reasons.

## How was this patch tested?

added UT

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#22021 from mgaido91/SPARK-24948_2.3.

* [MINOR][BUILD] Update Jetty to 9.3.24.v20180605

Update Jetty to 9.3.24.v20180605 to pick up security fix

Existing tests.

Closes apache#22055 from srowen/Jetty9324.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit eb9a696)
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-25076][SQL] SQLConf should not be retrieved from a stopped SparkSession

## What changes were proposed in this pull request?

When a `SparkSession` is stopped, `SQLConf.get` should use the fallback conf to avoid weird issues like

```
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
	at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
	at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
	at scala.Option.getOrElse(Option.scala:121)
...
```

## How was this patch tested?

a new test suite

Closes apache#22056 from cloud-fan/session.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
(cherry picked from commit fec67ed)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

* [SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

## What changes were proposed in this pull request?

- Update DateTimeUtilsSuite so that when testing roundtripping in daysToMillis and millisToDays multiple skipdates can be specified.
- Updated test so that both new years eve 2014 and new years day 2015 are skipped for kiribati time zones. This is necessary as java versions pre 181-b13 considered new years day 2015 to be skipped while subsequent versions corrected this to new years eve.

## How was this patch tested?

Unit tests

Author: Chris Martin <chris@cmartinit.co.uk>

Closes apache#21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures.

(cherry picked from commit c5b8d54)
Signed-off-by: Sean Owen <srowen@gmail.com>

* Preparing Spark release v2.3.2-rc4

* Preparing development version 2.3.3-SNAPSHOT

* [MINOR][BUILD] Add ECCN notice required by http://www.apache.org/dev/crypto.html

Add ECCN notice required by http://www.apache.org/dev/crypto.html
See https://issues.apache.org/jira/browse/LEGAL-398

This should probably be backported to 2.3, 2.2, as that's when the key dep (commons crypto) turned up. BC is actually unused, but still there.

N/A

Closes apache#22064 from srowen/ECCN.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 91cdab5)
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-25081][CORE] Nested spill in ShuffleExternalSorter should not access released memory page

## What changes were proposed in this pull request?

This issue is pretty similar to [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907).

"allocateArray" in [ShuffleInMemorySorter.reset](https://github.com/apache/spark/blob/9b8521e53e56a53b44c02366a99f8a8ee1307bbf/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java#L99) may trigger a spill and cause ShuffleInMemorySorter to access the released `array`. Another task may get the same memory page from the pool. This will cause two tasks to access the same memory page. When a task reads memory written by another task, many types of failures may happen. Here are some examples I have seen:

- JVM crash. (This is easy to reproduce in a unit test as we fill newly allocated and deallocated memory with 0xa5 and 0x5a bytes which usually points to an invalid memory address)
- java.lang.IllegalArgumentException: Comparison method violates its general contract!
- java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)
- java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size -536870912 because the size after growing exceeds size limitation 2147483632

This PR resets states in `ShuffleInMemorySorter.reset` before calling `allocateArray` to fix the issue.

## How was this patch tested?

The new unit test will make JVM crash without the fix.

Closes apache#22062 from zsxwing/SPARK-25081.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
(cherry picked from commit f5aba65)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

* [SPARK-24908][R][STYLE] removing spaces to make lintr happy

## What changes were proposed in this pull request?

during my travails in porting spark builds to run on our centos worker, i managed to recreate (as best i could) the centos environment on our new ubuntu-testing machine.

while running my initial builds, lintr was crashing on some extraneous spaces in test_basic.R (see: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)

after removing those spaces, the ubuntu build happily passed the lintr tests.

## How was this patch tested?

i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build (see https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/), which scp'ed a copy of test_basic.R in to the repo after the git clone. everything seems to be working happily.

Author: shane knapp <incomplete@gmail.com>

Closes apache#21864 from shaneknapp/fixing-R-lint-spacing.

(cherry picked from commit 3efdf35)
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on multiple columns (wrap in brackets) may lead to codegen issue

## What changes were proposed in this pull request?

Backport apache#22066 to branch-2.3

Use different API in 2.3 here

```scala
|${ctx.JAVA_INT} $childResult = 0;
```

"distribute by" on multiple columns (wrap in brackets) may lead to codegen issue.

Simple way to reproduce:

```scala
val df = spark.range(1000)
val columns = (0 until 400).map{ i => s"id as id$i" }
val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",")
df.selectExpr(columns : _*).createTempView("test")
spark.sql(s"select * from test distribute by ($distributeExprs)").count()
```

## How was this patch tested?

UT in Jenkins

Closes apache#22077 from LantaoJin/SPARK-25084_2.3.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values

## What changes were proposed in this pull request?

`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with a NPE if a partition column contains a NULL value.

The PR avoids the NPE, replacing the `NULL` values with the default partition placeholder.

## How was this patch tested?

added UT

Closes apache#22036 from mgaido91/SPARK-25028.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c220cc4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* Preparing Spark release v2.3.2-rc5

* Preparing development version 2.3.3-SNAPSHOT

* [MINOR][SQL][DOC] Fix `to_json` example in function description and doc

## What changes were proposed in this pull request?

This PR fixes an example for `to_json` in doc and function description.

- http://spark.apache.org/docs/2.3.0/api/sql/#to_json
- `describe function extended`

## How was this patch tested?

Pass the Jenkins with the updated test.

Closes apache#22096 from dongjoon-hyun/minor_json.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit e2ab7de)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

* [SPARK-25051][SQL] FixNullability should not stop on AnalysisBarrier

## What changes were proposed in this pull request?

The introduction of `AnalysisBarrier` prevented `FixNullability` from going through all the nodes. This introduced a bug, which can lead to wrong results, as the nullability of the output attributes of an outer join can be wrong.

The PR makes `FixNullability` go through the `AnalysisBarrier`s.

## How was this patch tested?

added UT

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#22102 from mgaido91/SPARK-25051.

* [MINOR][DOC][SQL] use one line for annotation arg value

## What changes were proposed in this pull request?

Put annotation args in one line, or API doc generation will fail.

~~~
[error] /Users/meng/src/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:1559: annotation argument needs to be a constant; found: "_FUNC_(expr) - Returns the character length of string data or number of bytes of ".+("binary data. The length of string data includes the trailing spaces. The length of binary ").+("data includes binary zeros.")
[error]     "binary data. The length of string data includes the trailing spaces. The length of binary " +
[error]     ^
[info] No documentation generated with unsuccessful compiler run
[error] one error found
[error] (catalyst/compile:doc) Scaladoc generation failed
[error] Total time: 27 s, completed Aug 17, 2018 3:20:08 PM
~~~

## How was this patch tested?

sbt catalyst/compile:doc passed

Closes apache#22137 from mengxr/minor-doc-fix.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit f454d52)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

* [SPARK-25144][SQL][TEST][BRANCH-2.3] Free aggregate map when task ends

## What changes were proposed in this pull request?

[SPARK-25144](https://issues.apache.org/jira/browse/SPARK-25144) reports memory leaks on Apache Spark 2.0.2 ~ 2.3.2-RC5.

```scala
scala> case class Foo(bar: Option[String])
scala> val ds = List(Foo(Some("bar"))).toDS
scala> val result = ds.flatMap(_.bar).distinct
scala> result.rdd.isEmpty
18/08/19 23:01:54 WARN Executor: Managed memory leak detected; size = 8650752 bytes, TID = 125
res0: Boolean = false
```

This is a backport of cloud-fan's apache#21738, which is a single commit among 3 commits of SPARK-21743. In addition, I added a test case to prevent regressions in branch-2.3 and branch-2.2. Although SPARK-21743 is reverted due to regression, this subpatch can go to branch-2.3 and branch-2.2. This will be merged as cloud-fan's commit.

## How was this patch tested?

Pass the jenkins with a newly added test case.

Closes apache#22150 from dongjoon-hyun/SPARK-25144.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

* [DOCS] Fixed NDCG formula issues

When j is 0, log(j+1) will be 0, and this leads to division by 0 issue.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#22090 from yueguoguo/patch-1.

Authored-by: Zhang Le <yueguoguo@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 219ed7b)
Signed-off-by: Sean Owen <sean.owen@databricks.com>

* [SPARK-25114][CORE] Fix RecordBinaryComparator when subtraction between two words is divisible by Integer.MAX_VALUE.

apache#22079 (comment)

It is possible for two objects to be unequal and yet we consider them as equal with this code, if the long values are separated by Int.MaxValue. This PR fixes the issue.

Add new test cases in `RecordBinaryComparatorSuite`.

Closes apache#22101 from jiangxb1987/fix-rbc.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
(cherry picked from commit 4fb96e5)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

* [SPARK-25114][2.3][CORE][FOLLOWUP] Fix RecordBinaryComparatorSuite build failure

## What changes were proposed in this pull request?

Fix RecordBinaryComparatorSuite build failure

## How was this patch tested?

Existing tests.

Closes apache#22166 from jiangxb1987/SPARK-25114-2.3.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

* [SPARK-25205][CORE] Fix typo in spark.network.crypto.keyFactoryIterations

Closes apache#22195 from squito/SPARK-25205.

Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit 0ce09ec)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

* [SPARK-25234][SPARKR] avoid integer overflow in parallelize

## What changes were proposed in this pull request?

`parallelize` uses integer multiplication to determine the split indices. It might cause integer overflow.

## How was this patch tested?

unit test

Closes apache#22225 from mengxr/SPARK-25234.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
(cherry picked from commit 9714fa5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

* [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3

## What changes were proposed in this pull request?

In feature.py, VectorSizeHint setSize and getSize don't return value. Add return.

(Please fill in changes proposed in this fix)

## How was this patch tested?

Unit Test added

Closes apache#22228 from huaxingao/spark-25124-2.3.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

wangyum · 2018-09-04T14:56:57Z

This is already merged. @huaxingao, could you please close this PR?

[SPARK-25124][ML]VectorSizeHint setSize and getSize don't return valu…

35092f0

…es backport to 2.3

huaxingao closed this Sep 4, 2018
