Upstream merge #228

Merged: 127 commits, Jul 14, 2017

Commits
49d767d
[SPARK-18710][ML] Add offset in GLM
actuaryzhang Jun 30, 2017
3c2fc19
[SPARK-18294][CORE] Implement commit protocol to support `mapred` pac…
jiangxb1987 Jun 30, 2017
528c928
[ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite.
yanboliang Jun 30, 2017
1fe08d6
[SPARK-21223] Change fileToAppInfo in FsHistoryProvider to fix concur…
Jun 30, 2017
eed9c4e
[SPARK-21129][SQL] Arguments of SQL function call should not be named…
gatorsmile Jun 30, 2017
fd13255
[SPARK-21052][SQL][FOLLOW-UP] Add hash map metrics to join
viirya Jun 30, 2017
4eb4187
[SPARK-17528][SQL] data should be copied properly before saving into …
cloud-fan Jul 1, 2017
61b5df5
[SPARK-21127][SQL] Update statistics after data changing commands
wzhfy Jul 1, 2017
b1d719e
[SPARK-21273][SQL] Propagate logical plan stats using visitor pattern…
rxin Jul 1, 2017
37ef32e
[SPARK-21275][ML] Update GLM test to use supportedFamilyNames
actuaryzhang Jul 1, 2017
e0b047e
[SPARK-18518][ML] HasSolver supports override
zhengruifeng Jul 1, 2017
6beca9c
[SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throw…
Jul 1, 2017
c605fee
[SPARK-21260][SQL][MINOR] Remove the unused OutputFakerExec
jiangxb1987 Jul 2, 2017
c19680b
[SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to ha…
yanboliang Jul 2, 2017
d410719
[SPARK-18004][SQL] Make sure the date or timestamp related predicate …
SharpRay Jul 3, 2017
d913db1
[SPARK-21250][WEB-UI] Add a url in the table of 'Running Executors' i…
Jul 3, 2017
a9339db
[SPARK-21137][CORE] Spark reads many small files slowly
srowen Jul 3, 2017
eb7a5a6
[TEST] Load test table based on case sensitivity
wzhfy Jul 3, 2017
17bdc36
[SPARK-21102][SQL] Refresh command is too aggressive in parsing
Jul 3, 2017
363bfe3
[SPARK-20073][SQL] Prints an explicit warning message in case of NULL…
maropu Jul 3, 2017
f953ca5
[SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name
cloud-fan Jul 3, 2017
c79c10e
[TEST] Different behaviors of SparkContext Conf when building SparkSe…
gatorsmile Jul 3, 2017
6657e00
[SPARK-21283][CORE] FileOutputStream should be created as append mode
10110346 Jul 4, 2017
a848d55
[SPARK-21264][PYTHON] Call cross join path in join without 'on' and w…
HyukjinKwon Jul 4, 2017
8ca4ebe
[MINOR] Add french stop word "les"
ebuildy Jul 4, 2017
2b1e94b
[MINOR][SPARK SUBMIT] Print out R file usage in spark-submit
HyukjinKwon Jul 4, 2017
d492cc5
[SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in …
HyukjinKwon Jul 4, 2017
29b1f6b
[SPARK-21256][SQL] Add withSQLConf to Catalyst Test
gatorsmile Jul 4, 2017
a3c29fc
[SPARK-19726][SQL] Faild to insert null timestamp value to mysql usin…
Jul 4, 2017
1b50e0e
[SPARK-20256][SQL] SessionState should be created more lazily
dongjoon-hyun Jul 4, 2017
4d6d819
[SPARK-21268][MLLIB] Move center calculations to a distributed map in…
gjgd Jul 4, 2017
cec3921
[SPARK-20889][SPARKR] Grouped documentation for WINDOW column methods
actuaryzhang Jul 4, 2017
daabf42
[MINOR][SPARKR] ignore Rplots.pdf test output after running R tests
wangmiao1981 Jul 4, 2017
de14086
[SPARK-21295][SQL] Use qualified names in error message for missing r…
gatorsmile Jul 5, 2017
ce10545
[SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key pr…
ueshin Jul 5, 2017
e9a93f8
[SPARK-20889][SPARKR][FOLLOWUP] Clean up grouped doc for column methods
actuaryzhang Jul 5, 2017
f2c3b1d
[SPARK-21304][SQL] remove unnecessary isNull variable for collection …
cloud-fan Jul 5, 2017
a386432
[SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke` and modify …
ueshin Jul 5, 2017
4852b7d
[SPARK-21310][ML][PYSPARK] Expose offset in PySpark
actuaryzhang Jul 5, 2017
873f3ad
[SPARK-16167][SQL] RowEncoder should preserve array/map type nullabil…
ueshin Jul 5, 2017
5787ace
[SPARK-20383][SQL] Supporting Create [temporary] Function with the ke…
Jul 5, 2017
e3e2b5d
[SPARK-21286][TEST] Modified StorageTabSuite unit test
Geek-He Jul 5, 2017
960298e
[SPARK-20858][DOC][MINOR] Document ListenerBus event queue size
sadikovi Jul 5, 2017
742da08
[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Sup…
zjffdu Jul 5, 2017
c8e7f44
[SPARK-21307][SQL] Remove SQLConf parameters from the parser-related …
gatorsmile Jul 5, 2017
c8d0aba
[SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
dongjoon-hyun Jul 5, 2017
ab866f1
[SPARK-21248][SS] The clean up codes in StreamExecution should not be…
zsxwing Jul 6, 2017
75b168f
[SPARK-21308][SQL] Remove SQLConf parameters from the optimizer
gatorsmile Jul 6, 2017
14a3bb3
[SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream
Jul 6, 2017
60043f2
[SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp check…
tdas Jul 6, 2017
5800144
[SPARK-21012][SUBMIT] Add glob support for resources adding to Spark
jerryshao Jul 6, 2017
6ff05a6
[SPARK-20703][SQL] Associate metrics with data writes onto DataFrameW…
viirya Jul 6, 2017
b8e4d56
[SPARK-21324][TEST] Improve statistics test suites
wzhfy Jul 6, 2017
d540dfb
[SPARK-21273][SQL][FOLLOW-UP] Add missing test cases back and revise …
gengliangwang Jul 6, 2017
565e7a8
[SPARK-20950][CORE] add a new config to diskWriteBufferSize which is …
heary-cao Jul 6, 2017
26ac085
[SPARK-21228][SQL] InSet incorrect handling of structs
bogdanrdc Jul 6, 2017
48e44b2
[SPARK-21204][SQL] Add support for Scala Set collection types in seri…
viirya Jul 6, 2017
bf66335
[SPARK-21323][SQL] Rename plans.logical.statsEstimation.Range to Valu…
gengliangwang Jul 6, 2017
0217dfd
[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation
tdas Jul 7, 2017
40c7add
[SPARK-20946][SQL] Do not update conf for existing SparkContext in Sp…
cloud-fan Jul 7, 2017
e5bb261
[SPARK-21329][SS] Make EventTimeWatermarkExec explicitly UnaryExecNode
jaceklaskowski Jul 7, 2017
d451b7f
[SPARK-21326][SPARK-21066][ML] Use TextFileFormat in LibSVMFileFormat…
HyukjinKwon Jul 7, 2017
53c2eb5
[SPARK-21327][SQL][PYSPARK] ArrayConstructor should handle an array o…
ueshin Jul 7, 2017
c09b31e
[SPARK-21217][SQL] Support ColumnVector.Array.to<type>Array()
kiszk Jul 7, 2017
5df99bd
[SPARK-20703][SQL][FOLLOW-UP] Associate metrics with data writes onto…
viirya Jul 7, 2017
7fcbb9b
[SPARK-21313][SS] ConsoleSink's string representation
jaceklaskowski Jul 7, 2017
56536e9
[SPARK-21285][ML] VectorAssembler reports the column name of unsuppor…
facaiy Jul 7, 2017
fef0813
[SPARK-21335][SQL] support un-aliased subquery
cloud-fan Jul 7, 2017
fbbe37e
[SPARK-19358][CORE] LiveListenerBus shall log the event name when dro…
CodingCat Jul 7, 2017
a0fe32a
[SPARK-21336] Revise rand comparison in BatchEvalPythonExecSuite
gengliangwang Jul 7, 2017
e1a172c
[SPARK-21100][SQL] Add summary method as alternative to describe that…
aray Jul 8, 2017
7896e7b
[SPARK-21281][SQL] Use string types by default if array and map have …
maropu Jul 8, 2017
9760c15
[SPARK-20379][CORE] Allow SSL config to reference env variables.
Jul 8, 2017
d0bfc67
[SPARK-21069][SS][DOCS] Add rate source to programming guide.
ScrapCodes Jul 8, 2017
a7b46c6
[SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib…
wangmiao1981 Jul 8, 2017
f5f02d2
[SPARK-20456][DOCS] Add examples for functions collection for pyspark
map222 Jul 8, 2017
01f183e
Mesos doc fixes
Jul 8, 2017
330bf5c
[SPARK-20609][MLLIB][TEST] manually cleared 'spark.local.dir' before/…
heary-cao Jul 8, 2017
0b8dd2d
[SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should …
dongjoon-hyun Jul 8, 2017
9fccc36
[SPARK-21083][SQL] Store zero size and row count when analyzing empty…
wzhfy Jul 8, 2017
9131bdb
[SPARK-20342][CORE] Update task accumulators before sending task end …
Jul 8, 2017
062c336
[SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffle…
Jul 8, 2017
c3712b7
[SPARK-21307][REVERT][SQL] Remove SQLConf parameters from the parser-…
gatorsmile Jul 8, 2017
08e0d03
[SPARK-21093][R] Terminate R's worker processes in the parent of R's …
HyukjinKwon Jul 8, 2017
680b33f
[SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedC…
cloud-fan Jul 9, 2017
457dc9c
[MINOR][DOC] Improve the docs about how to correctly set configurations
jerryshao Jul 10, 2017
0e80eca
[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for D…
cloud-fan Jul 10, 2017
96d58f2
[SPARK-21219][CORE] Task retry occurs on same executor due to race co…
Jul 10, 2017
c444d10
[MINOR][DOC] Remove obsolete `ec2-scripts.md`
dongjoon-hyun Jul 10, 2017
647963a
[SPARK-20460][SQL] Make it more consistent to handle column name dupl…
maropu Jul 10, 2017
6a06c4b
[SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFet…
Jul 10, 2017
18b3b00
[SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows
juliuszsompolski Jul 10, 2017
2bfd5ac
[SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dap…
HyukjinKwon Jul 10, 2017
d03aebb
[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of …
BryanCutler Jul 10, 2017
c3713fd
[SPARK-21358][EXAMPLES] Argument of repartitionandsortwithinpartition…
chie8842 Jul 11, 2017
a2bec6c
[SPARK-21043][SQL] Add unionByName in Dataset
maropu Jul 11, 2017
1471ee7
[SPARK-21350][SQL] Fix the error message when the number of arguments…
gatorsmile Jul 11, 2017
833eab2
[SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*
zsxwing Jul 11, 2017
97a1aa2
[SPARK-21315][SQL] Skip some spill files when generateIterator(startI…
Jul 11, 2017
d4d9e17
[SPARK-20456][PYTHON][FOLLOWUP] Fix timezone-dependent doctests in un…
HyukjinKwon Jul 11, 2017
a4baa8f
[SPARK-20331][SQL] Enhanced Hive partition pruning predicate pushdown
Jul 11, 2017
7514db1
[SPARK-21263][SQL] Do not allow partially parsing double and floats v…
HyukjinKwon Jul 11, 2017
66d2168
[SPARK-21366][SQL][TEST] Add sql test for window functions
jiangxb1987 Jul 11, 2017
ebc124d
[SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema defi…
HyukjinKwon Jul 11, 2017
1cad31f
[SPARK-16019][YARN] Use separate RM poll interval when starting clien…
Jul 11, 2017
d3e0716
[SPARK-19285][SQL] Implement UDF0
gatorsmile Jul 11, 2017
2cbfc97
[SPARK-12139][SQL] REGEX Column Specification
janewangfb Jul 12, 2017
24367f2
[SPARK-21382] The note about Scala 2.10 in building-spark.md is wrong.
liu-zhaokun Jul 12, 2017
e16e8c7
[SPARK-21146][CORE] Master/Worker should handle and shutdown when any…
Jul 12, 2017
e0af76a
[SPARK-21370][SS] Add test for state reliability when one read-only s…
brkyvz Jul 12, 2017
f587d2e
[SPARK-20842][SQL] Upgrade to 1.2.2 for Hive Metastore Client 1.2
gatorsmile Jul 12, 2017
5ed134e
[SPARK-21305][ML][MLLIB] Add options to disable multi-threading of na…
Jul 12, 2017
aaad34d
[SPARK-21007][SQL] Add SQL function - RIGHT && LEFT
10110346 Jul 12, 2017
d2d2a5d
[SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/R…
zhengruifeng Jul 12, 2017
780586a
[SPARK-17701][SQL] Refactor RowDataSourceScanExec so its sameResult c…
cloud-fan Jul 12, 2017
e08d06b
[SPARK-18646][REPL] Set parent classloader as null for ExecutorClassL…
taroplus Jul 13, 2017
425c4ad
[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
srowen Jul 13, 2017
af80e01
[SPARK-21373][CORE] Update Jetty to 9.3.20.v20170531
kiszk Jul 13, 2017
d8257b9
[SPARK-21403][MESOS] fix --packages for mesos
skonto Jul 13, 2017
5c8edfc
[SPARK-15526][MLLIB] Shade JPMML
srowen Jul 13, 2017
aa2e951
Merge branch 'master' into rk/upstream
Jul 13, 2017
2ca1ed9
resolve conflicts
Jul 13, 2017
cb8d5cc
[SPARK-21376][YARN] Fix yarn client token expire issue when cleaning …
jerryshao Jul 13, 2017
1fe3936
Merge branch 'master' into rk/upstream
Jul 14, 2017
9c973e4
linting
Jul 14, 2017
0e0cc0a
update dependencies
Jul 14, 2017
d56829a
py4j in conda version
Jul 14, 2017
1 change: 1 addition & 0 deletions .gitignore
@@ -25,6 +25,7 @@ R-unit-tests.log
 R/unit-tests.out
 R/cran-check.out
 R/pkg/vignettes/sparkr-vignettes.html
+R/pkg/tests/fulltests/Rplots.pdf
 build/*.jar
 build/apache-maven*
 build/scala*
2 changes: 1 addition & 1 deletion LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
 (New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf)
 (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
 (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net)
-(The New BSD License) Py4J (net.sf.py4j:py4j:0.10.4 - http://py4j.sourceforge.net/)
+(The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - http://py4j.sourceforge.net/)
 (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
 (BSD licence) sbt and sbt-launch-lib.bash
 (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE)
2 changes: 2 additions & 0 deletions R/pkg/NAMESPACE
@@ -429,6 +429,7 @@ export("structField",
        "structField.character",
        "print.structField",
        "structType",
+       "structType.character",
        "structType.jobj",
        "structType.structField",
        "print.structType")
@@ -465,5 +466,6 @@ S3method(print, summary.GBTRegressionModel)
 S3method(print, summary.GBTClassificationModel)
 S3method(structField, character)
 S3method(structField, jobj)
+S3method(structType, character)
 S3method(structType, jobj)
 S3method(structType, structField)
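
The diff above exports a new structType.character S3 method, which is what lets a plain character vector stand in for a structType. A minimal sketch of the two now-equivalent ways to build a schema, assuming an attached SparkR package and a running session (the column names here are illustrative, not from the PR):

library(SparkR)
sparkR.session()

# Explicit builder API, available before this change:
schema1 <- structType(structField("a", "integer"), structField("b", "string"))

# DDL-formatted string, dispatched to the new structType.character method:
schema2 <- structType("a INT, b STRING")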
36 changes: 32 additions & 4 deletions R/pkg/R/DataFrame.R
@@ -1391,6 +1391,10 @@ setMethod("summarize",
           })

 dapplyInternal <- function(x, func, schema) {
+  if (is.character(schema)) {
+    schema <- structType(schema)
+  }
+
   packageNamesArr <- serialize(.sparkREnv[[".packages"]],
                                connection = NULL)

@@ -1408,6 +1412,8 @@ dapplyInternal <- function(x, func, schema) {
   dataFrame(sdf)
 }

+setClassUnion("characterOrstructType", c("character", "structType"))
+
 #' dapply
 #'
 #' Apply a function to each partition of a SparkDataFrame.
@@ -1418,10 +1424,11 @@ dapplyInternal <- function(x, func, schema) {
 #'             to each partition will be passed.
 #'             The output of func should be a R data.frame.
 #' @param schema The schema of the resulting SparkDataFrame after the function is applied.
-#'               It must match the output of func.
+#'               It must match the output of func. Since Spark 2.3, the DDL-formatted string
+#'               is also supported for the schema.
 #' @family SparkDataFrame functions
 #' @rdname dapply
-#' @aliases dapply,SparkDataFrame,function,structType-method
+#' @aliases dapply,SparkDataFrame,function,characterOrstructType-method
 #' @name dapply
 #' @seealso \link{dapplyCollect}
 #' @export
@@ -1444,6 +1451,17 @@ dapplyInternal <- function(x, func, schema) {
 #'            y <- cbind(y, y[1] + 1L)
 #'          },
 #'          schema)
+#'
+#' # The schema also can be specified in a DDL-formatted string.
+#' schema <- "a INT, b DOUBLE, c STRING, d INT"
+#' df1 <- dapply(
+#'          df,
+#'          function(x) {
+#'            y <- x[x[1] > 1, ]
+#'            y <- cbind(y, y[1] + 1L)
+#'          },
+#'          schema)
+#'
 #' collect(df1)
 #' # the result
 #' # a b c d
@@ -1452,7 +1470,7 @@ dapplyInternal <- function(x, func, schema) {
 #' }
 #' @note dapply since 2.0.0
 setMethod("dapply",
-          signature(x = "SparkDataFrame", func = "function", schema = "structType"),
+          signature(x = "SparkDataFrame", func = "function", schema = "characterOrstructType"),
          function(x, func, schema) {
            dapplyInternal(x, func, schema)
          })
@@ -1522,6 +1540,7 @@ setMethod("dapplyCollect",
 #' @param schema the schema of the resulting SparkDataFrame after the function is applied.
 #'               The schema must match to output of \code{func}. It has to be defined for each
 #'               output column with preferred output column name and corresponding data type.
+#'               Since Spark 2.3, the DDL-formatted string is also supported for the schema.
 #' @return A SparkDataFrame.
 #' @family SparkDataFrame functions
 #' @aliases gapply,SparkDataFrame-method
@@ -1541,7 +1560,7 @@ setMethod("dapplyCollect",
 #'
 #' Here our output contains three columns, the key which is a combination of two
 #' columns with data types integer and string and the mean which is a double.
-#' schema <- structType(structField("a", "integer"), structField("c", "string"), 
+#' schema <- structType(structField("a", "integer"), structField("c", "string"),
 #'                      structField("avg", "double"))
 #' result <- gapply(
 #'   df,
@@ -1550,6 +1569,15 @@ setMethod("dapplyCollect",
 #'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
 #'   }, schema)
 #'
+#' The schema also can be specified in a DDL-formatted string.
+#' schema <- "a INT, c STRING, avg DOUBLE"
+#' result <- gapply(
+#'   df,
+#'   c("a", "c"),
+#'   function(key, x) {
+#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
+#'   }, schema)
+#'
 #' We can also group the data and afterwards call gapply on GroupedData.
 #' For Example:
 #' gdf <- group_by(df, "a", "c")
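
Putting the DataFrame.R pieces together, here is a sketch of the new dapply code path end to end, assuming a running SparkR session (the data and column names are illustrative, not taken from the PR):

library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(a = 1:3,
                                 b = as.numeric(1:3),
                                 c = c("x", "y", "z"),
                                 stringsAsFactors = FALSE))

# The schema argument is a DDL-formatted string; dapplyInternal detects
# the character type and converts it with structType(schema) before use.
df1 <- dapply(df,
              function(x) cbind(x, x$a + 1L),
              "a INT, b DOUBLE, c STRING, d INT")
collect(df1)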