From b6c37ef5a28dfa0ed07f6ab8f154fcd43a2131af Mon Sep 17 00:00:00 2001
From: Gil Vernik
Date: Sun, 8 Jun 2014 10:23:41 +0300
Subject: [PATCH 01/13] Openstack Swift support

---
 docs/openstack-integration.md | 83 +++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 docs/openstack-integration.md

diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md
new file mode 100644
index 0000000000000..42cd3067edf80
--- /dev/null
+++ b/docs/openstack-integration.md
@@ -0,0 +1,83 @@
+---
+layout: global
+title: Accessing Openstack Swift storage from Spark
+---
+
+# Accessing Openstack Swift storage from Spark
+
+Spark's file interface allows it to process data in Openstack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift://container.PROVIDER/path`. You will also need to set your Swift security credentials through `SparkContext.hadoopConfiguration`.
+
+#Configuring Hadoop to use Openstack Swift
+The Openstack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use previous Hadoop versions will need to configure the Swift driver manually.
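The `swift://` addressing scheme pairs a container name with a configured provider name (as in `swift://logs.SparkTest/data.log` later in this document). The following self-contained Java sketch shows how such a URI decomposes; it uses only the JDK, and the helper name and the split-at-last-dot rule are illustrative assumptions, not code from the Hadoop Swift driver:

```java
// Illustration only: how a swift://container.PROVIDER/path URI decomposes.
// Splitting the authority at the last '.' is an assumption that mirrors the
// container/service naming used throughout this document.
import java.net.URI;

public class SwiftUriExample {
    /** Returns {container, provider, objectPath} for a swift:// URI. */
    public static String[] decompose(String uriString) {
        URI uri = URI.create(uriString);
        if (!"swift".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not a swift:// URI: " + uriString);
        }
        String authority = uri.getAuthority();           // e.g. "logs.SparkTest"
        int dot = authority.lastIndexOf('.');
        String container = authority.substring(0, dot);  // container name
        String provider  = authority.substring(dot + 1); // matches fs.swift.service.<provider>.*
        return new String[] { container, provider, uri.getPath() };
    }

    public static void main(String[] args) {
        String[] parts = decompose("swift://logs.SparkTest/data.log");
        System.out.println("container=" + parts[0]
                + " provider=" + parts[1] + " path=" + parts[2]);
    }
}
```

Running `decompose("swift://logs.SparkTest/data.log")` yields container `logs`, provider `SparkTest`, and object path `/data.log`.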

Hadoop 2.3.0 and above.

+An Openstack Swift driver was merged into Hadoop 2.3.0. The current Hadoop driver requires Swift to use Keystone authentication. There are additional efforts to support temp auth for Hadoop [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
+To configure Hadoop to work with Swift, one needs to modify core-sites.xml of Hadoop and set up the Swift FS.
+
+    <configuration>
+      <property>
+        <name>fs.swift.impl</name>
+        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+      </property>
+    </configuration>
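core-sites.xml is plain name/value XML that Hadoop loads as key/value pairs. As a sanity check of the layout, the JDK-only sketch below parses the `fs.swift.impl` fragment from above into a map; it is an illustrative stand-in, not Hadoop's actual `Configuration` loader:

```java
// Parse a minimal core-sites.xml fragment into name -> value pairs with the
// JDK's DOM API. Illustrative sketch only; Hadoop's own Configuration class
// does the real loading.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CoreSiteExample {
    public static final String XML =
        "<configuration>"
      + "  <property>"
      + "    <name>fs.swift.impl</name>"
      + "    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>"
      + "  </property>"
      + "</configuration>";

    public static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element p = (Element) properties.item(i);
                String name  = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(XML));
    }
}
```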

Configuring Spark - standalone cluster

+You need to configure compute-classpath.sh and add the Hadoop classpath entries:
+
+    CLASSPATH = /share/hadoop/common/lib/*
+    CLASSPATH = /share/hadoop/hdfs/*
+    CLASSPATH = /share/hadoop/tools/lib/*
+    CLASSPATH = /share/hadoop/hdfs/lib/*
+    CLASSPATH = /share/hadoop/mapreduce/*
+    CLASSPATH = /share/hadoop/mapreduce/lib/*
+    CLASSPATH = /share/hadoop/yarn/*
+    CLASSPATH = /share/hadoop/yarn/lib/*
+
+Additional parameters have to be provided to Hadoop from Spark. The Swift driver of Hadoop uses those parameters to perform the authentication in Keystone that is needed to access Swift.
+The list of mandatory parameters is: `fs.swift.service.PROVIDER.auth.url`, `fs.swift.service.PROVIDER.auth.endpoint.prefix`, `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`,
+`fs.swift.service.PROVIDER.password`, `fs.swift.service.PROVIDER.http.port`, `fs.swift.service.PROVIDER.region`, `fs.swift.service.PROVIDER.public`.
+Create core-sites.xml and place it under the /spark/conf directory. Configure core-sites.xml with the general Keystone parameters, for example:
+
+    <property>
+      <name>fs.swift.service.PROVIDER.auth.url</name>
+      <value>http://127.0.0.1:5000/v2.0/tokens</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.auth.endpoint.prefix</name>
+      <value>endpoints</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.http.port</name>
+      <value>8080</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.region</name>
+      <value>RegionOne</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.public</name>
+      <value>true</value>
+    </property>
+
+We are left with `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username` and `fs.swift.service.PROVIDER.password`. The best way would be to provide them to the SparkContext at run time, which does not seem to be possible yet.
+Another approach is to change the Hadoop Swift FS driver to read them from system environment variables. For now we provide them via core-sites.xml:
+
+    <property>
+      <name>fs.swift.service.PROVIDER.tenant</name>
+      <value>test</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.username</name>
+      <value>tester</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.password</name>
+      <value>testing</value>
+    </property>
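Since every mandatory setting follows the same `fs.swift.service.PROVIDER.*` pattern, the key names can be generated programmatically before being set on a Hadoop configuration (or via `SparkContext.hadoopConfiguration`). A hypothetical Java sketch; the helper class name and the suffix list simply restate the parameters enumerated above:

```java
// Build the mandatory Hadoop Swift driver property keys for a given provider
// name. PROVIDER is an arbitrary service name chosen by the user; the suffixes
// follow the mandatory-parameter list in this document.
import java.util.ArrayList;
import java.util.List;

public class SwiftKeys {
    static final String[] SUFFIXES = {
        "auth.url", "auth.endpoint.prefix", "tenant", "username",
        "password", "http.port", "region", "public"
    };

    public static List<String> keysFor(String provider) {
        List<String> keys = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            keys.add("fs.swift.service." + provider + "." + suffix);
        }
        return keys;
    }

    public static void main(String[] args) {
        // e.g. fs.swift.service.SparkTest.auth.url, fs.swift.service.SparkTest.tenant, ...
        keysFor("SparkTest").forEach(System.out::println);
    }
}
```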

Usage

+Assume you have a Swift container `logs` with an object `data.log`. You can use `swift://` scheme to access objects from Swift. + + val sfdata = sc.textFile("swift://logs./data.log") + From ce483d76a1d524800859764b967c8b5a98fbd9ea Mon Sep 17 00:00:00 2001 From: Gil Vernik Date: Sun, 8 Jun 2014 10:34:04 +0300 Subject: [PATCH 02/13] SPARK-938 - Openstack Swift object storage support This is initial documentation describing how to integrate Spark with Swift. This commit contains documentation for stand alone cluster. Next patches will contain details how to integrate Swift in other deployment of Spark. --- docs/openstack-integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index 42cd3067edf80..ca422d298cc11 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -60,7 +60,7 @@ Create core-sites.xml and place it under /spark/conf directory. Configure core-s true
-We left with `fs.swift.service..tenant`, `fs.swift.service..username`, `fs.swift.service..password`. The best way is to provide them to SparkContext in run time, which seems to be impossible yet. +We left with `fs.swift.service..tenant`, `fs.swift.service..username`, `fs.swift.service..password`. The best way to provide those parameters to SparkContext in run time, which seems to be impossible yet. Another approach is to change Hadoop Swift FS driver to provide them via system environment variables. For now we provide them via core-sites.xml From eff538dd8fb7e306c84874e9b4c7da68fa0fe5d0 Mon Sep 17 00:00:00 2001 From: Gil Vernik Date: Sun, 8 Jun 2014 10:34:04 +0300 Subject: [PATCH 03/13] SPARK-938 - Openstack Swift object storage support Documentation how to integrate Spark with Openstack Swift. --- core/pom.xml | 6 +- docs/openstack-integration.md | 143 ++++++++++++++++++---------------- pom.xml | 11 +++ yarn/pom.xml | 4 + 4 files changed, 96 insertions(+), 68 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index bab50f5ce2888..93dadafe57046 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -35,7 +35,11 @@ org.apache.hadoop hadoop-client - + + org.apache.hadoop + hadoop-openstack + + net.java.dev.jets3t jets3t diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index 42cd3067edf80..07f22b3f12b13 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -8,76 +8,85 @@ title: Accessing Openstack Swift storage from Spark Spark's file interface allows it to process data in Openstack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift:///path`. You will also need to set your Swift security credentials, through `SparkContext.hadoopConfiguration`. 
 #Configuring Hadoop to use Openstack Swift
-The Openstack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use previous Hadoop versions will need to configure the Swift driver manually.
-

Hadoop 2.3.0 and above.

-An Openstack Swift driver was merged into Hadoop 2.3.0. The current Hadoop driver requires Swift to use Keystone authentication. There are additional efforts to support temp auth for Hadoop [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
+Openstack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use previous Hadoop versions will need to configure the Swift driver manually. The current Swift driver requires Swift to use the Keystone authentication method. There are recent efforts to also support temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
 To configure Hadoop to work with Swift, one needs to modify core-sites.xml of Hadoop and set up the Swift FS.
-
-    <configuration>
-      <property>
-        <name>fs.swift.impl</name>
-        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
-      </property>
-    </configuration>
+
+    <configuration>
+      <property>
+        <name>fs.swift.impl</name>
+        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+      </property>
+    </configuration>
+
+#Configuring Swift
+Proxy server of Swift should include the `list_endpoints` middleware. More information is available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
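For reference, enabling that middleware on the Swift side typically means adding it to the proxy server pipeline in proxy-server.conf, roughly along these lines. This is a hedged sketch: section names, pipeline ordering, and available options vary between Swift releases, so check the linked middleware source for your version.

```ini
# proxy-server.conf (sketch; verify names against your Swift release)
[pipeline:main]
pipeline = catch_errors healthcheck cache list-endpoints authtoken keystoneauth proxy-server

[filter:list-endpoints]
use = egg:swift#list_endpoints
# Optional: URL prefix under which endpoint queries are answered
# list_endpoints_path = /endpoints/
```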

Configuring Spark - standalone cluster

-You need to configure compute-classpath.sh and add the Hadoop classpath entries:
+#Configuring Spark
+To use the Swift driver, Spark needs to be compiled with `hadoop-openstack-2.3.0.jar`, distributed with Hadoop 2.3.0.
+For the Maven builds, Spark's main pom.xml should include
-
-    CLASSPATH = /share/hadoop/common/lib/*
-    CLASSPATH = /share/hadoop/hdfs/*
-    CLASSPATH = /share/hadoop/tools/lib/*
-    CLASSPATH = /share/hadoop/hdfs/lib/*
-    CLASSPATH = /share/hadoop/mapreduce/*
-    CLASSPATH = /share/hadoop/mapreduce/lib/*
-    CLASSPATH = /share/hadoop/yarn/*
-    CLASSPATH = /share/hadoop/yarn/lib/*
-
-Additional parameters have to be provided to Hadoop from Spark. The Swift driver of Hadoop uses those parameters to perform the authentication in Keystone that is needed to access Swift.
-The list of mandatory parameters is: `fs.swift.service.PROVIDER.auth.url`, `fs.swift.service.PROVIDER.auth.endpoint.prefix`, `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`,
-`fs.swift.service.PROVIDER.password`, `fs.swift.service.PROVIDER.http.port`, `fs.swift.service.PROVIDER.region`, `fs.swift.service.PROVIDER.public`.
-Create core-sites.xml and place it under the /spark/conf directory. Configure core-sites.xml with the general Keystone parameters, for example:
-
-    <property>
-      <name>fs.swift.service.PROVIDER.auth.url</name>
-      <value>http://127.0.0.1:5000/v2.0/tokens</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.auth.endpoint.prefix</name>
-      <value>endpoints</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.http.port</name>
-      <value>8080</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.region</name>
-      <value>RegionOne</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.public</name>
-      <value>true</value>
-    </property>
-
-We are left with `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username` and `fs.swift.service.PROVIDER.password`. The best way would be to provide them to the SparkContext at run time, which does not seem to be possible yet.
-Another approach is to change the Hadoop Swift FS driver to read them from system environment variables. For now we provide them via core-sites.xml:
-
-    <property>
-      <name>fs.swift.service.PROVIDER.tenant</name>
-      <value>test</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.username</name>
-      <value>tester</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.password</name>
-      <value>testing</value>
-    </property>
-

Usage

-Assume you have a Swift container `logs` with an object `data.log`. You can use the `swift://` scheme to access objects from Swift.
-
-    val sfdata = sc.textFile("swift://logs.PROVIDER/data.log")
+    <swift.version>2.3.0</swift.version>
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-openstack</artifactId>
+      <version>${swift.version}</version>
+    </dependency>
+
+In addition, the pom.xml of the `core` and `yarn` projects should include
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-openstack</artifactId>
+    </dependency>
+
+Additional parameters have to be provided to the Swift driver. The Swift driver will use those parameters to perform authentication in Keystone prior to accessing Swift. The list of mandatory parameters is: `fs.swift.service.PROVIDER.auth.url`, `fs.swift.service.PROVIDER.auth.endpoint.prefix`, `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`,
+`fs.swift.service.PROVIDER.password`, `fs.swift.service.PROVIDER.http.port`, `fs.swift.service.PROVIDER.region`, `fs.swift.service.PROVIDER.public`, where `PROVIDER` is any name. `fs.swift.service.PROVIDER.auth.url` should point to the Keystone authentication URL.
+
+Create core-sites.xml with the mandatory parameters and place it under the /spark/conf directory. For example:
+
+    <property>
+      <name>fs.swift.service.PROVIDER.auth.url</name>
+      <value>http://127.0.0.1:5000/v2.0/tokens</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.auth.endpoint.prefix</name>
+      <value>endpoints</value>
+    </property>
+    <property>
+      <name>fs.swift.service.PROVIDER.http.port</name>
+      <value>8080</value>
+    </property>
+ + fs.swift.service..region + RegionOne + + + fs.swift.service..public + true + + +We left with `fs.swift.service..tenant`, `fs.swift.service..username`, `fs.swift.service..password`. The best way to provide those parameters to SparkContext in run time, which seems to be impossible yet. +Another approach is to adapt Swift driver to obtain those values from system environment variables. For now we provide them via core-sites.xml. +Assume a tenant `test` with user `tester` was defined in Keystone, then the core-sites.xml shoud include: + + + fs.swift.service..tenant + test + + + fs.swift.service..username + tester + + + fs.swift.service..password + testing + +# Usage +Assume there exists Swift container `logs` with an object `data.log`. To access `data.log` from Spark the `swift://` scheme should be used. +For example: + + val sfdata = sc.textFile("swift://logs./data.log") diff --git a/pom.xml b/pom.xml index 86264d1132ec4..79cf5fdc23d01 100644 --- a/pom.xml +++ b/pom.xml @@ -132,6 +132,7 @@ 3.0.0 1.7.6 0.7.1 + 2.3.0 64m 512m @@ -584,6 +585,11 @@ + + org.apache.hadoop + hadoop-openstack + ${swift.version} + org.apache.hadoop hadoop-yarn-api @@ -1024,6 +1030,11 @@ hadoop-client provided + + org.apache.hadoop + hadoop-openstack + provided + org.apache.hadoop hadoop-yarn-api diff --git a/yarn/pom.xml b/yarn/pom.xml index 6993c89525d8c..e58d8312f1a86 100644 --- a/yarn/pom.xml +++ b/yarn/pom.xml @@ -55,6 +55,10 @@ org.apache.hadoop hadoop-client + + org.apache.hadoop + hadoop-openstack + org.scalatest scalatest_${scala.binary.version} From 39a9737e16b27435f448030f1f7f7a6c506e08dc Mon Sep 17 00:00:00 2001 From: Gil Vernik Date: Thu, 12 Jun 2014 12:13:29 +0300 Subject: [PATCH 04/13] Spark integration with Openstack Swift --- core/pom.xml | 4 - docs/openstack-integration.md | 301 ++++++++++++++++++++++++---------- pom.xml | 13 +- yarn/pom.xml | 4 - 4 files changed, 215 insertions(+), 107 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 
93dadafe57046..fe6b2daba0581 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -35,10 +35,6 @@ org.apache.hadoop hadoop-client - - org.apache.hadoop - hadoop-openstack - net.java.dev.jets3t jets3t diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index a1aac02f6275e..a3179fce59c13 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -1,110 +1,237 @@ -yout: global -title: Accessing Openstack Swift storage from Spark +layout: global +title: Accessing Openstack Swift from Spark --- -# Accessing Openstack Swift storage from Spark +# Accessing Openstack Swift from Spark Spark's file interface allows it to process data in Openstack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a -URI of the form `swift:///path`. You will also need to set your -Swift security credentials, through `SparkContext.hadoopConfiguration`. - -#Configuring Hadoop to use Openstack Swift -Openstack Swift driver was merged in Hadoop verion 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users that wish to use previous Hadoop versions will need to configure Swift driver manually. Current Swift driver +URI of the form `swift:// - - fs.swift.impl - org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem - - +temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420). -#Configuring Swift +# Configuring Swift Proxy server of Swift should include `list_endpoints` middleware. More information -available [here] (https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py) - -#Configuring Spark -To use Swift driver, Spark need to be compiled with `hadoop-openstack-2.3.0.jar` -distributted with Hadoop 2.3.0. 
For the Maven builds, Spark's main pom.xml should include - - 2.3.0 +available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py) +# Compilation of Spark +Spark should be compiled with `hadoop-openstack-2.3.0.jar` that is distributted with Hadoop 2.3.0. +For the Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include + + --------- org.apache.hadoop hadoop-openstack - ${swift.version} + 2.3.0 + ---------- + -in addition, pom.xml of the `core` and `yarn` projects should include +in addition, both `core` and `yarn` projects should add `hadoop-openstack` to the `dependencies` section of their `pom.xml` + + ---------- org.apache.hadoop hadoop-openstack + ---------- + +# Configuration of Spark +Create `core-sites.xml` and place it inside `/spark/conf` directory. There are two main categories of parameters that should to be +configured: declaration of the Swift driver and the parameters that are required by Keystone. + +Configuration of Hadoop to use Swift File system achieved via + + + + + + + +
<tr><th>Property Name</th><th>Value</th></tr>
<tr><td>fs.swift.impl</td><td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td></tr>
+
+Additional parameters required by Keystone should be provided to the Swift driver. Those
+parameters will be used to perform authentication in Keystone to access Swift. The following table
+contains a list of Keystone mandatory parameters. `PROVIDER` can be any name.
+
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr><td>fs.swift.service.PROVIDER.auth.url</td><td>Keystone Authentication URL</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td><td>Keystone endpoints prefix</td><td>Optional</td></tr>
<tr><td>fs.swift.service.PROVIDER.tenant</td><td>Tenant</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.username</td><td>Username</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.password</td><td>Password</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.http.port</td><td>HTTP port</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.region</td><td>Keystone region</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.public</td><td>Indicates if all URLs are public</td><td>Mandatory</td></tr>
+
+For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing`
+defined for tenant `test`. Then `core-sites.xml` should include:
+
+    <property>
+      <name>fs.swift.impl</name>
+      <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.auth.url</name>
+      <value>http://127.0.0.1:5000/v2.0/tokens</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+      <value>endpoints</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.http.port</name>
+      <value>8080</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.region</name>
+      <value>RegionOne</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.public</name>
+      <value>true</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.tenant</name>
+      <value>test</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.username</name>
+      <value>tester</value>
+    </property>
+    <property>
+      <name>fs.swift.service.SparkTest.password</name>
+      <value>testing</value>
+    </property>
+
-Additional parameters have to be provided to the Swift driver. The Swift driver will use those
-parameters to perform authentication in Keystone prior to accessing Swift. The list of mandatory
-parameters is: `fs.swift.service.PROVIDER.auth.url`,
-`fs.swift.service.PROVIDER.auth.endpoint.prefix`, `fs.swift.service.PROVIDER.tenant`,
-`fs.swift.service.PROVIDER.username`,
-`fs.swift.service.PROVIDER.password`, `fs.swift.service.PROVIDER.http.port`,
-`fs.swift.service.PROVIDER.region`, `fs.swift.service.PROVIDER.public`, where
-`PROVIDER` is any name. `fs.swift.service.PROVIDER.auth.url` should point to the Keystone
-authentication URL.
-
-Create core-sites.xml with the mandatory parameters and place it under the /spark/conf
-directory. For example:
-
-    <property>
-      <name>fs.swift.service.PROVIDER.auth.url</name>
-      <value>http://127.0.0.1:5000/v2.0/tokens</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.auth.endpoint.prefix</name>
-      <value>endpoints</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.http.port</name>
-      <value>8080</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.region</name>
-      <value>RegionOne</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.public</name>
-      <value>true</value>
-    </property>
-
-We are left with `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`,
-`fs.swift.service.PROVIDER.password`. The best way would be to provide those parameters to the
-SparkContext at run time, which does not seem to be possible yet.
-Another approach is to adapt the Swift driver to obtain those values from system environment
-variables. For now we provide them via core-sites.xml.
-Assume a tenant `test` with user `tester` was defined in Keystone, then the core-sites.xml
-should include:
-
-    <property>
-      <name>fs.swift.service.PROVIDER.tenant</name>
-      <value>test</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.username</name>
-      <value>tester</value>
-    </property>
-    <property>
-      <name>fs.swift.service.PROVIDER.password</name>
-      <value>testing</value>
-    </property>
-
-# Usage
-Assume there exists Swift container `logs` with an object `data.log`. To access `data.log`
-from Spark the `swift://` scheme should be used.
-For example:
-
-    val sfdata = sc.textFile("swift://logs.PROVIDER/data.log")
+Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`,
+`fs.swift.service.PROVIDER.password` contain sensitive information, and keeping them in `core-sites.xml` is not always a good approach.
+We suggest keeping those parameters in `core-sites.xml` for testing purposes when running Spark via `spark-shell`. For job submissions they should be provided via `sparkContext.hadoopConfiguration`.
+
+# Usage examples
+Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone contains tenant `test`, user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains container `logs` with an object `data.log`. To access `data.log`
+from Spark the `swift://` scheme should be used.
+
+## Running Spark via spark-shell
+Make sure that `core-sites.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`,
+`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via the `swift://` scheme.
+
+    val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
+    sfdata.count()
+
+## Job submission via spark-submit
+In this case `core-sites.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`,
+`fs.swift.service.SparkTest.password`.
Example of Java usage: + + /* SimpleApp.java */ + import org.apache.spark.api.java.*; + import org.apache.spark.SparkConf; + import org.apache.spark.api.java.function.Function; + + public class SimpleApp { + public static void main(String[] args) { + String logFile = "swift://logs.SparkTest/data.log"; + SparkConf conf = new SparkConf().setAppName("Simple Application"); + JavaSparkContext sc = new JavaSparkContext(conf); + sc.hadoopConfiguration().set("fs.swift.service.ibm.tenant", "test"); + sc.hadoopConfiguration().set("fs.swift.service.ibm.password", "testing"); + sc.hadoopConfiguration().set("fs.swift.service.ibm.username", "tester"); + + JavaRDD logData = sc.textFile(logFile).cache(); + + long num = logData.count(); + + System.out.println("Total number of lines: " + num); + } + } + +The directory sturture is + + find . + ./src + ./src/main + ./src/main/java + ./src/main/java/SimpleApp.java + +Maven pom.xml is + + + edu.berkeley + simple-project + 4.0.0 + Simple Project + jar + 1.0 + + + Akka repository + http://repo.akka.io/releases + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 2.3 + + 1.6 + 1.6 + + + + + + + org.apache.spark + spark-core_2.10 + 1.0.0 + + + + + +Compile and execute + + mvn package + SPARK_HOME/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar diff --git a/pom.xml b/pom.xml index 79cf5fdc23d01..92cf6bab1edf8 100644 --- a/pom.xml +++ b/pom.xml @@ -132,8 +132,7 @@ 3.0.0 1.7.6 0.7.1 - 2.3.0 - + 64m 512m @@ -585,11 +584,6 @@
- - org.apache.hadoop - hadoop-openstack - ${swift.version} - org.apache.hadoop hadoop-yarn-api @@ -1030,11 +1024,6 @@ hadoop-client provided - - org.apache.hadoop - hadoop-openstack - provided - org.apache.hadoop hadoop-yarn-api diff --git a/yarn/pom.xml b/yarn/pom.xml index e58d8312f1a86..6993c89525d8c 100644 --- a/yarn/pom.xml +++ b/yarn/pom.xml @@ -55,10 +55,6 @@ org.apache.hadoop hadoop-client - - org.apache.hadoop - hadoop-openstack - org.scalatest scalatest_${scala.binary.version} From 99f095d9577802912fa715495bd9aec3e3867d54 Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Sat, 14 Jun 2014 13:13:17 -0700 Subject: [PATCH 05/13] Pending openstack changes. --- docs/openstack-integration.md | 382 ++++++++++++++++++---------------- 1 file changed, 207 insertions(+), 175 deletions(-) diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index a3179fce59c13..f02e366075cd6 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -1,51 +1,68 @@ +--- layout: global -title: Accessing Openstack Swift from Spark +title: OpenStack Integration --- -# Accessing Openstack Swift from Spark +* This will become a table of contents (this text will be scraped). +{:toc} + + +# Accessing OpenStack Swift from Spark -Spark's file interface allows it to process data in Openstack Swift using the same URI +Spark's file interface allows it to process data in OpenStack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a -URI of the form `swift://swift://. You will also need to set your +Swift security credentials, through core-sites.xml or via +SparkContext.hadoopConfiguration. +Openstack Swift driver was merged in Hadoop version 2.3.0 +([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). +Users that wish to use previous Hadoop versions will need to configure Swift driver manually. +Current Swift driver requires Swift to use Keystone authentication method. 
There are recent efforts +to support temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420). # Configuring Swift -Proxy server of Swift should include `list_endpoints` middleware. More information -available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py) - -# Compilation of Spark -Spark should be compiled with `hadoop-openstack-2.3.0.jar` that is distributted with Hadoop 2.3.0. -For the Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include - - - --------- - - org.apache.hadoop - hadoop-openstack - 2.3.0 - - ---------- - - -in addition, both `core` and `yarn` projects should add `hadoop-openstack` to the `dependencies` section of their `pom.xml` - - - ---------- - - org.apache.hadoop - hadoop-openstack - - ---------- - -# Configuration of Spark -Create `core-sites.xml` and place it inside `/spark/conf` directory. There are two main categories of parameters that should to be -configured: declaration of the Swift driver and the parameters that are required by Keystone. +Proxy server of Swift should include list_endpoints middleware. More information +available +[here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py) -Configuration of Hadoop to use Swift File system achieved via +# Dependencies + +Spark should be compiled with hadoop-openstack-2.3.0.jar that is distributted with +Hadoop 2.3.0. For the Maven builds, the dependencyManagement section of Spark's main +pom.xml should include: +{% highlight xml %} + + ... + + org.apache.hadoop + hadoop-openstack + 2.3.0 + + ... + +{% endhighlight %} + +In addition, both core and yarn projects should add +hadoop-openstack to the dependencies section of their +pom.xml: +{% highlight xml %} + + ... + + org.apache.hadoop + hadoop-openstack + + ... + +{% endhighlight %} +# Configuration Parameters + +Create core-sites.xml and place it inside /spark/conf directory. 
+There are two main categories of parameters that should be
+configured: declaration of the Swift driver and the parameters that are required by Keystone.
+
+Configuration of Hadoop to use the Swift File system is achieved via
<tr><th>Property Name</th><th>Value</th></tr>
-Additional parameters requiered by Keystone and should be provided to the Swift driver. Those
+Additional parameters required by Keystone should be provided to the Swift driver. Those
 parameters will be used to perform authentication in Keystone to access Swift. The following table
-contains a list of Keystone mandatory parameters. `PROVIDER` can be any name.
+contains a list of Keystone mandatory parameters. PROVIDER can be any name.
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr><td>fs.swift.service.PROVIDER.auth.url</td><td>Keystone Authentication URL</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td><td>Keystone endpoints prefix</td><td>Optional</td></tr>
<tr><td>fs.swift.service.PROVIDER.tenant</td><td>Tenant</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.username</td><td>Username</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.password</td><td>Password</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.http.port</td><td>HTTP port</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.region</td><td>Keystone region</td><td>Mandatory</td></tr>
<tr><td>fs.swift.service.PROVIDER.public</td><td>Indicates if all URLs are public</td><td>Mandatory</td></tr>
-For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `tenant`. -Than `core-sites.xml` should include: - - - - fs.swift.impl - org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem - - - fs.swift.service.SparkTest.auth.url - http://127.0.0.1:5000/v2.0/tokens - - - fs.swift.service.SparkTest.auth.endpoint.prefix - endpoints - - fs.swift.service.SparkTest.http.port - 8080 - - - fs.swift.service.SparkTest.region - RegionOne - - - fs.swift.service.SparkTest.public - true - - - fs.swift.service.SparkTest.tenant - test - - - fs.swift.service.SparkTest.username - tester - - - fs.swift.service.SparkTest.password - testing - - - -Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, -`fs.swift.service.PROVIDER.password` contains sensitive information and keeping them in `core-sites.xml` is not always a good approach. -We suggest to keep those parameters in `core-sites.xml` for testing purposes when running Spark via `spark-shell`. For job submissions they should be provided via `sparkContext.hadoopConfiguration` +For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing +defined for tenant tenant. 
Than core-sites.xml should include: + +{% highlight xml %} + + + fs.swift.impl + org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem + + + fs.swift.service.SparkTest.auth.url + http://127.0.0.1:5000/v2.0/tokens + + + fs.swift.service.SparkTest.auth.endpoint.prefix + endpoints + + fs.swift.service.SparkTest.http.port + 8080 + + + fs.swift.service.SparkTest.region + RegionOne + + + fs.swift.service.SparkTest.public + true + + + fs.swift.service.SparkTest.tenant + test + + + fs.swift.service.SparkTest.username + tester + + + fs.swift.service.SparkTest.password + testing + + +{% endhighlight %} + +Notice that +fs.swift.service.PROVIDER.tenant, +fs.swift.service.PROVIDER.username, +fs.swift.service.PROVIDER.password contains sensitive information and keeping them in +core-sites.xml is not always a good approach. +We suggest to keep those parameters in core-sites.xml for testing purposes when running Spark +via spark-shell. +For job submissions they should be provided via sparkContext.hadoopConfiguration. # Usage examples -Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone contains tenant `test`, user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains container `logs` with an object `data.log`. To access `data.log` -from Spark the `swift://` scheme should be used. + +Assume Keystone's authentication URL is http://127.0.0.1:5000/v2.0/tokens and Keystone contains tenant test, user tester with password testing. In our example we define PROVIDER=SparkTest. Assume that Swift contains container logs with an object data.log. To access data.log from Spark the swift:// scheme should be used. + ## Running Spark via spark-shell -Make sure that `core-sites.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, -`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via `swift:\\` scheme. 
- - val sfdata = sc.textFile("swift://logs.SparkTest/data.log") - sfdata.count() - -## Job submission via spark-submit -In this case `core-sites.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, -`fs.swift.service.SparkTest.password`. Example of Java usage: - - /* SimpleApp.java */ - import org.apache.spark.api.java.*; - import org.apache.spark.SparkConf; - import org.apache.spark.api.java.function.Function; - - public class SimpleApp { - public static void main(String[] args) { - String logFile = "swift://logs.SparkTest/data.log"; - SparkConf conf = new SparkConf().setAppName("Simple Application"); - JavaSparkContext sc = new JavaSparkContext(conf); - sc.hadoopConfiguration().set("fs.swift.service.ibm.tenant", "test"); - sc.hadoopConfiguration().set("fs.swift.service.ibm.password", "testing"); - sc.hadoopConfiguration().set("fs.swift.service.ibm.username", "tester"); - - JavaRDD logData = sc.textFile(logFile).cache(); - - long num = logData.count(); - - System.out.println("Total number of lines: " + num); - } - } - -The directory sturture is - - find . - ./src - ./src/main - ./src/main/java - ./src/main/java/SimpleApp.java - -Maven pom.xml is - - - edu.berkeley - simple-project - 4.0.0 - Simple Project - jar - 1.0 - - - Akka repository - http://repo.akka.io/releases - - - - - - org.apache.maven.plugins - maven-compiler-plugin - 2.3 - - 1.6 - 1.6 - - - - - - - org.apache.spark - spark-core_2.10 - 1.0.0 - - - - -Compile and execute +Make sure that core-sites.xml contains fs.swift.service.SparkTest.tenant, fs.swift.service.SparkTest.username, +fs.swift.service.SparkTest.password. Run Spark via spark-shell and access Swift via swift:// scheme. 
+
+{% highlight scala %}
+val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
+sfdata.count()
+{% endhighlight %}
+
+
+## Sample Application
+
+In this case core-sites.xml need not contain fs.swift.service.SparkTest.tenant, fs.swift.service.SparkTest.username,
+fs.swift.service.SparkTest.password. Example of Java usage:
-    mvn package
-    SPARK_HOME/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar
+{% highlight java %}
+/* SimpleApp.java */
+import org.apache.spark.api.java.*;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.function.Function;
+public class SimpleApp {
+  public static void main(String[] args) {
+    String logFile = "swift://logs.SparkTest/data.log";
+    SparkConf conf = new SparkConf().setAppName("Simple Application");
+    JavaSparkContext sc = new JavaSparkContext(conf);
+    sc.hadoopConfiguration().set("fs.swift.service.ibm.tenant", "test");
+    sc.hadoopConfiguration().set("fs.swift.service.ibm.password", "testing");
+    sc.hadoopConfiguration().set("fs.swift.service.ibm.username", "tester");
+
+    JavaRDD<String> logData = sc.textFile(logFile).cache();
+
+    long num = logData.count();
+
+    System.out.println("Total number of lines: " + num);
+  }
+}
+{% endhighlight %}
+
+The directory structure is
+{% highlight bash %}
+./src
+./src/main
+./src/main/java
+./src/main/java/SimpleApp.java
+{% endhighlight %}
+
+Maven pom.xml should contain:
+{% highlight xml %}
+<project>
+  <groupId>edu.berkeley</groupId>
+  <artifactId>simple-project</artifactId>
+  <modelVersion>4.0.0</modelVersion>
+  <name>Simple Project</name>
+  <packaging>jar</packaging>
+  <version>1.0</version>
+  <repositories>
+    <repository>
+      <id>Akka repository</id>
+      <url>http://repo.akka.io/releases</url>
+    </repository>
+  </repositories>
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>2.3</version>
+        <configuration>
+          <source>1.6</source>
+          <target>1.6</target>
+        </configuration>
+      </plugin>
+    </plugins>
+  </build>
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-core_2.10</artifactId>
+      <version>1.0.0</version>
+    </dependency>
+  </dependencies>
+</project>
+{% endhighlight %}
+
+Compile and execute
+{% highlight bash %}
+mvn package
+SPARK_HOME/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
+{% endhighlight %}
From cca719227c828e790f7b2e3c94ef83f5fb55ceb3 Mon Sep 17 00:00:00 2001
From: Gil Vernik
Date: Mon, 16 Jun
2014 11:39:51 +0300 Subject: [PATCH 06/13] Removed white spases from pom.xml --- pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pom.xml b/pom.xml index 92cf6bab1edf8..86264d1132ec4 100644 --- a/pom.xml +++ b/pom.xml @@ -132,7 +132,7 @@ 3.0.0 1.7.6 0.7.1 - + 64m 512m From ac0679eb81389cef45c3b604581fb53274023f1c Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Mon, 16 Jun 2014 18:46:14 -0700 Subject: [PATCH 07/13] Fixed an unclosed tr. --- docs/openstack-integration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index f02e366075cd6..49661ef197585 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -63,12 +63,13 @@ There are two main categories of parameters that should to be configured: declar Swift driver and the parameters that are required by Keystone. Configuration of Hadoop to use Swift File system achieved via + - +
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
Additional parameters required by Keystone and should be provided to the Swift driver. Those From 9233fef3450846fc6ff1e7e7e3c75191a543a573 Mon Sep 17 00:00:00 2001 From: Gil Vernik Date: Wed, 18 Jun 2014 08:19:48 +0300 Subject: [PATCH 08/13] Fixed typos --- docs/openstack-integration.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index 49661ef197585..ac5b5a34a141c 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -209,7 +209,6 @@ public class SimpleApp { sc.hadoopConfiguration().set("fs.swift.service.ibm.username", "tester"); JavaRDD logData = sc.textFile(logFile).cache(); - long num = logData.count(); System.out.println("Total number of lines: " + num); From 0447c9fc563ef4bc02f937bfea63ea1d62f252cf Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Fri, 5 Sep 2014 23:59:34 -0700 Subject: [PATCH 09/13] Removed sample code. --- core/pom.xml | 2 +- docs/openstack-integration.md | 131 +++------------------------------- 2 files changed, 10 insertions(+), 123 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 746862892f074..55bfe0b841ea4 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -44,7 +44,7 @@
- + net.java.dev.jets3t jets3t diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md index ac5b5a34a141c..ff3cf95ac2f0b 100644 --- a/docs/openstack-integration.md +++ b/docs/openstack-integration.md @@ -1,6 +1,6 @@ --- layout: global -title: OpenStack Integration +title: OpenStack Swift Integration --- * This will become a table of contents (this text will be scraped). @@ -9,16 +9,12 @@ title: OpenStack Integration # Accessing OpenStack Swift from Spark -Spark's file interface allows it to process data in OpenStack Swift using the same URI -formats that are supported for Hadoop. You can specify a path in Swift as input through a -URI of the form swift://. You will also need to set your +Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the +same URI formats as in Hadoop. You can specify a path in Swift as input through a +URI of the form swift://container.PROVIDER/path. You will also need to set your Swift security credentials, through core-sites.xml or via -SparkContext.hadoopConfiguration. -Openstack Swift driver was merged in Hadoop version 2.3.0 -([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). -Users that wish to use previous Hadoop versions will need to configure Swift driver manually. -Current Swift driver requires Swift to use Keystone authentication method. There are recent efforts -to support temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420). +SparkContext.hadoopConfiguration. +Current Swift driver requires Swift to use Keystone authentication method. # Configuring Swift Proxy server of Swift should include list_endpoints middleware. More information @@ -27,9 +23,9 @@ available # Dependencies -Spark should be compiled with hadoop-openstack-2.3.0.jar that is distributted with -Hadoop 2.3.0. For the Maven builds, the dependencyManagement section of Spark's main -pom.xml should include: +The Spark application should include hadoop-openstack dependency. 
+For example, for Maven support, add the following to the pom.xml file: + {% highlight xml %} ... @@ -42,19 +38,6 @@ Hadoop 2.3.0. For the Maven builds, the dependencyManagement sectio {% endhighlight %} -In addition, both core and yarn projects should add -hadoop-openstack to the dependencies section of their -pom.xml: -{% highlight xml %} - - ... - - org.apache.hadoop - hadoop-openstack - - ... - -{% endhighlight %} # Configuration Parameters @@ -171,99 +154,3 @@ Notice that We suggest to keep those parameters in core-sites.xml for testing purposes when running Spark via spark-shell. For job submissions they should be provided via sparkContext.hadoopConfiguration. - -# Usage examples - -Assume Keystone's authentication URL is http://127.0.0.1:5000/v2.0/tokens and Keystone contains tenant test, user tester with password testing. In our example we define PROVIDER=SparkTest. Assume that Swift contains container logs with an object data.log. To access data.log from Spark the swift:// scheme should be used. - - -## Running Spark via spark-shell - -Make sure that core-sites.xml contains fs.swift.service.SparkTest.tenant, fs.swift.service.SparkTest.username, -fs.swift.service.SparkTest.password. Run Spark via spark-shell and access Swift via swift:// scheme. - -{% highlight scala %} -val sfdata = sc.textFile("swift://logs.SparkTest/data.log") -sfdata.count() -{% endhighlight %} - - -## Sample Application - -In this case core-sites.xml need not contain fs.swift.service.SparkTest.tenant, fs.swift.service.SparkTest.username, -fs.swift.service.SparkTest.password. 
Example of Java usage: - -{% highlight java %} -/* SimpleApp.java */ -import org.apache.spark.api.java.*; -import org.apache.spark.SparkConf; -import org.apache.spark.api.java.function.Function; - -public class SimpleApp { - public static void main(String[] args) { - String logFile = "swift://logs.SparkTest/data.log"; - SparkConf conf = new SparkConf().setAppName("Simple Application"); - JavaSparkContext sc = new JavaSparkContext(conf); - sc.hadoopConfiguration().set("fs.swift.service.ibm.tenant", "test"); - sc.hadoopConfiguration().set("fs.swift.service.ibm.password", "testing"); - sc.hadoopConfiguration().set("fs.swift.service.ibm.username", "tester"); - - JavaRDD logData = sc.textFile(logFile).cache(); - long num = logData.count(); - - System.out.println("Total number of lines: " + num); - } -} -{% endhighlight %} - -The directory structure is -{% highlight bash %} -./src -./src/main -./src/main/java -./src/main/java/SimpleApp.java -{% endhighlight %} - -Maven pom.xml should contain: -{% highlight xml %} - - edu.berkeley - simple-project - 4.0.0 - Simple Project - jar - 1.0 - - - Akka repository - http://repo.akka.io/releases - - - - - - org.apache.maven.plugins - maven-compiler-plugin - 2.3 - - 1.6 - 1.6 - - - - - - - org.apache.spark - spark-core_2.10 - 1.0.0 - - - -{% endhighlight %} - -Compile and execute -{% highlight bash %} -mvn package -SPARK_HOME/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar -{% endhighlight %} From 846f5cbbb605421646587cb2e065f070d83143ae Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Sat, 6 Sep 2014 00:05:18 -0700 Subject: [PATCH 10/13] Added a link from overview page. 
--- docs/index.md | 2 ++ ...penstack-integration.md => storage-openstack-swift.md} | 8 +------- 2 files changed, 3 insertions(+), 7 deletions(-) rename docs/{openstack-integration.md => storage-openstack-swift.md} (96%) diff --git a/docs/index.md b/docs/index.md index 4ac0982ae54f1..7fe6b43d32af7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -103,6 +103,8 @@ options for deployment: * [Security](security.html): Spark security support * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware * [3rd Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions +* Integration with other storage systems: + * [OpenStack Swift](storage-openstack-swift.html) * [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) diff --git a/docs/openstack-integration.md b/docs/storage-openstack-swift.md similarity index 96% rename from docs/openstack-integration.md rename to docs/storage-openstack-swift.md index ff3cf95ac2f0b..931e995e0f014 100644 --- a/docs/openstack-integration.md +++ b/docs/storage-openstack-swift.md @@ -1,14 +1,8 @@ --- layout: global -title: OpenStack Swift Integration +title: Accessing OpenStack Swift from Spark --- -* This will become a table of contents (this text will be scraped). -{:toc} - - -# Accessing OpenStack Swift from Spark - Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the same URI formats as in Hadoop. You can specify a path in Swift as input through a URI of the form swift://container.PROVIDER/path. You will also need to set your From dfb8fea59caee1e9fed743d3df18075cba510172 Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Sun, 7 Sep 2014 18:53:06 -0700 Subject: [PATCH 11/13] Updated based on Gil's suggestion. 
--- docs/storage-openstack-swift.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md index 931e995e0f014..ad0d3c6d7deaf 100644 --- a/docs/storage-openstack-swift.md +++ b/docs/storage-openstack-swift.md @@ -10,10 +10,12 @@ Swift security credentials, through core-sites.xml or via SparkContext.hadoopConfiguration. Current Swift driver requires Swift to use Keystone authentication method. -# Configuring Swift -Proxy server of Swift should include list_endpoints middleware. More information -available -[here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py) +# Configuring Swift for Better Data Locality + +Although not mandatory, it is recommended to configure the proxy server of Swift with +list_endpoints to have better data locality. More information is +[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py). + # Dependencies @@ -49,7 +51,7 @@ Configuration of Hadoop to use Swift File system achieved via -Additional parameters required by Keystone and should be provided to the Swift driver. Those +Additional parameters required by Keystone (v2.0) and should be provided to the Swift driver. Those parameters will be used to perform authentication in Keystone to access Swift. The following table contains a list of Keystone mandatory parameters. PROVIDER can be any name. @@ -98,7 +100,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing -defined for tenant tenant. Than core-sites.xml should include: +defined for tenant test. 
Than core-sites.xml should include: {% highlight xml %} From 279f6dea0c781133ac90ea2c6c550c443b211fcc Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Sun, 7 Sep 2014 20:47:16 -0700 Subject: [PATCH 12/13] core-sites -> core-site --- docs/storage-openstack-swift.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md index ad0d3c6d7deaf..54e5bdce17672 100644 --- a/docs/storage-openstack-swift.md +++ b/docs/storage-openstack-swift.md @@ -6,7 +6,7 @@ title: Accessing OpenStack Swift from Spark Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the same URI formats as in Hadoop. You can specify a path in Swift as input through a URI of the form swift://container.PROVIDER/path. You will also need to set your -Swift security credentials, through core-sites.xml or via +Swift security credentials, through core-site.xml or via SparkContext.hadoopConfiguration. Current Swift driver requires Swift to use Keystone authentication method. @@ -37,7 +37,7 @@ For example, for Maven support, add the following to the pom.xml fi # Configuration Parameters -Create core-sites.xml and place it inside /spark/conf directory. +Create core-site.xml and place it inside /spark/conf directory. There are two main categories of parameters that should to be configured: declaration of the Swift driver and the parameters that are required by Keystone. @@ -100,7 +100,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing -defined for tenant test. Than core-sites.xml should include: +defined for tenant test. 
Than core-site.xml should include: {% highlight xml %} @@ -146,7 +146,7 @@ Notice that fs.swift.service.PROVIDER.tenant, fs.swift.service.PROVIDER.username, fs.swift.service.PROVIDER.password contains sensitive information and keeping them in -core-sites.xml is not always a good approach. -We suggest to keep those parameters in core-sites.xml for testing purposes when running Spark +core-site.xml is not always a good approach. +We suggest to keep those parameters in core-site.xml for testing purposes when running Spark via spark-shell. For job submissions they should be provided via sparkContext.hadoopConfiguration. From ff4e3949052ec7e28f669e82c61499cd9c11a2fa Mon Sep 17 00:00:00 2001 From: Reynold Xin Date: Sun, 7 Sep 2014 20:52:31 -0700 Subject: [PATCH 13/13] Two minor comments from Patrick. --- docs/storage-openstack-swift.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md index 54e5bdce17672..c39ef1ce59e1c 100644 --- a/docs/storage-openstack-swift.md +++ b/docs/storage-openstack-swift.md @@ -37,7 +37,7 @@ For example, for Maven support, add the following to the pom.xml fi # Configuration Parameters -Create core-site.xml and place it inside /spark/conf directory. +Create core-site.xml and place it inside Spark's conf directory. There are two main categories of parameters that should to be configured: declaration of the Swift driver and the parameters that are required by Keystone. @@ -100,7 +100,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing -defined for tenant test. Than core-site.xml should include: +defined for tenant test. Then core-site.xml should include: {% highlight xml %}
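Throughout this patch series, the Keystone setup always reduces to the same fixed family of configuration keys, fs.swift.service.PROVIDER.*. The naming scheme can be sketched as a small Java helper that assembles the property map before it is applied to SparkContext.hadoopConfiguration. Note this is only an illustration: the SwiftConfig class and its keystoneProperties method are hypothetical, not part of Hadoop or Spark, and the port, region and endpoint-prefix values are the example values from the docs above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper illustrating the fs.swift.service.PROVIDER.* key
 * naming scheme described in the documentation above.
 */
public class SwiftConfig {
    public static Map<String, String> keystoneProperties(
            String provider, String authUrl, String tenant,
            String username, String password) {
        String prefix = "fs.swift.service." + provider + ".";
        Map<String, String> props = new LinkedHashMap<>();
        // General Keystone parameters; the fixed values here are the
        // example values used in the docs and would differ per deployment.
        props.put(prefix + "auth.url", authUrl);
        props.put(prefix + "auth.endpoint.prefix", "endpoints");
        props.put(prefix + "http.port", "8080");
        props.put(prefix + "region", "RegionOne");
        props.put(prefix + "public", "true");
        // Sensitive parameters, better supplied at job submission time
        // than stored in core-site.xml.
        props.put(prefix + "tenant", tenant);
        props.put(prefix + "username", username);
        props.put(prefix + "password", password);
        return props;
    }
}
```

Each entry would then be applied with sc.hadoopConfiguration().set(key, value), in the same way the sample application sets the three sensitive keys.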