Kid xiong.fea.input.hive #163

Merged
merged 13 commits on Nov 23, 2018
7 changes: 3 additions & 4 deletions build.sbt
@@ -12,12 +12,11 @@ assemblyMergeStrategy in assembly := {
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
case "META-INF/mailcap" => MergeStrategy.last
case "META-INF/mimetypes.default" => MergeStrategy.last
case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".xml" => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".class" => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".thrift" => MergeStrategy.first
case "UnusedStubClass.class" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
4 changes: 4 additions & 0 deletions docs/zh-cn/configuration/_sidebar.md
@@ -15,6 +15,8 @@
- [Hdfs](/zh-cn/configuration/input-plugins/Hdfs)
- [HdfsStream](/zh-cn/configuration/input-plugins/HdfsStream)
- [KafkaStream](/zh-cn/configuration/input-plugins/KafkaStream)
- [Kudu](/zh-cn/configuration/input-plugins/Kudu)
- [MongoDB](/zh-cn/configuration/input-plugins/MongoDB)
- [S3Stream](/zh-cn/configuration/input-plugins/S3Stream)
- [SocketStream](/zh-cn/configuration/input-plugins/SocketStream)

@@ -49,6 +51,8 @@
- [Hdfs](/zh-cn/configuration/output-plugins/Hdfs)
- [Jdbc](/zh-cn/configuration/output-plugins/Jdbc)
- [Kafka](/zh-cn/configuration/output-plugins/Kafka)
- [Kudu](/zh-cn/configuration/output-plugins/Kudu)
- [MongoDB](/zh-cn/configuration/output-plugins/MongoDB)
- [MySQL](/zh-cn/configuration/output-plugins/MySQL)
- [S3](/zh-cn/configuration/output-plugins/S3)
- [Stdout](/zh-cn/configuration/output-plugins/Stdout)
11 changes: 11 additions & 0 deletions docs/zh-cn/configuration/input-plugins/Hive.docs
@@ -0,0 +1,11 @@
@waterdropPlugin
@pluginGroup input
@pluginName Hive
@pluginDesc "Read raw data from Hive"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string pre_sql yes "SQL statement for preprocessing; if no preprocessing is needed, use select * from hive_db.hive_table"
string table_name yes "Name of the temporary table that the data produced by pre_sql is registered as"
40 changes: 40 additions & 0 deletions docs/zh-cn/configuration/input-plugins/Hive.md
@@ -0,0 +1,40 @@
## Input plugin : Hive

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from Hive.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [pre_sql](#pre_sql-string) | string | yes | - |
| [table_name](#table_name-string) | string | yes | - |


##### pre_sql [string]

SQL statement used for preprocessing; if no preprocessing is needed, you can use `select * from hive_db.hive_table`.

##### table_name [string]

Name of the temporary table that the data produced by `pre_sql` is registered as.



### Example

```
hive {
pre_sql = "select * from mydb.mytb"
table_name = "myTable"
}
```

### Notes
In cluster and client mode, the Hadoop conf files and hive-site.xml must be placed in the Spark conf directory on every node of the cluster; for local debugging, put them under the resources directory.
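
Under the hood this plugin presumably just runs `pre_sql` through the SparkSession and registers the result; a minimal sketch, assuming the standard Spark API (the wrapper function is illustrative, not the plugin's actual code):

```
// Sketch: execute pre_sql against Hive and expose the result as a temp table
// named table_name. Wrapper name and signature are assumptions.
import org.apache.spark.sql.{DataFrame, SparkSession}

def readFromHive(spark: SparkSession, preSql: String, tableName: String): DataFrame = {
  val df = spark.sql(preSql)             // requires enableHiveSupport() on the session
  df.createOrReplaceTempView(tableName)  // downstream plugins refer to this name
  df
}
```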

12 changes: 12 additions & 0 deletions docs/zh-cn/configuration/input-plugins/Kudu.docs
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup input
@pluginName Kudu
@pluginDesc "Read data from an [Apache Kudu](https://kudu.apache.org) table"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string kudu_master yes "Kudu master address; separate multiple masters with commas"
string kudu_table yes "Name of the Kudu table to read"
string table_name yes "Name of the temporary table that the fetched data is registered as"
42 changes: 42 additions & 0 deletions docs/zh-cn/configuration/input-plugins/Kudu.md
@@ -0,0 +1,42 @@
## Input plugin : Kudu

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from an [Apache Kudu](https://kudu.apache.org) table.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [kudu_master](#kudu_master-string) | string | yes | - |
| [kudu_table](#kudu_table-string) | string | yes | - |
| [table_name](#table_name-string) | string | yes | - |


##### kudu_master [string]

Kudu master address; separate multiple masters with commas.

##### kudu_table [string]

Name of the Kudu table to read.

##### table_name [string]

Name of the temporary table that the fetched data is registered as.



### Example

```
kudu {
kudu_master="hadoop01:7051,hadoop02:7051,hadoop03:7051"
kudu_table="my_kudu_table"
table_name="reg_table"
}
```
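
For reference, reading such a table directly with the kudu-spark2 1.7.0 connector pinned in waterdrop-core/build.sbt looks roughly like this; `kudu.master` and `kudu.table` are the connector's documented options, while the surrounding wiring is an assumption:

```
// Sketch: load a Kudu table into a DataFrame and register it as a temp table.
import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.SparkSession

def readFromKudu(spark: SparkSession, kuduMaster: String, kuduTable: String, tableName: String): Unit = {
  val df = spark.read
    .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> kuduTable))
    .format("org.apache.kudu.spark.kudu")
    .load()
  df.createOrReplaceTempView(tableName)  // matches the table_name option above
}
```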
13 changes: 13 additions & 0 deletions docs/zh-cn/configuration/input-plugins/MongoDB.docs
@@ -0,0 +1,13 @@
@waterdropPlugin
@pluginGroup input
@pluginName MongoDB
@pluginDesc "Read data from [MongoDB](https://www.mongodb.com/)"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string readconfig.uri yes "MongoDB uri"
string readconfig.database yes "Database to read from"
string readconfig.collection yes "Collection to read from"
string table_name yes "Name of the temporary table that the read data is registered as"
54 changes: 54 additions & 0 deletions docs/zh-cn/configuration/input-plugins/MongoDB.md
@@ -0,0 +1,54 @@
## Input plugin : MongoDB

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from [MongoDB](https://www.mongodb.com/).

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [readconfig.uri](#readconfig.uri-string) | string | yes | - |
| [readconfig.database](#readconfig.database-string) | string | yes | - |
| [readconfig.collection](#readconfig.collection-string) | string | yes | - |
| [readconfig.*](#readconfig.*-string) | string | no | - |
| [table_name](#table_name-string) | string | yes | - |
Contributor comment:
Make readConfig consistently lowercase -> readconfig.
Add a readconfig.* entry to the options list; see KafkaStream for reference.



##### readconfig.uri [string]

URI of the MongoDB instance to read from.

##### readconfig.database [string]

MongoDB database to read.

##### readconfig.collection [string]

MongoDB collection to read.

##### readconfig.* [string]

Additional parameters can be configured here; see the `Input Configuration` section of https://docs.mongodb.com/spark-connector/v1.1/configuration/.
Specify a parameter by adding the prefix `readconfig.` to its original name; for example, `spark.mongodb.input.partitioner` is set as `readconfig.spark.mongodb.input.partitioner="MongoPaginateBySizePartitioner"`. Optional parameters that are not specified fall back to the defaults in the MongoDB documentation.

##### table_name [string]

Name of the temporary table that the data read from MongoDB is registered as.


### Example

```
mongodb {
readconfig.uri="mongodb://myhost:myport"
readconfig.database="mydatabase"
readconfig.collection="mycollection"
readconfig.spark.mongodb.input.partitioner = "MongoPaginateBySizePartitioner"
table_name = "test"
}
```

Contributor comment: Remember to fix the casing here as well.
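
For reference, the equivalent read with mongo-spark-connector 2.2.0 (the version pinned in waterdrop-core/build.sbt) looks roughly like this; `ReadConfig` and `MongoSpark.load` are the connector's API, while the wrapper and concrete values are assumptions mirroring the example above:

```
// Sketch: read a collection via ReadConfig and register it as a temp table.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

def readFromMongo(spark: SparkSession): Unit = {
  val readConfig = ReadConfig(Map(
    "uri"         -> "mongodb://myhost:myport",
    "database"    -> "mydatabase",
    "collection"  -> "mycollection",
    "partitioner" -> "MongoPaginateBySizePartitioner" // optional, as set via readconfig.*
  ))
  val df = MongoSpark.load(spark, readConfig)
  df.createOrReplaceTempView("test")
}
```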
12 changes: 12 additions & 0 deletions docs/zh-cn/configuration/output-plugins/Kudu.docs
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup output
@pluginName Kudu
@pluginDesc "Write data to an [Apache Kudu](https://kudu.apache.org) table"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string kudu_master yes "Kudu master address; separate multiple masters with commas"
string kudu_table yes "Name of the Kudu table to write to; the table must already exist"
string mode="insert" no "Write mode for Kudu: insert|update|upsert|insertIgnore"
43 changes: 43 additions & 0 deletions docs/zh-cn/configuration/output-plugins/Kudu.md
@@ -0,0 +1,43 @@
## Output plugin : Kudu

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Writes data to an [Apache Kudu](https://kudu.apache.org) table.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [kudu_master](#kudu_master-string) | string | yes | - |
| [kudu_table](#kudu_table-string) | string | yes | - |
| [mode](#mode-string) | string | no | insert |


##### kudu_master [string]

Kudu master address; separate multiple masters with commas.

##### kudu_table [string]

Name of the Kudu table to write to; the table must already exist.

##### mode [string]

Write mode used for Kudu; supported values are insert|update|upsert|insertIgnore, defaulting to insert (see the sketch after the example below).

* insert vs. insertIgnore: insert raises an error on a primary-key conflict, while insertIgnore reports no error and discards the conflicting row.
* update vs. upsert: update raises an error when the primary key to update does not exist, while upsert inserts the row instead.


### Example

```
kudu {
kudu_master="hadoop01:7051,hadoop02:7051,hadoop03:7051"
kudu_table="my_kudu_table"
mode="upsert"
}
```
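
The mode option presumably maps onto KuduContext's write methods; a hedged sketch against kudu-spark2 1.7.0 (insertRows/updateRows/upsertRows/insertIgnoreRows are real KuduContext methods, the dispatch wrapper is an assumption):

```
// Sketch: dispatch the configured mode to the matching KuduContext write call.
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.DataFrame

def writeToKudu(df: DataFrame, kuduMaster: String, kuduTable: String, mode: String): Unit = {
  val kuduContext = new KuduContext(kuduMaster, df.sparkSession.sparkContext)
  mode match {
    case "insert"       => kuduContext.insertRows(df, kuduTable)       // errors on PK conflict
    case "update"       => kuduContext.updateRows(df, kuduTable)       // errors on missing PK
    case "upsert"       => kuduContext.upsertRows(df, kuduTable)       // insert or update
    case "insertIgnore" => kuduContext.insertIgnoreRows(df, kuduTable) // drops conflicting rows
  }
}
```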
12 changes: 12 additions & 0 deletions docs/zh-cn/configuration/output-plugins/MongoDB.docs
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup output
@pluginName MongoDB
@pluginDesc "Write data to [MongoDB](https://www.mongodb.com/)"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string writeconfig.uri yes "MongoDB uri"
string writeconfig.database yes "Database to write to"
string writeconfig.collection yes "Collection to write to"
49 changes: 49 additions & 0 deletions docs/zh-cn/configuration/output-plugins/MongoDB.md
@@ -0,0 +1,49 @@
## Output plugin : MongoDB

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Writes data to [MongoDB](https://www.mongodb.com/).

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [writeconfig.uri](#writeconfig.uri-string) | string | yes | - |
| [writeconfig.database](#writeconfig.database-string) | string | yes | - |
| [writeconfig.collection](#writeconfig.collection-string) | string | yes | - |
| [writeconfig.*](#writeconfig.*-string) | string | no | - |

Contributor comment:
Make writeConfig consistently lowercase writeconfig.
Add a writeconfig.* entry to the options list; see KafkaStream.



##### writeconfig.uri [string]

URI of the MongoDB instance to write to.

##### writeconfig.database [string]

MongoDB database to write to.

##### writeconfig.collection [string]

MongoDB collection to write to.

##### writeconfig.* [string]

Contributor comment:
Change this to writeconfig.*, and give an example of how to configure the extra parameters here, the way KafkaStream does. If you can't be bothered to write it, users will be even less inclined to use Waterdrop.

Additional parameters can be configured here; see the `Output Configuration` section of https://docs.mongodb.com/spark-connector/v1.1/configuration/.
Specify a parameter by adding the prefix `writeconfig.` to its original name; for example, `localThreshold` is set as `writeconfig.localThreshold=20`. Optional parameters that are not specified fall back to the defaults in the MongoDB documentation.


### Example

```
mongodb {
writeconfig.uri="mongodb://myhost:myport"
writeconfig.database="mydatabase"
writeconfig.collection="mycollection"
}
```
Contributor comment: You wrote your own docs incorrectly and didn't even check? How are users supposed to read them?
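
For reference, the corresponding write with mongo-spark-connector 2.2.0 looks roughly like this; `WriteConfig` and `MongoSpark.save` are the connector's API, while the wrapper and concrete values are assumptions mirroring the example above:

```
// Sketch: save a DataFrame to a collection via WriteConfig.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.DataFrame

def writeToMongo(df: DataFrame): Unit = {
  val writeConfig = WriteConfig(Map(
    "uri"            -> "mongodb://myhost:myport",
    "database"       -> "mydatabase",
    "collection"     -> "mycollection",
    "localThreshold" -> "20" // optional, as set via writeconfig.*
  ))
  MongoSpark.save(df, writeConfig)
}
```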

5 changes: 4 additions & 1 deletion waterdrop-core/build.sbt
@@ -33,6 +33,9 @@ libraryDependencies ++= Seq(
exclude("org.spark-project.spark", "unused")
exclude("net.jpountz.lz4", "unused"),
"com.typesafe" % "config" % "1.3.1",
"org.apache.spark" %% "spark-hive" % sparkVersion ,
"org.mongodb.spark" %% "mongo-spark-connector" % "2.2.0",
"org.apache.kudu" %% "kudu-spark2" % "1.7.0",
"com.alibaba" % "QLExpress" % "3.2.0",
"com.alibaba" % "fastjson" % "1.2.47",
"commons-lang" % "commons-lang" % "2.6",
@@ -41,7 +44,7 @@
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.6.3",
"com.github.scopt" %% "scopt" % "3.7.0",
"org.apache.commons" % "commons-compress" % "1.15",
"ru.yandex.clickhouse" % "clickhouse-jdbc" % "0.1.39" excludeAll(ExclusionRule(organization="com.fasterxml.jackson.core"))
"ru.yandex.clickhouse" % "clickhouse-jdbc" % "0.1.39" exclude("com.google.guava","guava") excludeAll(ExclusionRule(organization="com.fasterxml.jackson.core"))
)

// For binary compatible conflicts, sbt provides dependency overrides.
@@ -6,4 +6,6 @@ io.github.interestinglab.waterdrop.output.Jdbc
io.github.interestinglab.waterdrop.output.Kafka
io.github.interestinglab.waterdrop.output.Mysql
io.github.interestinglab.waterdrop.output.S3
io.github.interestinglab.waterdrop.output.Stdout
io.github.interestinglab.waterdrop.output.MongoDB
io.github.interestinglab.waterdrop.output.Kudu
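
These .list resource files presumably enumerate plugin classes for discovery at startup; a hedged sketch of how such a registry could be consumed (purely illustrative, not Waterdrop's actual loader):

```
// Illustrative only: read a plugin list resource and instantiate each class.
import scala.io.Source

def loadPlugins(resource: String): List[Any] =
  Source.fromInputStream(getClass.getResourceAsStream(resource))
    .getLines()
    .filter(_.nonEmpty)
    .map(name => Class.forName(name).getDeclaredConstructor().newInstance())
    .toList
```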
@@ -1,3 +1,6 @@
io.github.interestinglab.waterdrop.input.Fake2
io.github.interestinglab.waterdrop.input.Hdfs
io.github.interestinglab.waterdrop.input.File
io.github.interestinglab.waterdrop.input.Hive
io.github.interestinglab.waterdrop.input.MongoDB
io.github.interestinglab.waterdrop.input.Kudu
@@ -172,7 +172,7 @@ object Waterdrop extends Logging {
println("\t" + key + " => " + value)
})

val sparkSession = SparkSession.builder.config(sparkConf).getOrCreate()
val sparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
Contributor comment: Does enabling this have side effects when the user is not using the hive input?
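
One way to address that concern, as a hedged sketch (the `hasHiveInput` flag is hypothetical, not part of this PR), would be to enable Hive support only when the config declares a hive input:

```
// Hypothetical guard, not in this PR: only enable Hive support when needed.
val builder = SparkSession.builder.config(sparkConf)
val sparkSession =
  if (hasHiveInput) builder.enableHiveSupport().getOrCreate() // hasHiveInput: assumed config check
  else builder.getOrCreate()
```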


// find all user defined UDFs and register in application init
UdfRegister.findAndRegisterUdfs(sparkSession)