Kid xiong.fea.input.hive #163
@@ -0,0 +1,11 @@
@waterdropPlugin
@pluginGroup input
@pluginName Hive
@pluginDesc "Read raw data from Hive"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string pre_sql yes "Preprocessing SQL; if no preprocessing is needed, you can use select * from hive_db.hive_table"
string table_name yes "Name of the temporary table that the data produced by the preprocessing SQL is registered as"
@@ -0,0 +1,40 @@
## Input plugin : Hive

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from Hive.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [pre_sql](#pre_sql-string) | string | yes | - |
| [table_name](#table_name-string) | string | yes | - |

##### pre_sql [string]

Preprocessing SQL; if no preprocessing is needed, you can use `select * from hive_db.hive_table`.

##### table_name [string]

Name of the temporary table that the data retrieved by `pre_sql` is registered as.

### Example

```
hive {
    pre_sql = "select * from mydb.mytb"
    table_name = "myTable"
}
```
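
For orientation, a minimal end-to-end sketch of how this block might sit in a full Waterdrop config is shown below. The spark, filter and output sections (and the sql/stdout plugins used there) are illustrative assumptions rather than part of this plugin; they only show how a downstream plugin can query the registered temporary table `myTable`.

```
spark {
    spark.app.name = "Waterdrop-hive-example"
    spark.executor.instances = 2
    spark.executor.cores = 1
    spark.executor.memory = "1g"
}

input {
    hive {
        pre_sql = "select * from mydb.mytb"
        table_name = "myTable"
    }
}

filter {
    # illustrative: downstream plugins refer to the registered temporary table by name
    sql {
        sql = "select count(*) as cnt from myTable"
    }
}

output {
    stdout {}
}
```
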

### Notes
In cluster and client mode, hadoopConf and hive-site.xml must be placed in the Spark conf directory on every node of the cluster; for local debugging, put them in the resources directory.
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup input
@pluginName Kudu
@pluginDesc "Read data from an [Apache Kudu](https://kudu.apache.org) table"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string kudu_master yes "Kudu master addresses, separated by commas if there are multiple masters"
string kudu_table yes "Name of the Kudu table to read"
string table_name yes "Name of the temporary table that the retrieved data is registered as"
@@ -0,0 +1,42 @@
## Input plugin : Kudu

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from an [Apache Kudu](https://kudu.apache.org) table.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [kudu_master](#kudu_master-string) | string | yes | - |
| [kudu_table](#kudu_table-string) | string | yes | - |
| [table_name](#table_name-string) | string | yes | - |

##### kudu_master [string]

Kudu master addresses, separated by commas if there are multiple masters.

##### kudu_table [string]

Name of the Kudu table to read.

##### table_name [string]

Name of the temporary table that the retrieved data is registered as.

### Example

```
kudu {
    kudu_master = "hadoop01:7051,hadoop02:7051,hadoop03:7051"
    kudu_table = "my_kudu_table"
    table_name = "reg_table"
}
```
@@ -0,0 +1,13 @@
@waterdropPlugin
@pluginGroup input
@pluginName MongoDB
@pluginDesc "Read data from [MongoDB](https://www.mongodb.com/)"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string readConfig.uri yes "MongoDB uri"
string readConfig.database yes "Database to read from"
string readConfig.collection yes "Collection to read from"
string table_name yes "Name of the temporary table that the retrieved data is registered as"
@@ -0,0 +1,54 @@
## Input plugin : MongoDB

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Reads data from [MongoDB](https://www.mongodb.com/).

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [readconfig.uri](#readconfig.uri-string) | string | yes | - |
| [readconfig.database](#readconfig.database-string) | string | yes | - |
| [readconfig.collection](#readconfig.collection-string) | string | yes | - |
| [readconfig.*](#readconfig.*-string) | string | no | - |
| [table_name](#table_name-string) | string | yes | - |

##### readconfig.uri [string]

URI of the MongoDB instance to read from.

##### readconfig.database [string]

MongoDB database to read from.

##### readconfig.collection [string]

MongoDB collection to read from.

#### readconfig

Additional parameters can be configured here as well; see https://docs.mongodb.com/spark-connector/v1.1/configuration/, in particular the `Input Configuration` section.
To specify a parameter, prefix its original name with "readconfig.". For example, `spark.mongodb.input.partitioner` is set as `readconfig.spark.mongodb.input.partitioner="MongoPaginateBySizePartitioner"`. If these optional parameters are not specified, the defaults from the official MongoDB documentation are used.

##### table_name [string]

Name of the temporary table that the data retrieved from MongoDB is registered as.

### Example

```
mongodb {
    readconfig.uri = "mongodb://myhost:mypost"
    readconfig.database = "mydatabase"
    readconfig.collection = "mycollection"
    readconfig.spark.mongodb.input.partitioner = "MongoPaginateBySizePartitioner"
    table_name = "test"
}
```

Review comment: Remember to fix the casing here.
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup output
@pluginName Kudu
@pluginDesc "Write data to an [Apache Kudu](https://kudu.apache.org) table"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string kudu_master yes "Kudu master addresses, separated by commas if there are multiple masters"
string kudu_table yes "Name of the Kudu table to write to; the table must already exist"
string mode="insert" no "Write mode for Kudu: insert|update|upsert|insertIgnore"
@@ -0,0 +1,43 @@
## Output plugin : Kudu

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Writes data to an [Apache Kudu](https://kudu.apache.org) table.

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [kudu_master](#kudu_master-string) | string | yes | - |
| [kudu_table](#kudu_table-string) | string | yes | - |
| [mode](#mode-string) | string | no | insert |

##### kudu_master [string]

Kudu master addresses, separated by commas if there are multiple masters.

##### kudu_table [string]

Name of the Kudu table to write to; the table must already exist.

##### mode [string]

Write mode used for Kudu. Supports insert|update|upsert|insertIgnore; the default is insert.
insert vs. insertIgnore: insert fails on a primary-key conflict, while insertIgnore does not fail and discards the conflicting row.
update vs. upsert: update fails when the primary key to update is not found, while upsert does not fail and inserts the row instead.

### Example

```
kudu {
    kudu_master = "hadoop01:7051,hadoop02:7051,hadoop03:7051"
    kudu_table = "my_kudu_table"
    mode = "upsert"
}
```
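
As a further illustration of the mode semantics above (a sketch only; the master addresses and table name are the same placeholders as in the example), an idempotent re-load could use insertIgnore so that rows whose primary keys already exist are dropped instead of failing the job.

```
kudu {
    kudu_master = "hadoop01:7051,hadoop02:7051,hadoop03:7051"
    kudu_table = "my_kudu_table"
    # rows with already-existing primary keys are silently discarded
    mode = "insertIgnore"
}
```
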
@@ -0,0 +1,12 @@
@waterdropPlugin
@pluginGroup output
@pluginName MongoDB
@pluginDesc "Write data to [MongoDB](https://www.mongodb.com/)"
@pluginAuthor InterestingLab
@pluginHomepage https://interestinglab.github.io/waterdrop
@pluginVersion 1.0.0

@pluginOption
string readConfig.uri yes "MongoDB uri"
string readConfig.database yes "Database to write to"
string readConfig.collection yes "Collection to write to"
@@ -0,0 +1,49 @@
## Output plugin : MongoDB

* Author: InterestingLab
* Homepage: https://interestinglab.github.io/waterdrop
* Version: 1.0.0

### Description

Writes data to [MongoDB](https://www.mongodb.com/).

### Options

| name | type | required | default value |
| --- | --- | --- | --- |
| [writeconfig.uri](#writeconfig.uri-string) | string | yes | - |
| [writeconfig.database](#writeconfig.database-string) | string | yes | - |
| [writeconfig.collection](#writeconfig.collection-string) | string | yes | - |
| [writeconfig.*](#writeconfig.*-string) | string | no | - |

##### writeconfig.uri [string]

URI of the MongoDB instance to write to.

##### writeconfig.database [string]

MongoDB database to write to.

##### writeconfig.collection [string]

MongoDB collection to write to.

#### writeconfig

Additional parameters can be configured here as well; see https://docs.mongodb.com/spark-connector/v1.1/configuration/, in particular the `Output Configuration` section.
To specify a parameter, prefix its original name with "writeconfig.". For example, `localThreshold` is set as `writeconfig.localThreshold=20`. If these optional parameters are not specified, the defaults from the official MongoDB documentation are used.

### Example

```
mongodb {
    writeconfig.uri = "mongodb://myhost:mypost"
    writeconfig.database = "mydatabase"
    writeconfig.collection = "mycollection"
}
```

Review comment: Your own docs are written wrong and you didn't even check them? How are users supposed to follow them?
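
For illustration only, the mongodb output example above could also pass optional `writeconfig.*` settings. The `localThreshold` value below is just the one mentioned in the text, and the host/port are placeholders.

```
mongodb {
    writeconfig.uri = "mongodb://myhost:27017"
    writeconfig.database = "mydatabase"
    writeconfig.collection = "mycollection"
    # optional: any Output Configuration option, passed with the writeconfig. prefix
    writeconfig.localThreshold = 20
}
```
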
@@ -1,3 +1,6 @@
io.github.interestinglab.waterdrop.input.Fake2
io.github.interestinglab.waterdrop.input.Hdfs
io.github.interestinglab.waterdrop.input.File
io.github.interestinglab.waterdrop.input.Hive
io.github.interestinglab.waterdrop.input.MongoDB
io.github.interestinglab.waterdrop.input.Kudu
@@ -172,7 +172,7 @@ object Waterdrop extends Logging {
      println("\t" + key + " => " + value)
    })

-   val sparkSession = SparkSession.builder.config(sparkConf).getOrCreate()
+   val sparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()

Review comment: Does this have any side effects when the user is not using the Hive input?

    // find all user defined UDFs and register in application init
    UdfRegister.findAndRegisterUdfs(sparkSession)
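
One possible answer to the review question above, sketched only and not part of this change, is to enable Hive support conditionally. The `inputNames` list below is an illustrative assumption standing in for however the configured input plugin names are obtained.

```
import org.apache.spark.sql.SparkSession

// Sketch: call enableHiveSupport() only when a Hive input is actually configured.
// `inputNames` is an assumed list of configured input plugin names (illustrative),
// and `sparkConf` is the SparkConf built earlier in Waterdrop.scala.
val builder = SparkSession.builder.config(sparkConf)
val sparkSession =
  if (inputNames.exists(_.equalsIgnoreCase("hive"))) builder.enableHiveSupport().getOrCreate()
  else builder.getOrCreate()
```
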
|
Review comment: Change readConfig to lowercase (readconfig) consistently, and add a readconfig.* entry to the options list; see KafkaStream for reference.