The HBaseReader is for reading a HBase table into a dataframe in batch mode.
-
The following connection properties are required to be able to connect to a HBase environment
- connection.hbase.zookeeper.quorum: a comma-separated list of hosts on which ZooKeeper servers are running
- connection.hbase.zookeeper.property.clientPort: the port number at which ZooKeeper servers are running on each host
- connection.zookeeper.znode.parent: the root znode that will contain all the znodes created/used by HBase
- connection.core-site.xml: the core-site configuration
- connection.hdfs-site.xml: the hdfs-site configuration
- connection.hbase-site.xml: the hbase-site configuration
Note: if option 6 is configured, then option 1, 2, 3 are ignored.
-
The following two options are mandatory in secured hbase environment 7. connection.hadoop.security.authentication: it must be the value of kerberos 8. connection.hbase.security.authentication: it must be the value of kerberos
-
The table property specifies the target HBase table to which the data will be writing.
-
The following defines the data mapping of columns from a source HBase table to columns of a dataframe:
columnsMapping: dst-col0: "__:rowKey" dst-col1: "col-family1:src-col1" dst-col2: "col-family2:src-col2" ...
Where:
- dst-col0 in the target dataframe is from the row-key in the source HBase table
- dst-col1, dst-col2 ... in the target dataframe are from the col-family1:src-col1, col-family2:src-col2, ... columns in the source HBase table.
-
The options property defines the reading behavior.
- maxResultSize: set the maximum result size. The default is -1; this means that no specific maximum result size will be set for this scan, and the global configured value will be used instead. (Defaults to unlimited).
- batchSize: set the maximum number of values to return for each call.
- isolationLevel: It must be either READ_COMMITTED or READ_UNCOMMITED.
- numPartitions: the number of parititions for the output dataframe
-
The filter defined what rows will be read from the source.
- keyStart: the start value of the row key
- keyStop: the end value of the row key
- keyPrefix: the prefix of the row key
- tsStart: the start timestamp of rows
- tsEnd: the end timestamp of rows
Actor Class: com.qwshen.etl.source.HBaseReader
The definition of HBaseWriter:
- In YAML format
actor:
type: hbase-reader
properties:
connection:
hbase.zookeeper.quorum: 127.0.0.1
hbase.zookeeper.property.clientPort: 2181
zookeeper.znode.parent: "/hbase-unsecure"
hadoop.security.authentication: kerberos
hbase.security.authentication: kerberos
table: "events_db:users"
columnsMapping:
user_id: "__:rowKey"
birthyear: "profile:birth_year"
gender: "profile:gender"
address: "location:address"
options:
numPartitions: 16
batchSize: 900
filter:
tsStart: "2020-01-01 00:00:00"
tsEnd: "2020-12-31 23:59:59"
- In JSON format
{
"actor": {
"type": "hbase-reader",
"properties": {
"connection": {
"core-site.xml": "/usr/hdp/current/hbase-server/conf/core-site.xml",
"hdfs-site.xml": "/usr/hdp/current/hbase-server/conf/hdfs-site.xml",
"hbase-site.xml": "/usr/hdp/current/hbase-server/conf/hbase-site.xml"
},
"table": "events_db:users",
"columnsMapping": {
"user_id": "__:rowKey",
"birthyear": "profile:birth_year",
"gender": "profile:gender",
"address": "location:address"
},
"options": {
"numPartitions": "16",
"batchSize": "9000",
"isolationLevel": "READ_COMMITTED"
},
"filter": {
"tsStart": "2020-01-01 00:00:00",
"tsEnd": "2020-12-31 23:59:59"
}
}
}
}
- In XML format
<actor type="hbase-reader">
<properties>
<connection>
<core-site.xml>/usr/hdp/current/hbase-server/conf/core-site.xml</core-site.xml>
<hdfs-site.xml>/usr/hdp/current/hbase-server/conf/hdfs-site.xml</hdfs-site.xml>
<hbase-site.xml>/usr/hdp/current/hbase-server/conf/hbase-site.xml</hbase-site.xml>
</connection>
<table>events_db:users</table>
<columnsMapping>
<user_id>__:rowKey</user_id>
<birthyear>profile:birth_year</birthyear>
<gender>profile:gender</gender>
<address>location:address</address>
</columnsMapping>
<options>
<numPartitions>16</numPartitions>
<batchSize>9000</batchSize>
</options>
<filter>
<tsStart>2020-01-01 00:00:00</tsStart>
</filter>
</properties>
</actor>