fix document
chenzy15 committed Jun 28, 2023
1 parent 0451628 commit 069eb47
Showing 1 changed file with 111 additions and 21 deletions.
132 changes: 111 additions & 21 deletions docs/en/connector-v2/source/MongoDB-CDC.md
@@ -2,13 +2,11 @@

> MongoDB CDC source connector
-Support Those Engines
----------------------
+## Support Those Engines

> SeaTunnel Zeta<br/>
-Key Features
-------------
+## Key Features

- [ ] [batch](../../concept/connector-v2-features.md)
- [x] [stream](../../concept/connector-v2-features.md)
@@ -17,23 +15,20 @@ Key Features
- [x] [parallelism](../../concept/connector-v2-features.md)
- [x] [support user-defined split](../../concept/connector-v2-features.md)

-Description
------------
+## Description

The MongoDB CDC connector reads snapshot data and incremental data from a MongoDB database.

-Supported DataSource Info
--------------------------
+## Supported DataSource Info

-In order to use the Mongodb connector, the following dependencies are required.
+In order to use the MongoDB CDC connector, the following dependencies are required.
They can be downloaded via install-plugin.sh or from the Maven central repository.

| Datasource | Supported Versions | Dependency |
|------------|--------------------|-------------------------------------------------------------------------------------------------------------------|
| MongoDB | universal | [Download](https://mvnrepository.com/artifact/org.apache.seatunnel/seatunnel-connectors-v2/connector-cdc-mongodb) |

-Availability Settings
----------------------
+## Availability Settings

1.MongoDB version: MongoDB version >= 4.0.

@@ -75,8 +70,7 @@ db.createUser(
);
```
-Data Type Mapping
------------------
+## Data Type Mapping
The following table lists the field data type mapping from MongoDB BSON types to SeaTunnel data types.
@@ -108,32 +102,34 @@ For specific types in MongoDB, we use Extended JSON format to map them to Seatun
> 1. When using the DECIMAL type in SeaTunnel, be aware that the precision cannot exceed 34 digits, which means you should use decimal(34, 18).<br/>
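
The 34-digit ceiling in the note above corresponds to the precision of BSON Decimal128. As a minimal sketch, the following Python helper checks whether a value fits a decimal(34, 18) column without rounding. The name `fits_decimal` and its defaults are hypothetical and are not part of SeaTunnel.

```python
from decimal import Decimal, InvalidOperation


def fits_decimal(value: str, precision: int = 34, scale: int = 18) -> bool:
    """Return True if `value` fits decimal(precision, scale) without rounding.

    Hypothetical helper for illustration; not part of the connector.
    """
    try:
        number = Decimal(value)
    except InvalidOperation:
        return False  # not a parseable decimal literal
    if not number.is_finite():
        return False  # NaN and Infinity have no fixed-point representation
    _, digits, exponent = number.as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale


if __name__ == "__main__":
    # 16 integer digits plus 18 fractional digits: exactly at the limit
    print(fits_decimal("1234567890123456.123456789012345678"))
    # 17 integer digits: one more than decimal(34, 18) leaves room for
    print(fits_decimal("12345678901234567.5"))
```

With a scale of 18, only 16 digits remain for the integer part, so values that are valid Decimal128 numbers can still overflow the column.
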
-Source Options
---------------
+## Source Options
| Name | Type | Required | Default | Description |
|------------------------------------|--------|----------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| hosts | String | Yes | - | The comma-separated list of hostname and port pairs of the MongoDB servers. eg. `localhost:27017,localhost:27018` |
| username | String | No | - | Name of the database user to be used when connecting to MongoDB. |
| password | String | No | - | Password to be used when connecting to MongoDB. |
-| database | String | Yes | - | Name of the database to watch for changes. If not set then all databases will be captured. The database also supports regular expressions to monitor multiple databases matching the regular expression. eg. `db1,db2` |
-| collection | String | Yes | - | Name of the collection in the database to watch for changes. If not set then all collections will be captured. The collection also supports regular expressions to monitor multiple collections matching fully-qualified collection identifiers. eg. `db1.coll1,db2.coll2` |
+| database | List | Yes | - | Name of the database to watch for changes. If not set then all databases will be captured. The database also supports regular expressions to monitor multiple databases matching the regular expression. eg. `db1,db2` |
+| collection | List | Yes | - | Name of the collection in the database to watch for changes. If not set then all collections will be captured. The collection also supports regular expressions to monitor multiple collections matching fully-qualified collection identifiers. eg. `db1.coll1,db2.coll2` |
| connection.options | String | No | - | The ampersand-separated connection options of MongoDB. eg. `replicaSet=test&connectTimeoutMS=300000` |
| batch.size | Long | No | 1024 | The cursor batch size. |
| poll.max.batch.size | Enum | No | 1024 | Maximum number of change stream documents to include in a single batch when polling for new data. |
| poll.await.time.ms | Long | No | 1000 | The amount of time to wait before checking for new results on the change stream. |
| heartbeat.interval.ms | String | No | 0 | The length of time in milliseconds between sending heartbeat messages. Use 0 to disable. |
| incremental.snapshot.chunk.size.mb | Long | No | 64 | The chunk size mb of incremental snapshot. |
| common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details |
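
To make the relationship between `hosts`, `username`, `password`, and `connection.options` concrete, here is a minimal Python sketch that assembles them into a standard MongoDB connection string. It assumes these options are combined in the usual `mongodb://` URI form, which this document does not spell out; `build_mongo_uri` is a hypothetical helper, not a SeaTunnel API.

```python
from typing import Optional
from urllib.parse import quote_plus


def build_mongo_uri(
    hosts: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    connection_options: Optional[str] = None,
) -> str:
    """Assemble a MongoDB connection string from connector-style options.

    hosts: comma-separated host:port pairs, e.g. "localhost:27017,localhost:27018"
    connection_options: ampersand-separated options, e.g. "replicaSet=test"
    """
    credentials = ""
    if username:
        # Reserved characters in credentials must be percent-encoded
        credentials = quote_plus(username)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{hosts}/"
    if connection_options:
        uri += "?" + connection_options
    return uri


if __name__ == "__main__":
    print(
        build_mongo_uri(
            "localhost:27017,localhost:27018",
            username="stuser",
            password="st@pw",
            connection_options="replicaSet=test&connectTimeoutMS=300000",
        )
    )
```

Percent-encoding the credentials matters because characters such as `@` and `:` would otherwise be read as URI delimiters.
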
**Tips:**
> 1. If the collection changes at a slow pace, it is strongly recommended to set the heartbeat.interval.ms parameter to an appropriate value greater than 0. When a SeaTunnel job is recovered from a checkpoint or savepoint, the heartbeat events push the resumeToken forward so that it does not expire.<br/>
> 2. MongoDB limits a single document to 16MB. Change documents include additional information, so even if the original document is not larger than 15MB, the change document may exceed the 16MB limit, which terminates the Change Stream operation.<br/>
> 3. It is recommended to use immutable shard keys. In MongoDB, shard keys can be modified once transactions are enabled, but changing the shard key can cause frequent shard migrations and additional performance overhead. Modifying the shard key can also make the Update Lookup feature ineffective, leading to inconsistent results in CDC (Change Data Capture) scenarios.<br/>
-#### example
+## How to Create a MongoDB Data Synchronization Job
-```conf
+The following example demonstrates how to create a data synchronization job that reads data from MongoDB and prints it on the local client:
+```hocon
env {
  # You can set engine configuration here
  execution.parallelism = 1
@@ -144,8 +140,8 @@ env {
source {
  MongoDB-CDC {
    hosts = "mongo0:27017"
-    database = "inventory"
-    collection = "inventory.products"
+    database = ["inventory"]
+    collection = ["inventory.products"]
    username = stuser
    password = stpw
    schema = {
@@ -159,6 +155,34 @@ source {
  }
}
+# Console printing of the read MongoDB data
+sink {
+  Console {
+    parallelism = 1
+  }
+}
+```
+The following example demonstrates how to create a data synchronization job that reads data from MongoDB and writes the CDC changes to a MySQL database:
+```hocon
+env {
+  # You can set engine configuration here
+  execution.parallelism = 1
+  job.mode = "STREAMING"
+  execution.checkpoint.interval = 5000
+}
+source {
+  MongoDB-CDC {
+    hosts = "mongo0:27017"
+    database = ["inventory"]
+    collection = ["inventory.products"]
+    username = stuser
+    password = stpw
+  }
+}
sink {
  jdbc {
    url = "jdbc:mysql://mysql_cdc_e2e:3306"
@@ -175,6 +199,72 @@ sink {
}
```
+The following example demonstrates how to create a data synchronization job that reads data from multiple MongoDB databases and collections and prints it on the local client:
+```hocon
+env {
+  # You can set engine configuration here
+  execution.parallelism = 1
+  job.mode = "STREAMING"
+  execution.checkpoint.interval = 5000
+}
+source {
+  MongoDB-CDC {
+    hosts = "mongo0:27017"
+    database = ["inventory","crm"]
+    collection = ["inventory.products","crm.test"]
+    username = stuser
+    password = stpw
+  }
+}
+# Console printing of the read MongoDB data
+sink {
+  Console {
+    parallelism = 1
+  }
+}
+```
+**Tips:**
+> 1. CDC synchronization of multiple databases and collections cannot specify a schema and can only output JSON data downstream.
+The following example demonstrates how to create a data synchronization job that uses regular expressions to read data from multiple MongoDB databases and collections and prints it on the local client:
+| Matching example | Expression | Description |
+|------------------|------------|-------------|
+| Prefix matching | ^(test).* | Matches database or collection names with the prefix test, such as test1 and test2. |
+| Suffix matching | .*[p$] | Matches database or collection names with the suffix p, such as cdcp and edcp. |
+```hocon
+env {
+  # You can set engine configuration here
+  execution.parallelism = 1
+  job.mode = "STREAMING"
+  execution.checkpoint.interval = 5000
+}
+source {
+  MongoDB-CDC {
+    hosts = "mongo0:27017"
+    # This example uses (^(test).*|^(tpc).*|txc|.*[p$]|t{2}).(t[5-8]|tt), which matches txc.tt and test2.test5.
+    database = ["(^(test).*|^(tpc).*|txc|.*[p$]|t{2})"]
+    collection = ["(t[5-8]|tt)"]
+    username = stuser
+    password = stpw
+  }
+}
+# Console printing of the read MongoDB data
+sink {
+  Console {
+    parallelism = 1
+  }
+}
+```
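
The regular expressions in the configuration above can be tried out before deploying a job. The following Python sketch is an approximation under two assumptions that this document does not confirm: that the database and collection patterns are joined with an unescaped `.`, as the comment in the configuration suggests, and that the result must match the whole `database.collection` identifier. The names `NAMESPACE_REGEX` and `is_captured` are hypothetical.

```python
import re

# Patterns copied from the configuration example above
DATABASE_PATTERN = r"(^(test).*|^(tpc).*|txc|.*[p$]|t{2})"
COLLECTION_PATTERN = r"(t[5-8]|tt)"

# Assumption: the patterns are joined with an unescaped "." (a wildcard, not a
# literal dot) and must match the entire "database.collection" identifier.
NAMESPACE_REGEX = re.compile(DATABASE_PATTERN + "." + COLLECTION_PATTERN)


def is_captured(namespace: str) -> bool:
    """Return True if the fully-qualified namespace matches the combined pattern."""
    return NAMESPACE_REGEX.fullmatch(namespace) is not None


if __name__ == "__main__":
    for namespace in ("txc.tt", "test2.test5", "cdcp.t7", "inventory.products"):
        print(namespace, is_captured(namespace))
```

Because the joining `.` matches any character and `.*` can run across it, `test2.test5` matches through the prefix branch even though `test5` on its own does not fully match `(t[5-8]|tt)`. Note also that `[p$]` is a character class matching a literal `p` or `$`, so the suffix behaviour comes from whole-string matching, not from an end anchor.
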
## Changelog
- Add MongoDB CDC Source Connector