
KafkaV2SourceConnector #38748

Merged
merged 25 commits into from
Feb 22, 2024
Conversation


@xinlian12 xinlian12 commented Feb 12, 2024

Feature request #38768

In this PR, we add the Kafka Cosmos DB source connector v2.

High level design

Using Change Feed Pull model

In the V1 version, the source connector uses the Change Feed Processor (CFP) to read changes (push model). In V2, we decided to go with the pull model for the following benefits:

  • No extra lease container required.
  • Utilizes Kafka's natively supported offset management mechanism.

CosmosDBSourceConnector

This is the main entry class. The most important methods to implement in this class are:

  • taskClass() -> returns the source task class name.
  • start() -> called when the connector is started. As a dynamic connector, we need to monitor changes on the monitored containers (containers being added/removed/recreated, container splits/merges). When changes are detected, task reconfiguration is requested so new task configs can be recalculated. A MetadataMonitorThread is started to monitor such changes.
  • taskConfigs() -> returns a list of task configurations. Each task will try to copy data from 1...N feed ranges. All feed ranges from all monitored containers are allocated to min(totalFeedRanges, maxTasks) task buckets in round-robin fashion; feed ranges from the same container are allocated to different task buckets. For example:
Containers to be monitored: container A, container B
Container A feedRanges: FeedRangeA1, FeedRangeA2
Container B feedRanges: FeedRangeB1, FeedRangeB2

maxTasks: 2
Task1: FeedRangeA1, FeedRangeB1
Task2: FeedRangeA2, FeedRangeB2

maxTasks: 4
Task1: FeedRangeA1
Task2: FeedRangeA2
Task3: FeedRangeB1
Task4: FeedRangeB2
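The allocation above can be sketched as follows. This is an illustrative simulation, not the actual connector code; the class and method names are hypothetical, and real feed ranges are SDK objects rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FeedRangeTaskAllocator {
    // Sketch: distribute the feed ranges of all monitored containers into
    // min(totalFeedRanges, maxTasks) task buckets in round-robin fashion,
    // so feed ranges of the same container land in different buckets.
    public static List<List<String>> allocate(List<List<String>> feedRangesPerContainer, int maxTasks) {
        int totalFeedRanges = feedRangesPerContainer.stream().mapToInt(List::size).sum();
        int taskCount = Math.min(totalFeedRanges, maxTasks);
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            buckets.add(new ArrayList<>());
        }
        int next = 0;
        for (List<String> containerRanges : feedRangesPerContainer) {
            for (String feedRange : containerRanges) {
                buckets.get(next % taskCount).add(feedRange); // round-robin assignment
                next++;
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> ranges = Arrays.asList(
            Arrays.asList("FeedRangeA1", "FeedRangeA2"),   // container A
            Arrays.asList("FeedRangeB1", "FeedRangeB2"));  // container B
        System.out.println(allocate(ranges, 2)); // [[FeedRangeA1, FeedRangeB1], [FeedRangeA2, FeedRangeB2]]
        System.out.println(allocate(ranges, 4)); // [[FeedRangeA1], [FeedRangeA2], [FeedRangeB1], [FeedRangeB2]]
    }
}
```

With maxTasks = 2 this reproduces the Task1/Task2 split shown above; with maxTasks = 4 each feed range gets its own task.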

Besides the data copy task units, we have to maintain certain necessary metadata: the list of container RIDs being monitored and the effective feed ranges for each monitored container. This metadata is needed so we can resume from the correct offset when a split/merge happens. The metadata is persisted in the _cosmos.metadata.topic topic, or in any topic the customer has configured.

The metadata info is calculated when the connector starts (either first-time initialization or restart), and one of the tasks is assigned to write the metadata info to the metadata topic.
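The container-monitoring behavior described for start() can be sketched as a simplified simulation. This is not the actual connector code: the snapshot supplier stands in for querying Cosmos DB, and the reconfiguration callback stands in for Kafka Connect's ConnectorContext.requestTaskReconfiguration().

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Sketch of a metadata monitor: periodically recompute the set of monitored
// (container, feedRange) pairs and request task reconfiguration on change.
public class MetadataMonitorThread extends Thread {
    private final Supplier<Set<String>> feedRangeSnapshot;   // real connector: queried from Cosmos DB
    private final Runnable requestTaskReconfiguration;       // real connector: ConnectorContext.requestTaskReconfiguration()
    private final long pollDelayMs;                          // cf. kafka.connect.cosmos.source.metadata.poll.delay.ms
    private Set<String> lastSnapshot = new HashSet<>();
    private volatile boolean running = true;

    public MetadataMonitorThread(Supplier<Set<String>> feedRangeSnapshot,
                                 Runnable requestTaskReconfiguration,
                                 long pollDelayMs) {
        this.feedRangeSnapshot = feedRangeSnapshot;
        this.requestTaskReconfiguration = requestTaskReconfiguration;
        this.pollDelayMs = pollDelayMs;
        setDaemon(true);
    }

    // Returns true if a change was detected and a reconfiguration was requested.
    public boolean checkForChanges() {
        Set<String> current = feedRangeSnapshot.get();
        if (!current.equals(lastSnapshot)) {   // container added/removed/recreated, or split/merge
            lastSnapshot = current;
            requestTaskReconfiguration.run();
            return true;
        }
        return false;
    }

    @Override
    public void run() {
        while (running) {
            checkForChanges();
            try {
                Thread.sleep(pollDelayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void shutdown() {
        running = false;
        interrupt();
    }
}
```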

CosmosSourceTask

This is the main entry class for the Kafka task. As mentioned previously, each task needs to pull data for 1...N feed ranges, and we loop through those feed ranges in round-robin fashion. For example, if a task is assigned feedRangeA and feedRangeB:
poll() -> reading from feedRangeA
poll() -> reading from feedRangeB
poll() -> reading from feedRangeA
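The round-robin poll loop above can be sketched with a rotating queue. This is an illustrative simulation (class and method names are hypothetical); the real task would fetch change feed records for the chosen feed range inside poll().

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

public class RoundRobinFeedRangePoller {
    // Sketch: each poll() takes the feed range at the head of the queue,
    // then rotates it to the tail, so assigned feed ranges are drained
    // in round-robin order.
    private final Queue<String> feedRangeQueue;

    public RoundRobinFeedRangePoller(String... assignedFeedRanges) {
        this.feedRangeQueue = new ArrayDeque<>(Arrays.asList(assignedFeedRanges));
    }

    public String poll() {
        String feedRange = feedRangeQueue.poll(); // next feed range to read from
        feedRangeQueue.offer(feedRange);          // rotate it to the back
        return feedRange; // the real task would fetch change feed records here
    }

    public static void main(String[] args) {
        RoundRobinFeedRangePoller task = new RoundRobinFeedRangePoller("feedRangeA", "feedRangeB");
        System.out.println(task.poll()); // feedRangeA
        System.out.println(task.poll()); // feedRangeB
        System.out.println(task.poll()); // feedRangeA
    }
}
```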

Tests

  • Unit tests
  • E2E tests with a Confluent Cloud Kafka cluster. The Cosmos DB container has 10 partitions with 100K RU and 1M items pre-populated; maxTasks is 10.
    (test result screenshots omitted)

Config

  • kafka.connect.cosmos.accountEndpoint -> No default. Cosmos DB Account Endpoint Uri.
  • kafka.connect.cosmos.accountKey -> No default. Cosmos DB Account Key.
  • kafka.connect.cosmos.useGatewayMode -> Default false. Flag to indicate whether to use gateway mode.
  • kafka.connect.cosmos.preferredRegionsList -> Default empty list. Preferred regions list to be used for a multi-region Cosmos DB account. This is a comma-separated value (e.g., [East US, West US] or East US, West US); the provided preferred regions will be used as a hint. You should use a collocated Kafka cluster with your Cosmos DB account and pass the Kafka cluster region as the preferred region. See the list of Azure regions here.
  • kafka.connect.cosmos.applicationName -> Default empty string. Will be added as the userAgent suffix.
  • kafka.connect.cosmos.source.database.name -> No default. Cosmos DB database name.
  • kafka.connect.cosmos.source.containers.includeAll -> Default false. Flag to indicate whether to read from all containers.
  • kafka.connect.cosmos.source.containers.includedList -> Default empty list. Containers to be monitored. This config will be ignored if kafka.connect.cosmos.source.containers.includeAll is true.
  • kafka.connect.cosmos.source.containers.topicMap -> Default empty list. A comma-delimited list of Kafka topics mapped to Cosmos containers, for example: topic1#con1,topic2#con2. By default the container name is used as the name of the Kafka topic to publish data to; this property can be used to override that default.
  • kafka.connect.cosmos.source.changeFeed.startFrom -> Default Beginning. ChangeFeed start-from setting (Now, Beginning, or a certain point in time (UTC), for example 2020-02-10T14:15:03).
  • kafka.connect.cosmos.source.changeFeed.mode -> Default LatestVersion. ChangeFeed mode (LatestVersion or AllVersionsAndDeletes).
  • kafka.connect.cosmos.source.changeFeed.maxItemCountHint -> Default 1000. The maximum number of documents returned in a single change feed request. Note that the number of items received might be higher than the specified value if multiple items were changed by the same transaction.
  • kafka.connect.cosmos.source.metadata.poll.delay.ms -> Default 300000 (5 minutes). The interval in milliseconds at which to poll for metadata changes (including container splits/merges and containers being added/removed/recreated). When changes are detected, the tasks are reconfigured.
  • kafka.connect.cosmos.source.metadata.storage.topic -> Default _cosmos.metadata.topic. The name of the topic where the metadata is stored. The metadata topic will be created if it does not already exist; otherwise the pre-created topic will be used.
  • kafka.connect.cosmos.source.messageKey.enabled -> Default true. Whether to set the Kafka record message key.
  • kafka.connect.cosmos.source.messageKey.field -> Default id. The field to use as the message key.
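As a sketch, a connector properties fragment wiring some of these settings together might look like the following. All values are placeholders, and the fully qualified connector class name is an assumption based on the CosmosDBSourceConnector class described above.

```properties
# Hypothetical source connector config fragment (placeholder values)
name=cosmos-source-connector-v2
connector.class=com.azure.cosmos.kafka.connect.CosmosDBSourceConnector
tasks.max=4
kafka.connect.cosmos.accountEndpoint=https://<your-account>.documents.azure.com:443/
kafka.connect.cosmos.accountKey=<your-account-key>
kafka.connect.cosmos.source.database.name=<your-database>
kafka.connect.cosmos.source.containers.includedList=containerA,containerB
kafka.connect.cosmos.source.containers.topicMap=topicA#containerA
kafka.connect.cosmos.source.changeFeed.startFrom=Beginning
kafka.connect.cosmos.source.changeFeed.mode=LatestVersion
kafka.connect.cosmos.source.metadata.storage.topic=_cosmos.metadata.topic
```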


azure-sdk commented Feb 14, 2024

API change check

APIView has identified API-level changes in this PR and created the following API reviews.

com.azure:azure-cosmos
com.azure.cosmos.kafka:azure-cosmos-kafka-connect

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).


@FabianMeiswinkel FabianMeiswinkel left a comment


Thanks Annie - looks great!

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

/azp run java - cosmos - tests


Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

Flaky tests: openConnectionsAndInitCachesWithCosmosClient_And_PerContainerConnectionPoolSize_ThroughProactiveContainerInitConfig_WithTimeout

@xinlian12

/check-enforcer override

@xinlian12 xinlian12 merged commit 30835d9 into Azure:main Feb 22, 2024
69 of 71 checks passed
xinlian12 pushed a commit to xinlian12/azure-sdk-for-java that referenced this pull request Mar 25, 2024
kushagraThapar pushed a commit that referenced this pull request Mar 26, 2024
* Revert "KafkaV2SinkConnector (#38973)"

This reverts commit 6c983ab.

* Revert "UsingTestContainerForKafkaIntegrationTests (#38884)"

This reverts commit 12bec49.

* Revert "KafkaV2SourceConnector (#38748)"

This reverts commit 30835d9.

* revert one more change

* revert change

---------

Co-authored-by: annie-mac <xinlian@microsoft.com>