# Apache Kafka Glossary

## Kafka Consumer

A Kafka Consumer is a client application in Apache Kafka that reads (or "consumes") data from Kafka topics. Consumers are part of Kafka’s distributed, publish-subscribe messaging system, enabling downstream applications to process, analyze, or store the data produced by Kafka producers. They play a critical role in the real-time data streaming model, allowing for flexible, high-throughput data processing.

### Key Concepts of Kafka Consumers

1. **Kafka Topics**
   - Topics are logical categories that organize data streams. Producers write data to topics, and consumers subscribe to one or more topics to read data.
   - Each topic is divided into multiple partitions to allow scalability and parallelism.
2. **Consumer Groups**
   - Kafka consumers belong to consumer groups. A consumer group is a set of consumers that cooperate to read from the partitions of a topic.
   - Each partition in a topic is assigned to exactly one consumer within a consumer group. This ensures that data is consumed in parallel without overlap, enabling scalability.
   - Multiple consumer groups can independently consume the same data stream.
3. **Message Offset**
   - Each record in a Kafka partition has a unique offset that identifies its position.
   - Consumers track offsets to maintain their progress through a topic, so that after a restart or failure they can resume from where they left off instead of reprocessing the entire partition.
   - Offsets can be committed automatically by the client or manually by the consumer application, providing flexibility in data processing.
4. **Pull-Based Consumption**
   - Kafka uses a pull model: consumers repeatedly poll Kafka for new data using the `poll()` method.
   - This approach lets consumers control the rate at which they consume data and apply backpressure when downstream processing slows down.
5. **Commit Offsets**
   - Consumers commit offsets to Kafka to indicate which messages have been processed successfully (see the sketch after this list).
   - Committing can be done automatically (e.g., at regular intervals) or manually, giving the consumer application precise control over when offsets are considered processed.
6. **Fault Tolerance**
   - Kafka's offset management helps consumers recover from failures by allowing them to resume from the last committed offset, ensuring reliability and consistency.
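To make the pull model and offset handling concrete, here is a minimal sketch of a consumer that disables auto-commit and commits offsets manually after each batch. The broker address, group id, and topic name are placeholders, not values from this glossary:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "example-group");            // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");          // we commit offsets manually below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));  // hypothetical topic name
            while (true) {
                // Pull model: the consumer asks the broker for new records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record carries its partition and offset (its position in that partition).
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Manual commit: mark everything returned by this poll as processed.
                consumer.commitSync();
            }
        }
    }
}
```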

### How Kafka Consumers Work

1. **Connect to Kafka Cluster**: A consumer establishes a connection to a Kafka broker using configuration properties (such as broker addresses and security credentials).
2. **Subscribe to Topics**: Consumers subscribe to one or more topics using the `subscribe()` method to start receiving messages.
3. **Fetch Data**: The consumer polls Kafka for records, which returns a batch of messages from its assigned partitions.
4. **Process Records**: After fetching records, the consumer processes each message, typically applying some business logic.
5. **Commit Offsets**: After successfully processing the batch, the consumer commits the offsets either automatically or manually (the sketch below walks through all five steps).
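A minimal sketch of this lifecycle, this time with automatic offset commits; the broker address, group id, and `orders` topic are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerLifecycle {
    public static void main(String[] args) {
        // Step 1: connect to the cluster via configuration properties.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "orders-processors");        // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "true");           // step 5 handled automatically
        props.put("auto.offset.reset", "earliest");        // no committed offset: start from the beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Step 2: subscribe to one or more topics.
            consumer.subscribe(List.of("orders"));
            for (int i = 0; i < 10; i++) {  // bounded loop so the example terminates
                // Step 3: fetch a batch of records from the assigned partitions.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // Step 4: process each record (business logic goes here).
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processing " + record.value());
                }
                // Step 5: offsets are committed automatically at the configured interval.
            }
        }
    }
}
```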

### Key Consumer Configurations

1. **`bootstrap.servers`**: List of Kafka broker addresses used to establish the initial connection.
2. **`group.id`**: Identifier for the consumer group to which the consumer belongs. All consumers with the same `group.id` share the work of consuming messages from the topic.
3. **`key.deserializer` / `value.deserializer`**: Classes that deserialize the message key and value into usable types (e.g., String, JSON).
4. **`enable.auto.commit`**: When set to `true`, the consumer client automatically commits offsets periodically, indicating that messages have been processed.
5. **`auto.offset.reset`**: Specifies the behavior when no committed offset exists for the group (both values appear in the sketch after this list):
   - `earliest`: Read from the beginning of the topic.
   - `latest`: Read only the most recent messages.
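As a sketch, the same settings expressed with the `ConsumerConfig` constants from the Java client; the broker address and group id are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // initial connection
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");               // consumer group identity
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually instead
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // no committed offset: start at the beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The consumer is now configured; subscribe() and poll() as in the earlier sketches.
        }
    }
}
```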

### Benefits of Kafka Consumers

1. **Scalability**: Adding more consumers to the same consumer group enables parallel processing by rebalancing partitions among the group members.
2. **Fault Tolerance**: Kafka's offset tracking ensures that, in the event of a failure, consumers can resume from the last committed offset.
3. **Flexible Delivery Semantics**: Kafka provides at-least-once delivery by default; exactly-once semantics can be achieved with idempotent producers, transactions, or Kafka Streams.
4. **Efficient Data Processing**: Kafka consumers handle high-throughput workloads efficiently, making them well suited to real-time stream processing.

### Common Use Cases for Kafka Consumers

1. **Real-Time Stream Processing**: Processing data streams in real time for analytics, monitoring, or alerting.
2. **Log Aggregation**: Consuming logs from Kafka and forwarding them to a centralized storage system for monitoring or analysis.
3. **ETL Pipelines**: Extracting, transforming, and loading data from Kafka into databases, data warehouses, or other destinations.
4. **Event-Driven Applications**: Building event-driven microservices that respond to specific events published to Kafka topics.

Kafka consumers are a vital part of Apache Kafka’s event streaming architecture, enabling applications to process and react to streams of data in real time, efficiently, and at scale.

## Kafka Producer

A Kafka Producer is a client application in Apache Kafka that sends (or "produces") messages to Kafka topics. Producers are responsible for publishing data into Kafka, forming the initial part of Kafka’s distributed, real-time data streaming and event processing system.

### Key Concepts of Kafka Producers

1. **Kafka Topics**
   - Topics are logical channels to which producers send messages.
   - Each topic is divided into partitions to distribute messages and enable parallelism.
   - Producers write to the partitions of a topic, which helps with load balancing and scaling.
2. **Partitions and Keys**
   - Partitions allow Kafka to scale horizontally. Each message is assigned to a partition, and Kafka guarantees strict ordering of messages within a partition.
   - The key of a message determines which partition the message is sent to. Messages with the same key are always sent to the same partition.
   - If no key is provided, the producer distributes messages across partitions itself (round-robin in older clients; a sticky strategy that fills one batch at a time since Kafka 2.4). See the keyed-message sketch after this list.
3. **Message Format**
   - Each message in Kafka consists of a key, a value, and optional headers.
   - The key is often used to preserve ordering for related messages by routing them to the same partition.
   - The value contains the actual data being transmitted.
4. **Acks (Acknowledgments)**
   - Kafka producers have an acknowledgment setting (`acks`) that controls the durability of sent messages:
     - `acks=0`: The producer does not wait for any broker confirmation (fast but not durable).
     - `acks=1`: The partition leader acknowledges once the message is written to its log (a balance between durability and performance; the default in older clients).
     - `acks=all` (or `acks=-1`): All in-sync replicas must acknowledge the message, ensuring the highest durability (the default since Kafka 3.0).
5. **Serialization**
   - Producers serialize the key and value before sending messages to Kafka. Serialization converts the data into the byte format that Kafka stores.
   - Examples: `StringSerializer`, `ByteArraySerializer`, or custom serializers.
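A minimal sketch of keyed sends, assuming a local broker and a hypothetical `user-events` topic; every record with the same user key lands in the same partition, preserving per-user ordering:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("user-42") => same partition => strict ordering for that user.
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged_in"));
            producer.send(new ProducerRecord<>("user-events", "user-42", "added_to_cart"));
            // No key: the producer picks the partition itself (sticky/round-robin).
            producer.send(new ProducerRecord<>("user-events", null, "heartbeat"));
            producer.flush();  // block until all buffered records are sent
        }
    }
}
```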

### Features of Kafka Producers

1. **High Throughput**: Kafka producers are designed for high throughput, efficiently sending millions of messages per second.
2. **Batching**: Producers batch multiple messages together and send them in a single request, which reduces network overhead and increases throughput. The `batch.size` and `linger.ms` configurations control batching behavior.
3. **Compression**: Kafka producers can compress message batches (using `gzip`, `snappy`, `lz4`, or `zstd`). Compression reduces the size of data sent to Kafka, improving throughput by minimizing network usage.
4. **Retries and Idempotence**
   - Retries: Producers retry sending messages after transient failures (e.g., network issues).
   - Idempotent producers (`enable.idempotence=true`): The broker de-duplicates retried messages, so retries cannot introduce duplicates and each message is written to its partition exactly once (see the sketch after this list).
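A sketch of a producer configured for these features, with the broker address as a placeholder; the batching and compression values shown are illustrative, not recommendations:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // broker de-duplicates retried sends
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // required by idempotence
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");     // compress batches on the wire
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");             // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");         // up to 64 KB per partition batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() as in the earlier sketch; batching, compression, and retries happen transparently.
        }
    }
}
```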

### How Kafka Producers Work

1. **Connect to Kafka**: The producer connects to a Kafka cluster using the `bootstrap.servers` configuration, which specifies the broker addresses.
2. **Serialization**: The producer serializes the key and value of each message using the configured serializers.
3. **Partition Assignment**: The producer assigns the message to a partition based on the message key (if provided), a default sticky/round-robin strategy, or a custom partitioner.
4. **Batching and Sending**: Messages are batched and sent to the broker, with acknowledgment requested according to the `acks` setting.
5. **Receive Acknowledgment**: Depending on the `acks` configuration, the broker acknowledges to the producer that the message has been successfully stored (see the sketch below).
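A sketch of this flow using the Java client's send-with-callback API; the topic, key, and broker address are hypothetical:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckAwareProducer {
    public static void main(String[] args) {
        // Step 1: connect via bootstrap.servers; steps 2-3 (serialization, partitioning)
        // are driven by the serializer configs and the record key.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");  // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("payments", "order-1001", "captured");  // hypothetical topic/key
            // Steps 4-5: send() batches the record; the callback fires when the
            // broker acknowledgment (or an error) comes back.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("stored at partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```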

### Key Configurations for Kafka Producers

1. **`bootstrap.servers`**: A list of Kafka broker addresses used to establish the initial connection.
2. **`key.serializer` / `value.serializer`**: Classes that serialize the key and value into byte arrays for transmission to Kafka.
3. **`acks`**: Controls the acknowledgment behavior (`0`, `1`, or `all`).
4. **`compression.type`**: Compresses message batches to reduce network usage and increase throughput (`none`, `gzip`, `snappy`, `lz4`, `zstd`).
5. **`linger.ms`**: Adds a small delay before sending so that more records can be batched together, improving throughput.
6. **`batch.size`**: Sets the maximum size (in bytes) of a batch of messages per partition.
7. **`enable.idempotence`**: Prevents duplicates caused by retries, giving exactly-once writes to a partition within a producer session (all of these settings appear together in the sketch after this list).
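As a consolidated sketch, here is each configuration above expressed via the Java client's `ProducerConfig` constants; the address and the numeric values are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigReference {
    public static void main(String[] args) {
        Properties props = new Properties();
        // 1. Initial connection (placeholder address).
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // 2. Serializers for key and value.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // 3. Acknowledgment behavior.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // 4. Batch compression codec.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
        // 5. Wait up to 10 ms to accumulate a fuller batch.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        // 6. Maximum batch size per partition, in bytes.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768");
        // 7. De-duplicate retried sends at the broker.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Producer is ready; send() as in the earlier sketches.
        }
    }
}
```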

### Use Cases for Kafka Producers

1. **Event Streaming**: Producers send events (e.g., user actions, system logs) to Kafka for real-time processing or analysis.
2. **Log Aggregation**: Producers collect logs from distributed systems and send them to Kafka for centralized analysis.
3. **Metrics Collection**: Producers gather metrics from applications or devices and send them to Kafka for monitoring and alerting.
4. **ETL Pipelines**: Producers feed data extracted from databases or other systems into Kafka as part of an Extract, Transform, Load (ETL) workflow.
5. **Data Integration**: Kafka producers serve as the starting point for integrating disparate data sources into a unified data pipeline.

Kafka producers play a vital role in Kafka's event streaming architecture, providing a reliable, scalable, and high-throughput way of feeding data into Kafka for real-time processing, analysis, and storage.