In most data processing pipelines the source is Apache Kafka, and I needed to monitor the Kafka consumption status and its lag in an external monitoring system. Therefore, I created this project using the following technologies; a minimal pipeline sketch follows the lists below:
- Spark Structured Streaming: scalable and fault-tolerant stream processing engine
- Kafka: message broker and source of the pipeline
- MinIO: distributed object storage for storing the processed data
- Delta Lake: an open-source storage framework that enables building a Lakehouse architecture with compute engines like Spark
- Prometheus, Prometheus Pushgateway, and Grafana: the monitoring stack
Docker images used:
- bitnami/kafka
- minio/minio
- prom/prometheus
- prom/pushgateway
- grafana/grafana
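
The job itself is a standard Structured Streaming pipeline that reads from Kafka and writes a Delta table to MinIO through the s3a connector. The following is only a minimal sketch of that idea: the broker address, MinIO endpoint, credentials, topic, and bucket paths are illustrative placeholders, and the spark-sql-kafka, delta-spark, and hadoop-aws packages are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka-to-delta")
    # Delta Lake integration (Delta 2.4 pairs with Spark 3.4.x)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # MinIO is S3-compatible, so it is reached through the s3a connector
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the Kafka topic as a streaming source.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Keep the payload as plain text; a real pipeline would parse it into a schema here.
events = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Write the stream into a Delta table stored on MinIO.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3a://lake/checkpoints/events")
    .outputMode("append")
    .start("s3a://lake/tables/events")
)

query.awaitTermination()
```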
My Spark version was 3.4.1 with Delta Lake 2.4. StreamingQueryListener is new to PySpark as of Spark 3.4.0: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html
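
The listener is what pushes each micro-batch's progress, including the Kafka lag numbers, to Prometheus. Below is a minimal sketch, assuming the prometheus_client package and a Pushgateway reachable at localhost:9091; the metric and job names are illustrative, and the offsets-behind metrics (e.g. maxOffsetsBehindLatest) only appear when the Kafka source reports them.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql.streaming import StreamingQueryListener


class LagToPushgatewayListener(StreamingQueryListener):
    """Pushes each micro-batch's progress metrics to a Prometheus Pushgateway."""

    def __init__(self, gateway="localhost:9091", job="spark_kafka_lag"):
        self.gateway = gateway
        self.job = job

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress
        # A fresh registry per batch so each push replaces the previous values.
        registry = CollectorRegistry()

        Gauge("spark_num_input_rows", "Rows read in the last micro-batch",
              registry=registry).set(progress.numInputRows)
        Gauge("spark_processed_rows_per_second", "Processing rate of the last micro-batch",
              registry=registry).set(progress.processedRowsPerSecond)

        # The Kafka source exposes lag-style metrics (minOffsetsBehindLatest,
        # maxOffsetsBehindLatest, avgOffsetsBehindLatest) in SourceProgress.metrics.
        # A single Kafka source is assumed here.
        for source in progress.sources:
            for name, value in (source.metrics or {}).items():
                try:
                    Gauge(f"spark_kafka_{name}", f"Kafka source metric {name}",
                          registry=registry).set(float(value))
                except ValueError:
                    pass  # skip non-numeric metric values

        push_to_gateway(self.gateway, job=self.job, registry=registry)

    def onQueryTerminated(self, event):
        pass
```

The listener is registered on the active session before the query starts, e.g. `spark.streams.addListener(LagToPushgatewayListener())`; Prometheus then scrapes the Pushgateway and Grafana charts the lag.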