Programmable 2026 Presentation
Building End-to-End Data Lineage with Kafka, Flink, and Spark
Understanding your data's complete lifecycle is critical in modern data ecosystems. This session provides a comprehensive tutorial on capturing and visualizing data lineage across a production-style data stack.
We will track data from a single Kafka topic as it fans out across three parallel pipelines: a Kafka Connect S3 sink connector for raw data archival, a real-time Flink DataStream job for live analytics, and a Flink Table API job that ingests the data into an Apache Iceberg table.
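To make the fan-out concrete, here is a minimal Java sketch of the two Flink paths, assuming a placeholder "orders" topic, a kafka:9092 broker, and an Iceberg catalog registered as "lake" (the S3 archival path is plain Kafka Connect configuration, so it is not shown here):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FanOutJobs {

    // Path 1: real-time analytics over the shared topic (DataStream API).
    static void liveAnalytics() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")              // placeholder broker
                .setTopics("orders")                            // placeholder topic
                .setGroupId("live-analytics")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        DataStream<String> orders =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source");
        orders.print(); // stand-in for the real analytics operators
        env.execute("orders-live-analytics");
    }

    // Path 2: continuous ingestion of the same topic into Iceberg (Table API).
    static void icebergIngestion() {
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(
                StreamExecutionEnvironment.getExecutionEnvironment());
        tEnv.executeSql(
                "CREATE TABLE orders_kafka (order_id STRING, amount DOUBLE, order_date STRING) WITH ("
              + " 'connector' = 'kafka', 'topic' = 'orders',"
              + " 'properties.bootstrap.servers' = 'kafka:9092',"
              + " 'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");
        // 'lake' is an Iceberg catalog assumed to be registered with the session.
        tEnv.executeSql("INSERT INTO lake.db.orders SELECT * FROM orders_kafka");
    }
}
```

In practice each path runs as its own Flink job, which is what lets Marquez render them as independent branches fanning out from the same Kafka dataset.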
From there, we'll follow the lineage as a batch Spark job consumes from the Iceberg table to generate downstream summaries. The entire multi-path lineage graph, including column-level details, will be visualized in the open-source project Marquez.
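On the batch side, wiring lineage into Spark is mostly configuration. A sketch, assuming Marquez is listening on localhost:5000, the openlineage-spark jar is on the classpath, and the same illustrative "lake" catalog and table names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrdersDailySummary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("orders-daily-summary")
                // OpenLineage: emit lineage events for every Spark action.
                .config("spark.extraListeners",
                        "io.openlineage.spark.agent.OpenLineageSparkListener")
                .config("spark.openlineage.transport.type", "http")
                .config("spark.openlineage.transport.url", "http://localhost:5000")
                .config("spark.openlineage.namespace", "batch")
                // Iceberg catalog wiring for the table the Flink job populates.
                .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.lake.type", "hadoop")
                .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
                .getOrCreate();

        // Read the Iceberg table written by Flink and produce a daily summary;
        // the listener records lake.db.orders as input and lake.db.orders_daily as output.
        Dataset<Row> summary = spark.sql(
                "SELECT order_date, SUM(amount) AS total_amount "
              + "FROM lake.db.orders GROUP BY order_date");
        summary.writeTo("lake.db.orders_daily").createOrReplace();

        spark.stop();
    }
}
```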
The solution is built around careful configuration of OpenLineage, the open standard for data lineage. We will start by establishing lineage from Kafka Connect with a custom single message transform (SMT). The session then explores two distinct Flink integration patterns, both sketched below: a straightforward listener-based approach and a more robust manual method that emits OpenLineage events directly. To complete the picture, we will show how to configure Spark so that its consumption of the Flink job's output is captured in the same graph, providing a practical, actionable blueprint for end-to-end data lineage.
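As a preview, the sketch below shows both Flink patterns: registering the openlineage-flink job listener, and hand-building a run event with the openlineage-java client. Class and builder names follow recent OpenLineage releases and may vary by version; the endpoint, namespaces, and dataset names are illustrative.

```java
import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpTransport;
import io.openlineage.flink.OpenLineageFlinkJobListener;
import org.apache.flink.core.execution.JobListener;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.net.URI;
import java.time.ZonedDateTime;
import java.util.List;
import java.util.UUID;

public class FlinkLineagePatterns {

    // Pattern 1: listener-based. The openlineage-flink listener inspects the
    // job's sources and sinks and emits lineage events when the job is submitted.
    static void registerListener(StreamExecutionEnvironment env) {
        JobListener listener = OpenLineageFlinkJobListener.builder()
                .executionEnvironment(env)
                .build();
        env.registerJobListener(listener);
    }

    // Pattern 2: manual. Build and emit OpenLineage run events yourself with
    // the Java client -- more work, but full control over datasets and facets.
    static void emitManually() {
        OpenLineageClient client = OpenLineageClient.builder()
                .transport(HttpTransport.builder()
                        .uri("http://localhost:5000")   // Marquez endpoint (assumed)
                        .build())
                .build();

        OpenLineage ol = new OpenLineage(URI.create("https://example.com/lineage-demo"));
        OpenLineage.RunEvent start = ol.newRunEventBuilder()
                .eventType(OpenLineage.RunEvent.EventType.START)
                .eventTime(ZonedDateTime.now())
                .run(ol.newRunBuilder().runId(UUID.randomUUID()).build())
                .job(ol.newJobBuilder()
                        .namespace("streaming").name("orders-iceberg-ingest").build())
                .inputs(List.of(ol.newInputDatasetBuilder()
                        .namespace("kafka://kafka:9092").name("orders").build()))
                .outputs(List.of(ol.newOutputDatasetBuilder()
                        .namespace("iceberg://lake").name("db.orders").build()))
                .build();
        client.emit(start);
    }
}
```

The listener costs a few lines and reports whatever the integration can introspect; the manual route is more code but lets you declare exactly the datasets and facets you want Marquez to show.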
Attendees will leave with a clear understanding of how to implement data lineage in their own streaming architectures, enabling better data governance, faster root-cause analysis, and greater trust in data. This practical guide is essential for data engineers, developers, and architects looking to gain true visibility into their data pipelines.