Jaehyeon Kim

About Jaehyeon

Developer Experience Engineer at Factor House

Jaehyeon leads Developer Experience at Factor House, where he builds tools and platforms that help engineering teams move quickly without sacrificing stability or governance. With expertise in real-time systems and modern data platforms, he has worked extensively with Kafka, Flink, Spark, and the integrated architectures showcased in Factor House Local. As a passionate engineer and writer, he shares practical insights on real-time analytics, data lineage, and building systems that remain resilient, observable, and maintainable at scale.
Programmable 2026 Presentation

Building End-to-End Data Lineage with Kafka, Flink, and Spark

Melbourne & Sydney
Data, Intelligence & the Developer
Understanding your data's complete lifecycle is critical in modern data ecosystems. This session provides a comprehensive tutorial on capturing and visualizing data lineage across a production-style data stack.

We will track data from a single Kafka topic as it fans out across multiple parallel pipelines: a Kafka S3 sink connector for raw data archival, a real-time Flink DataStream job for live analytics, and a Flink Table API job that ingests the data into an Apache Iceberg table.
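As a sketch of the archival branch, an S3 sink connector is typically registered through the Kafka Connect REST API. The connector name, topic, bucket, region, and Connect host below are placeholders, not values from the talk:

```shell
# Hypothetical example: register an S3 sink connector via the Kafka Connect REST API.
# All names (orders-s3-sink, orders, raw-archive, localhost:8083) are assumptions.
curl -X PUT http://localhost:8083/connectors/orders-s3-sink/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders",
    "s3.bucket.name": "raw-archive",
    "s3.region": "ap-southeast-2",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }'
```

Using `PUT .../connectors/<name>/config` rather than `POST /connectors` makes the call idempotent, which is convenient when re-running a local demo stack.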

From there, we'll follow the lineage as a batch Spark job consumes from the Iceberg table to generate downstream summaries. The entire multi-path lineage graph, including column-level details, will be visualized in the open-source project Marquez.
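Once lineage events are flowing, the graph can be inspected through the Marquez HTTP API as well as its UI. A minimal sketch, assuming a default local Marquez deployment; the namespace and dataset name are placeholders:

```shell
# Assumes Marquez is running locally with its API on port 5000.
# List the namespaces that have received OpenLineage events:
curl -s http://localhost:5000/api/v1/namespaces

# Fetch the lineage graph around a dataset node
# (my-namespace and orders are placeholder identifiers):
curl -s "http://localhost:5000/api/v1/lineage?nodeId=dataset:my-namespace:orders"
```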

This solution is built around the careful configuration of OpenLineage, the open standard for data lineage. We will start by establishing lineage from Kafka Connect with a custom Single Message Transform (SMT). The session then explores two distinct Flink integration patterns: a straightforward listener-based approach and a more robust manual method. To complete the picture, we will show how to configure Spark to seamlessly consume the Flink job's output, providing a holistic and actionable blueprint for end-to-end data lineage.
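On the Spark side, wiring in OpenLineage is largely configuration. A minimal spark-submit sketch, assuming an HTTP transport to a local Marquez instance; the artifact version, URL, namespace, and job script are assumptions, not values from the talk:

```shell
# Minimal sketch: attach the OpenLineage Spark listener via configuration.
# The package version, Marquez URL, namespace, and script name are assumptions.
spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.16.0 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://localhost:5000 \
  --conf spark.openlineage.namespace=demo \
  batch_summary_job.py
```

With the listener attached, each Spark action emits OpenLineage run events, so the batch job reading the Iceberg table appears in the same Marquez graph as the upstream Kafka and Flink stages.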

Attendees will leave with a clear understanding of how to implement data lineage in their streaming architectures. This enables better data governance, faster root cause analysis, and increased trust in data. This practical guide is essential for data engineers, developers and architects looking to gain true visibility into their data pipelines.