View Kafka & Iceberg together
To follow this page, we recommend running the demo first.
Streambased surfaces your Iceberg and Kafka data together. Let's see this in action by first finding our Kafka and Iceberg data within their respective systems.
The Kafka data
You can see your available Kafka data within AKHQ (a dashboard for Kafka).
Your topics are being populated in real time by ShadowTraffic. When your 'customers' and 'transactions' topics reach predetermined limits (1,000,000 and 500,000 messages respectively), their messages are moved from Kafka to Iceberg.

Once your dashboard looks like this ⬇️, your topics are in the following state:

1,000,000 'customers' messages now live in Iceberg.
10,000 'accounts' messages live in Kafka exclusively (no more data inbound).
Messages within 'transactions' are being updated in real time (you'll see this if you refresh the page).
The Iceberg data
Our Iceberg data lives within MinIO (at http://localhost:9001/browser/warehouse/; username: admin; password: password)

The structure of our Iceberg data is such that:
The 'transactions' folder maps onto the Kafka topic of the same name, and it contains historic Kafka data migrated to Iceberg by I.S.K.
So too, 'customers' maps onto the Kafka topic of the same name, but its data lives exclusively in Iceberg.
'Branches' does not map onto any Kafka topic; it is exclusively an Iceberg table.
View the data together
Now let's head to a Jupyter notebook at this address and see both sets.
Before getting started, make sure to run cells 1–6 below. After running those cells, you should see similar outputs to these for cells 2 and 6:

After this, run the following SQL query:
You will then see the available namespaces, which include hotset, coldset, and merged.
A namespace in Iceberg can be understood as a database, and a database in Streambased is a logical view of the data. The database is not a copy of your data: the Kafka data and the Iceberg data stay where they are. The hotset is the view of the data as found in Kafka, the coldset is the view of the data as found in Iceberg, and merged is the view of all the data from both Kafka and Iceberg (including topics and tables that exist in only one of the two).
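To make the idea of a "logical view" concrete, here is a toy model in plain Python. It is only an illustration of the concept, not how Streambased is implemented: the two dictionaries stand in for Kafka and Iceberg, the names come from the demo, and the row values and the functions hotset, coldset and merged are invented for this sketch.

```python
from itertools import chain

# Toy stand-ins for the two systems. The names come from the demo;
# the row values are invented for illustration.
kafka = {
    "accounts": ["a1", "a2"],      # lives in Kafka only
    "transactions": ["t3", "t4"],  # recent records still in Kafka
}
iceberg = {
    "customers": ["c1", "c2"],     # lives in Iceberg only
    "branches": ["b1"],            # Iceberg table with no Kafka topic
    "transactions": ["t1", "t2"],  # historic records moved to Iceberg
}


def hotset(name):
    """Rows as found in Kafka (empty if there is no such topic)."""
    return iter(kafka.get(name, []))


def coldset(name):
    """Rows as found in Iceberg (empty if there is no such table)."""
    return iter(iceberg.get(name, []))


def merged(name):
    """Rows from both systems, read at query time rather than copied."""
    return chain(coldset(name), hotset(name))


print(list(merged("transactions")))  # ['t1', 't2', 't3', 't4']

# A new record arriving in "Kafka" shows up in the merged view
# immediately, because the view never held its own copy of the data.
kafka["transactions"].append("t5")
print(list(merged("transactions")))  # ['t1', 't2', 't3', 't4', 't5']
```

Note that merged("accounts") and merged("branches") also return rows here, even though each of those lives in only one system.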
As you can see, Kafka data, Iceberg data, and Kafka & Iceberg data all exist as databases to be queried. Next, let's query each of those databases in turn using PySpark.
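As a preview, here is a minimal sketch of what comparing the three databases could look like. It rests on assumptions that this page does not confirm: that the notebook has already created a SparkSession (called spark here), and that tables can be addressed as namespace.table. The helper names count_query and count_in_each_namespace are hypothetical.

```python
# The three namespaces described above.
NAMESPACES = ("hotset", "coldset", "merged")


def count_query(namespace, table):
    """Build a row-count query for one table in one namespace.

    Assumes tables are addressable as <namespace>.<table>; your catalog
    configuration may require a different prefix.
    """
    return f"SELECT COUNT(*) AS n FROM {namespace}.{table}"


def count_in_each_namespace(spark, table):
    """Run the count in every namespace and return {namespace: count}.

    `spark` is assumed to be an existing SparkSession, for example the
    one created by the notebook's setup cells.
    """
    counts = {}
    for namespace in NAMESPACES:
        rows = spark.sql(count_query(namespace, table)).collect()
        counts[namespace] = rows[0]["n"]
    return counts
```

In the notebook, count_in_each_namespace(spark, "transactions") would then return one count per namespace. A table that is missing from a namespace (for example, a Kafka-only topic looked up in coldset) would likely make Spark raise an error, so this sketch is best tried on a table that exists in all three.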