View Kafka & Iceberg together

We recommend running the demo before following this page.

Streambased surfaces your Iceberg and Kafka data together. Let's see this in action by first finding our Kafka and Iceberg data within their respective systems.

The Kafka data

You can see your available Kafka data within AKHQ (a dashboard for Kafka).

Your topics are populated in real time by ShadowTraffic. When the 'customers' and 'transactions' topics reach their predetermined limits (1,000,000 and 500,000 messages respectively), their data is moved from Kafka to Iceberg.

Once your dashboard reflects these limits, the data is distributed as follows:

1,000,000 'customers' messages now live in Iceberg.
10,000 'accounts' messages live exclusively in Kafka (no more data is inbound).
Messages within 'transactions' are being updated in real time (you'll see this if you refresh the page).

The Iceberg data

Our Iceberg data lives within MinIO (at http://localhost:9001/browser/warehouse/; username: admin; password: password).

Our Iceberg data is structured as follows:

The transactions folder maps onto the Kafka topic of the same name and contains historic Kafka data migrated to Iceberg by I.S.K.
Likewise, the customers folder maps onto the Kafka topic of the same name, but its data lives exclusively in Iceberg.
The branches folder does not map onto any Kafka topic; it is exclusively an Iceberg table.

View the data together

Now let's head to a Jupyter notebook at this address and see both sets of data.

Before getting started, make sure to run cells 1–6 below. After running those cells, you should see similar outputs to these for cells 2 and 6:

After this, run the following SQL query:

spark.sql("SHOW DATABASES").show()

You should then see these namespaces:

+---------+
|namespace|
+---------+
|   hotset|
|  coldset|
|   merged|
+---------+
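As a follow-up to SHOW DATABASES, it can help to see which tables each namespace exposes. The sketch below is one possible way to do that, not part of the demo notebook: `tables_by_namespace` is a hypothetical helper name, and it assumes the `spark` session created by cells 1–6 is passed in. It relies only on Spark SQL's `SHOW TABLES IN <namespace>` statement.

```python
# Sketch only: `tables_by_namespace` is a hypothetical helper, not a Streambased API.
# It assumes `spark` is the SparkSession already created by the notebook.

def tables_by_namespace(spark, namespaces=("hotset", "coldset", "merged")):
    """Return {namespace: sorted list of table names} using SHOW TABLES."""
    result = {}
    for ns in namespaces:
        rows = spark.sql(f"SHOW TABLES IN {ns}").collect()
        # Each row returned by SHOW TABLES has a `tableName` column.
        result[ns] = sorted(row["tableName"] for row in rows)
    return result
```

In the notebook, calling `tables_by_namespace(spark)` after the setup cells should return one list of table names per namespace. The actual table names depend on the demo and are not assumed here.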

A namespace in Iceberg can be understood as a database, and a database in Streambased is a logical view of the data. The database is not a copy of your data: the Kafka data and the Iceberg data stay where they are. The hotset is the view of the data as found in Kafka, the coldset is the view of the data as found in Iceberg, and merged is the view of all the data from both Kafka and Iceberg (including topics and tables that exist in only one of them).
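To make the three views concrete, the following minimal sketch counts the rows of one table as seen through each view. It is an illustration under assumptions rather than the demo's own code: `count_per_view` is a hypothetical helper, `spark` is assumed to be the notebook's session, and the table name you pass (for example `transactions`) is assumed to exist in every view you query. A table missing from a view would cause Spark to raise an error.

```python
# Sketch only: `count_per_view` is a hypothetical helper, not a Streambased API.
# Assumes `spark` is the notebook's SparkSession and `table` exists in each view.

def count_per_view(spark, table, views=("hotset", "coldset", "merged")):
    """Return {view: row count} for `table`, querying each view in place."""
    counts = {}
    for view in views:
        query = f"SELECT COUNT(*) AS n FROM {view}.{table}"
        # collect() returns a single Row; read the aliased count column.
        counts[view] = spark.sql(query).collect()[0]["n"]
    return counts
```

In the notebook, `count_per_view(spark, "transactions")` would show how many rows each view currently reports. Because the topic is still receiving data, the hotset and merged counts can change between runs.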

"Hotset" and "coldset" are key terminology in Streambased's products. You can expect to regularly see both terms.

As you can see, Kafka data, Iceberg data, and Kafka & Iceberg data all exist as databases to be queried. Next, let's query each of those databases in turn using PySpark.

Last updated 3 months ago
