Query across Kafka & Iceberg

In order to follow this page, it's advised that you run the demo first.

In the last section we showed Kafka data and Iceberg data living separately, and three logical views: a view of Kafka, a view of Iceberg and a view of Kafka & Iceberg. Each of those logical views allows us to query the dataset using SQL.

The hotset

Within the Jupyter notebook, if we execute the SQL query below we can see the tables in the hotset database. We're going to surface Kafka data in an Iceberg table format.

spark.sql("SHOW TABLES IN hotset").show()

Running this we can see:

+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|   hotset|    accounts|      false|
|   hotset|   customers|      false|
|   hotset|transactions|      false|
+---------+------------+-----------+

The rows of our tableName column match the different topics in our Kafka cluster (again, found on AKHQ).

Let's see five messages from the transactions topic by running:

spark.sql("SELECT * FROM isk.hotset.transactions LIMIT 5").show()

You are now looking at a Kafka topic viewed as an Iceberg table.

+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|       TransactionID|AccountID|TransactionType|TransactionAmount|BranchID|CustomerFlaggedFraud|     TransactionTime|kafka_partition|kafka_offset|            kafka_ts|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|20391fe0-0237-235...|     5070|     Withdrawal|          5534.48|     129|               false| 2025-10-23 12:34:46|              0|      500000|2025-11-28 13:09:...|
|dfba7354-8e95-9a6...|     8905|     Withdrawal|          3628.58|     116|               false|2025-10-23 12:34:...|              0|      500001|2025-11-28 13:09:...|
|5e9fde7d-5a1c-d3b...|     7707|        Deposit|           257.14|     121|                true|2025-10-23 12:34:...|              0|      500002|2025-11-28 13:09:...|
|c4ba5a1b-0828-277...|     1997|        Deposit|          8249.96|     126|                true|2025-10-23 12:34:...|              0|      500003|2025-11-28 13:09:...|
|f704e321-f062-04c...|     9343|        Deposit|          7337.52|     130|                true|2025-10-23 12:34:...|              0|      500004|2025-11-28 13:09:...|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+

Note: as this demo uses live generated data your Kafka messages will differ from those displayed throughout this demo. The metadata does not change, however, nor does the branches table within Iceberg.

The view-based approach took the query, identified the relevant Kafka data, and served it back in an Iceberg-friendly format.

We can also see the Kafka topic's schema by describing the hotset:

spark.sql("DESCRIBE isk.hotset.transactions").show();

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|       TransactionID|              string|   NULL|
|           AccountID|              bigint|   NULL|
|     TransactionType|              string|   NULL|
|   TransactionAmount|              double|   NULL|
|            BranchID|              bigint|   NULL|
|CustomerFlaggedFraud|             boolean|   NULL|
|     TransactionTime|       timestamp_ntz|   NULL|
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|     kafka_partition|       |
|              Part 1|truncate(1000, ka...|       |
+--------------------+--------------------+-------+

Notice the additional metadata:

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|      ...........   |      ...........   | ..... |
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|      ...........   |      ...........   | ..... |
+--------------------+--------------------+-------+

Kafka metadata is also available in Iceberg tables, giving you greater visibility over your data.

The coldset

Remember, our Iceberg data currently lives within MinIO, and as expected we can see these tables too:

spark.sql("SHOW TABLES IN coldset").show()

+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|  coldset|    branches|      false|
|  coldset|   customers|      false|
|  coldset|transactions|      false|
+---------+------------+-----------+

The branches table has no corresponding Kafka topic; it exclusively lives in MinIO, along with the other folders:

Merged

Now, let's look at the data merged from both Kafka and Iceberg. To begin, we run the following query:

spark.sql("SHOW TABLES IN merged").show()

We'll see:

+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|   merged|    accounts|      false|  -> exclusively from Kafka
|   merged|    branches|      false|  -> exclusively from Iceberg
|   merged|   customers|      false|
|   merged|transactions|      false|
+---------+------------+-----------+

Looking at the message count is another excellent way of visualising the combined data. Run these three queries in one cell:

spark.sql("""
SELECT 'coldset', COUNT(*) FROM isk.coldset.transactions WHERE TransactionTime < now()
""").show()
spark.sql("""
SELECT 'merged', COUNT(*) FROM isk.merged.transactions WHERE TransactionTime < now()
""").show()
spark.sql("""
SELECT 'hotset', COUNT(*) FROM isk.hotset.transactions WHERE TransactionTime < now()
""").show()

First, we see how many transactions have been stored in the coldset:

+-------+--------+
|coldset|count(1)|
+-------+--------+
|coldset|  500000|
+-------+--------+

Second, we see how many messages we have when we merge the hotset with the coldset:

+------+--------+
|merged|count(1)|
+------+--------+
|merged|  521650|
+------+--------+

The difference between the coldset and merged represents the number of messages stored in Kafka. The final output shows us the contents of the hotset:

+------+--------+
|hotset|count(1)|
+------+--------+
|hotset|   21660|
+------+--------+

Note the difference of 10 between 21,650 and 21,660. As this is a real-time demo, 10 new messages arrived in between the second and third SQL queries! Your own numbers will differ, but will demonstrate the same point.

To round this off, let's display 10 messages from the transactions topic within the merged table:

spark.sql("SELECT * FROM isk.merged.transactions LIMIT 10").show()

+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|       TransactionID|AccountID|TransactionType|TransactionAmount|BranchID|CustomerFlaggedFraud|     TransactionTime|kafka_partition|kafka_offset|            kafka_ts|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|091e7ce6-fbf0-fa1...|     3339|        Deposit|          6194.14|     101|               false| 2024-02-25 06:40:46|              0|      334000|2025-12-03 14:08:...|
|c6fefdcd-24b0-813...|      347|     Withdrawal|          9852.57|     119|                true|2024-02-25 06:46:...|              0|      334001|2025-12-03 14:08:...|
|5fb0d413-a08b-3e7...|     9827|     Withdrawal|          6642.24|     124|               false|2024-02-25 06:51:...|              0|      334002|2025-12-03 14:08:...|
|3cc70379-f729-862...|     2445|     Withdrawal|          5428.94|     111|               false|2024-02-25 06:56:...|              0|      334003|2025-12-03 14:08:...|
|05133492-ed6e-86e...|      842|     Withdrawal|          3319.01|     102|               false|2024-02-25 07:01:...|              0|      334004|2025-12-03 14:08:...|
|29320372-8f39-b77...|     3691|     Withdrawal|          7050.36|     100|               false|2024-02-25 07:07:...|              0|      334005|2025-12-03 14:08:...|
|a50ec7bb-9dd1-848...|     6670|        Deposit|           924.18|     117|                true|2024-02-25 07:12:...|              0|      334006|2025-12-03 14:08:...|
|8c158d96-53fc-c6b...|     4762|        Deposit|          7988.16|     121|               false|2024-02-25 07:17:...|              0|      334007|2025-12-03 14:08:...|
|372e79ef-21e3-891...|     6386|     Withdrawal|          1908.83|     107|               false|2024-02-25 07:22:...|              0|      334008|2025-12-03 14:08:...|
|3b97212e-4d4e-6d1...|     4946|        Deposit|           445.53|     112|               false|2024-02-25 07:28:...|              0|      334009|2025-12-03 14:08:...|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+

Above, we have seen how Streambased enables you to have a unified view of your data, combining live Kafka messages with historic Iceberg data.

But you'll likely also want to make dashboards with this data. That is easily achieved with Streambased too, as we'll explore next.

Last updated 2 months ago

hashtagThe hotset

hashtagThe coldset

hashtagMerged

The hotset

The coldset

Merged