# Query across Kafka & Iceberg

In order to follow this page, it's advised that you [run the demo](/get-started/run-the-demo.md) first.

In the last section we showed Kafka data and Iceberg data living separately, and three **logical views**: a view of Kafka, a view of Iceberg and a view of Kafka & Iceberg. Each of those **logical views** allows us to query the dataset using SQL.

### The hotset

Within the Jupyter notebook, if we execute the SQL query below we can see the tables in the `hotset` database. We're going to surface Kafka data in an Iceberg table format.

```python
spark.sql("SHOW TABLES IN hotset").show()
```

Running this we can see:

```
+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|   hotset|    accounts|      false|
|   hotset|   customers|      false|
|   hotset|transactions|      false|
+---------+------------+-----------+
```

The rows of our `tableName` column match the different topics in our Kafka cluster (again, found on [AKHQ](http://localhost:9090/)). &#x20;

Let's see five messages from the transactions topic by running:

```python
spark.sql("SELECT * FROM isk.hotset.transactions LIMIT 5").show()
```

You are now looking at a Kafka topic **viewed as an Iceberg table.**

```
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|       TransactionID|AccountID|TransactionType|TransactionAmount|BranchID|CustomerFlaggedFraud|     TransactionTime|kafka_partition|kafka_offset|            kafka_ts|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|20391fe0-0237-235...|     5070|     Withdrawal|          5534.48|     129|               false| 2025-10-23 12:34:46|              0|      500000|2025-11-28 13:09:...|
|dfba7354-8e95-9a6...|     8905|     Withdrawal|          3628.58|     116|               false|2025-10-23 12:34:...|              0|      500001|2025-11-28 13:09:...|
|5e9fde7d-5a1c-d3b...|     7707|        Deposit|           257.14|     121|                true|2025-10-23 12:34:...|              0|      500002|2025-11-28 13:09:...|
|c4ba5a1b-0828-277...|     1997|        Deposit|          8249.96|     126|                true|2025-10-23 12:34:...|              0|      500003|2025-11-28 13:09:...|
|f704e321-f062-04c...|     9343|        Deposit|          7337.52|     130|                true|2025-10-23 12:34:...|              0|      500004|2025-11-28 13:09:...|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+


```

{% hint style="info" %}
Note: as this demo uses live generated data your Kafka messages will differ from those displayed throughout this demo. The metadata does not change, however, nor does the `branches` table within Iceberg.
{% endhint %}

The view-based approach took the query, identified the relevant Kafka data, and served it back in an Iceberg-friendly format.&#x20;

We can also see the Kafka topic's schema by describing the `hotset`:

```python
spark.sql("DESCRIBE isk.hotset.transactions").show();
```

```
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|       TransactionID|              string|   NULL|
|           AccountID|              bigint|   NULL|
|     TransactionType|              string|   NULL|
|   TransactionAmount|              double|   NULL|
|            BranchID|              bigint|   NULL|
|CustomerFlaggedFraud|             boolean|   NULL|
|     TransactionTime|       timestamp_ntz|   NULL|
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|     kafka_partition|       |
|              Part 1|truncate(1000, ka...|       |
+--------------------+--------------------+-------+

```

Notice the additional metadata:

```
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|      ...........   |      ...........   | ..... |
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|      ...........   |      ...........   | ..... |
+--------------------+--------------------+-------+
```

Kafka metadata is also available in Iceberg tables, giving you greater visibility over your data.

### The coldset

Remember, our Iceberg data currently lives within MinIO, and as expected we can see these tables too:

```python
spark.sql("SHOW TABLES IN coldset").show()
```

```
+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|  coldset|    branches|      false|
|  coldset|   customers|      false|
|  coldset|transactions|      false|
+---------+------------+-----------+

```

The `branches` table has no corresponding Kafka topic; it exclusively lives in MinIO, along with the other folders:

<figure><img src="/files/7ymE3tGUhXdW252XJoX3" alt=""><figcaption><p>From the previous page</p></figcaption></figure>

### Merged

Now, let's look at the data merged from both Kafka and Iceberg. To begin, we run the following query:

```python
spark.sql("SHOW TABLES IN merged").show()
```

We'll see:

```
+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|   merged|    accounts|      false|  -> exclusively from Kafka
|   merged|    branches|      false|  -> exclusively from Iceberg
|   merged|   customers|      false|
|   merged|transactions|      false|
+---------+------------+-----------+
```

Looking at the message count is another excellent way of visualising the combined data. Run these three queries in one cell:

```python
spark.sql("""
SELECT 'coldset', COUNT(*) FROM isk.coldset.transactions WHERE TransactionTime < now()
""").show()
spark.sql("""
SELECT 'merged', COUNT(*) FROM isk.merged.transactions WHERE TransactionTime < now()
""").show()
spark.sql("""
SELECT 'hotset', COUNT(*) FROM isk.hotset.transactions WHERE TransactionTime < now()
""").show()
```

First, we see how many transactions have been stored in the coldset:

```
+-------+--------+
|coldset|count(1)|
+-------+--------+
|coldset|  500000|
+-------+--------+
```

Second, we see how many messages we have when we merge the hotset with the coldset:

```
+------+--------+
|merged|count(1)|
+------+--------+
|merged|  521650|
+------+--------+

```

The difference between the coldset and merged represents the number of messages stored in Kafka. The final output shows us the contents of the hotset:

```
+------+--------+
|hotset|count(1)|
+------+--------+
|hotset|   21660|
+------+--------+
```

{% hint style="info" %}
Note the difference of 10 between 21,650 and 21,660. As this is a real-time demo, 10 new messages arrived in between the second and third SQL queries! Your own numbers will differ, but will demonstrate the same point.
{% endhint %}

To round this off, let's display 10 messages from the transactions topic within the merged table:

```python
spark.sql("SELECT * FROM isk.merged.transactions LIMIT 10").show()
```

```
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|       TransactionID|AccountID|TransactionType|TransactionAmount|BranchID|CustomerFlaggedFraud|     TransactionTime|kafka_partition|kafka_offset|            kafka_ts|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
|091e7ce6-fbf0-fa1...|     3339|        Deposit|          6194.14|     101|               false| 2024-02-25 06:40:46|              0|      334000|2025-12-03 14:08:...|
|c6fefdcd-24b0-813...|      347|     Withdrawal|          9852.57|     119|                true|2024-02-25 06:46:...|              0|      334001|2025-12-03 14:08:...|
|5fb0d413-a08b-3e7...|     9827|     Withdrawal|          6642.24|     124|               false|2024-02-25 06:51:...|              0|      334002|2025-12-03 14:08:...|
|3cc70379-f729-862...|     2445|     Withdrawal|          5428.94|     111|               false|2024-02-25 06:56:...|              0|      334003|2025-12-03 14:08:...|
|05133492-ed6e-86e...|      842|     Withdrawal|          3319.01|     102|               false|2024-02-25 07:01:...|              0|      334004|2025-12-03 14:08:...|
|29320372-8f39-b77...|     3691|     Withdrawal|          7050.36|     100|               false|2024-02-25 07:07:...|              0|      334005|2025-12-03 14:08:...|
|a50ec7bb-9dd1-848...|     6670|        Deposit|           924.18|     117|                true|2024-02-25 07:12:...|              0|      334006|2025-12-03 14:08:...|
|8c158d96-53fc-c6b...|     4762|        Deposit|          7988.16|     121|               false|2024-02-25 07:17:...|              0|      334007|2025-12-03 14:08:...|
|372e79ef-21e3-891...|     6386|     Withdrawal|          1908.83|     107|               false|2024-02-25 07:22:...|              0|      334008|2025-12-03 14:08:...|
|3b97212e-4d4e-6d1...|     4946|        Deposit|           445.53|     112|               false|2024-02-25 07:28:...|              0|      334009|2025-12-03 14:08:...|
+--------------------+---------+---------------+-----------------+--------+--------------------+--------------------+---------------+------------+--------------------+
```

Above, we have seen how Streambased enables you to have a unified view of your data, combining live Kafka messages with historic Iceberg data.

But you'll likely also want to make dashboards with this data. That is easily achieved with Streambased too, as we'll explore next.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.streambased.io/get-started/query-across-kafka-and-iceberg.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
