Architecture

As explored in the overview, I.S.K. (Iceberg Service for Kafka) serves Kafka data as Iceberg tables, combining real-time and historical data. On this page you'll discover how Iceberg and Kafka concepts are mapped in I.S.K., its high-level architecture, and how it uses indexing for performance optimisation.

Because its tables are materialised at runtime, I.S.K. decouples the physical structure of the data from its logical, table-based representation. With I.S.K. you can:

Access up-to-date (zero latency) data from analytical applications.
Impose structure on read in your analytical applications.
Logically repartition with no physical data movement.
Surface multiple “views” on the data with different characteristics defined at runtime.
Take advantage of analytical practices such as indexing, column statistics etc. to achieve massive performance enhancements.

Stream table duality

I.S.K. maps common streaming concepts to Iceberg-specific terms. This mapping means that event streaming data from Apache Kafka can be seamlessly represented in the tools and systems that consume Iceberg data. The table below shows these mappings:

| Kafka Unit | I.S.K. Unit | Description |
| --- | --- | --- |
| Cluster | Namespace | Each Kafka cluster is represented as an Iceberg namespace to indicate the separation of resources between them. |
| Topic | Table | As logical collections of data points, both topics and tables are equivalent. |
| Message | Row | An individual message within a topic can be viewed as a row within a table when combined with a schema to indicate a consistent structure. |
| Message field | Column | A Kafka message field is treated as a column within an Iceberg table, once again where a schema is provided. |
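The mapping above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `KafkaMessage`, `table_identifier` and `to_row` are hypothetical names, the payload is assumed to be JSON, and the schema is a plain dictionary of column names to Python types. It is not how I.S.K. itself is implemented.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaMessage:
    """Hypothetical stand-in for a consumed Kafka record."""
    cluster: str
    topic: str
    offset: int
    value: bytes  # assumed to be a JSON-encoded payload


def table_identifier(msg):
    """Cluster maps to the namespace, topic maps to the table."""
    return (msg.cluster, msg.topic)


def to_row(msg, schema):
    """Apply a schema on read: one message becomes one row.

    Only fields named in the schema become columns; fields missing
    from the payload are returned as None.
    """
    payload = json.loads(msg.value)
    row = {}
    for column, column_type in schema.items():
        raw = payload.get(column)
        row[column] = column_type(raw) if raw is not None else None
    return row


msg = KafkaMessage(
    cluster="prod",
    topic="DataTopic",
    offset=7,
    value=b'{"customer_name": "X", "amount": "12.5", "extra": 1}',
)
schema = {"customer_name": str, "amount": float}
namespace, table = table_identifier(msg)
row = to_row(msg, schema)
```

Here the same message could be surfaced with a different schema at read time without touching the data stored in Kafka, which is the point of imposing structure on read.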

High-level architecture

A good understanding of our architecture requires knowledge of Apache Iceberg fundamentals; you can find an explanation of the fundamentals here.

Here are the steps which run from indexing to querying, and through to your analytical application:

Prior to a query being run, I.S.K. indexes data in the streaming system, maintaining an up-to-date index of events.

When an Iceberg query is executed, for example:

SELECT … FROM DataTopic WHERE customer_name = 'X';

First: The analytical application queries I.S.K. for table metadata.

Second: I.S.K. generates the table metadata at runtime by querying the Streambased indexer service.

Third: I.S.K. returns table metadata back to your analytical application.

The table metadata returned in the previous step provides the locations of the manifest files (S3 paths).

First: The analytical application reads the manifest files directly from the storage service.

Second: I.S.K. generates manifest files with partitions / splits to enable efficient data pruning applicable to the specific query.

Third: I.S.K. returns the generated manifests back to the analytical application.

The analytical application reads data files from storage based upon the manifest files received above.

First: I.S.K. fetches data from the source streaming system, applying skips / splits to read only the data matching the query predicate.

Second: I.S.K. aggregates individual events into batch data files in memory.

Third: I.S.K. returns events batched as data files back to your analytical application.
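The steps above can be sketched as a toy, in-memory simulation. This is only an illustration under stated assumptions: the functions `table_metadata`, `manifests` and `data_file`, the `Split` type, the `manifest_location` value and the dictionary standing in for a topic are all hypothetical, and none of them reflect I.S.K.'s real interfaces or file formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of work: a contiguous offset range in one partition."""
    partition: int
    start_offset: int
    end_offset: int  # exclusive


# A toy "topic": partition -> events (the list index stands in for the offset).
TOPIC = {
    0: [{"customer_name": "X", "amount": 10}, {"customer_name": "Y", "amount": 5}],
    1: [{"customer_name": "Y", "amount": 7}, {"customer_name": "X", "amount": 3}],
}


def table_metadata(table_name):
    """Metadata phase: describe the table and say where manifests live."""
    return {"table": table_name, "manifest_location": "manifests/" + table_name}


def manifests(topic, batch_size):
    """Manifest phase: describe the topic as splits, one per future data file."""
    return [
        Split(partition, start, min(start + batch_size, len(events)))
        for partition, events in sorted(topic.items())
        for start in range(0, len(events), batch_size)
    ]


def data_file(topic, split, predicate):
    """Data phase: fetch one split and batch the matching events in memory."""
    events = topic[split.partition][split.start_offset:split.end_offset]
    return [event for event in events if predicate(event)]


meta = table_metadata("DataTopic")
splits = manifests(TOPIC, batch_size=1)
rows = [
    row
    for split in splits
    for row in data_file(TOPIC, split, lambda e: e["customer_name"] == "X")
]
```

In this sketch the analytical application only ever sees metadata, manifests and batched rows; the events themselves stay in the streaming system until the data phase asks for them.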

Indexing

To operate optimally, Kafka and Apache Iceberg process data at significantly different intervals. Kafka iterates through a small volume of messages at intervals measured in milliseconds, processing their content and moving on to the next batch immediately. Meanwhile, the systems that consume Apache Iceberg tables process data in batches, at intervals ranging from a couple of minutes to 24 hours.

To address this disparity in processing times, I.S.K. employs a set of indexing techniques to selectively read only the data required by the query. For instance, given the predicate "WHERE customer_name = 'X'", I.S.K. indexing drastically reduces the number of messages to be read from the underlying event stream by pruning the read requirement to only messages that contain the required field value.
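To make the idea of predicate-based pruning concrete, here is a minimal sketch using a simple inverted index from field value to message offsets. The source does not describe the index structures I.S.K. actually uses, so the inverted index, the `FieldIndex` class and the `read_matching` helper are all illustrative assumptions.

```python
from collections import defaultdict


class FieldIndex:
    """Hypothetical inverted index: field value -> offsets holding that value."""

    def __init__(self, field):
        self.field = field
        self._offsets = defaultdict(list)

    def add(self, offset, event):
        """Record the offset under the event's value for the indexed field."""
        if self.field in event:
            self._offsets[event[self.field]].append(offset)

    def lookup(self, value):
        """Return the offsets whose indexed field equals `value`."""
        return list(self._offsets.get(value, []))


def read_matching(log, index, value):
    """Read only the offsets the index points at, skipping everything else."""
    return [log[offset] for offset in index.lookup(value)]


# A toy event log; the list position stands in for the Kafka offset.
log = [
    {"customer_name": "X", "amount": 10},
    {"customer_name": "Y", "amount": 5},
    {"customer_name": "X", "amount": 3},
]

index = FieldIndex("customer_name")
for offset, event in enumerate(log):
    index.add(offset, event)

matches = read_matching(log, index, "X")
```

With the index in place, answering the predicate touches two of the three messages in this toy log; without it, every message would have to be read and filtered.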

Note: this is far from an exhaustive account of the performance enhancing techniques employed by I.S.K., and our capabilities are evolving as new use cases are addressed.
