Indexing Kafka Data

Streambased builds indexes over Kafka data and uses them to accelerate analytical queries. Indexes are created by a pipeline made up of the following configurable components.

Sources

A source represents a Kafka cluster from which data can be accessed. Because Streambased supports queries that span Kafka clusters, sources should be named according to a convention that is easy to use in SQL queries. For instance, in A.S.K. (Analytics Service for Kafka), every source is represented as a database schema, which makes it easy to switch between sources and to join across them.

Extractor

An extractor is responsible for mapping fields within a Kafka message to fields that are indexed. Custom mapping logic can be written by implementing a simple Java interface, and extractors for the most common cases are provided out of the box:

io.streambased.index.extractor.HeaderFieldsExtractor - Maps fields from Kafka message headers
io.streambased.index.extractor.JsonKeyFieldsExtractor - Maps fields from JSON Kafka message keys
io.streambased.index.extractor.JsonValueFieldsExtractor - Maps fields from JSON Kafka message values
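The shape of a custom extractor can be sketched roughly as follows. The actual Streambased interface is not shown on this page, so FieldExtractor, its extract method, and SelectedHeadersExtractor are hypothetical names used only for illustration; headers are modelled as a plain map to keep the example free of Kafka dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: a stand-in for the real Streambased extractor
// interface, whose name and signature are not given in this guide.
interface FieldExtractor {
    // Returns the index fields (name -> value) extracted from one message.
    Map<String, String> extract(Map<String, byte[]> headers, byte[] key, byte[] value);
}

// Illustrative extractor that copies a chosen set of headers into index fields.
class SelectedHeadersExtractor implements FieldExtractor {
    private final List<String> headerNames;

    SelectedHeadersExtractor(List<String> headerNames) {
        this.headerNames = headerNames;
    }

    @Override
    public Map<String, String> extract(Map<String, byte[]> headers, byte[] key, byte[] value) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : headerNames) {
            byte[] raw = headers.get(name);
            if (raw != null) {
                // Headers that are absent from the message are simply skipped.
                fields.put(name, new String(raw, StandardCharsets.UTF_8));
            }
        }
        return fields;
    }
}
```

In this sketch the key and value are ignored; an extractor for JSON keys or values would parse those byte arrays instead of reading the headers.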

Transformer

In some cases, a straight mapping of message field to indexed value is not the most efficient way to store data. For instance, timestamps are usually stored at millisecond granularity or finer, but this precision is rarely required for analytics. Rather than storing every millisecond as a separate index value, it is enough to store one value per minute, hour, or day.

To address this, Streambased has the concept of transformers: functions that are applied to extracted values before they are stored in the indexes.
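As a minimal sketch of the timestamp case, a transformer can round each epoch-millisecond timestamp down to the start of its bucket. The class name TimestampBucketTransformer and its transform method are assumptions made for this example and are not part of the Streambased API.

```java
import java.time.Duration;

// Hypothetical transformer: truncates millisecond timestamps to a coarser
// bucket (minute, hour, day) so the index stores far fewer distinct values.
class TimestampBucketTransformer {
    private final long bucketMillis;

    TimestampBucketTransformer(Duration bucket) {
        long millis = bucket.toMillis();
        if (millis <= 0) {
            throw new IllegalArgumentException("bucket must be at least one millisecond");
        }
        this.bucketMillis = millis;
    }

    // Returns the start of the bucket containing epochMillis.
    // floorDiv keeps the rounding direction consistent for negative timestamps.
    long transform(long epochMillis) {
        return Math.floorDiv(epochMillis, bucketMillis) * bucketMillis;
    }
}
```

With a bucket of Duration.ofMinutes(1), every timestamp within the same minute maps to the same index value, so a day of data yields at most 1,440 distinct values instead of up to 86.4 million.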

Aggregators

Aggregators compute aggregate values over sections of Kafka messages that can be used to accelerate analytical queries. For instance, an aggregator may pre-compute the sum of a numeric field in Kafka so that the query SELECT SUM(field) FROM table; can be answered without reading the messages themselves.
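The idea can be illustrated with a small sketch that keeps one running sum per section of messages. It assumes that a section is a fixed-size block of offsets; BlockSumAggregator and its methods are hypothetical names, and how Streambased actually partitions and stores aggregates is not described here.

```java
import java.util.TreeMap;

// Hypothetical aggregator: maintains one pre-computed sum per block of offsets.
class BlockSumAggregator {
    private final long blockSize;
    // Maps the first offset of each block to the sum of the field in that block.
    private final TreeMap<Long, Long> sumsByBlockStart = new TreeMap<>();

    BlockSumAggregator(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    // Called once per message at index time.
    void add(long offset, long fieldValue) {
        long blockStart = Math.floorDiv(offset, blockSize) * blockSize;
        sumsByBlockStart.merge(blockStart, fieldValue, Long::sum);
    }

    // Answers SUM(field) from the stored block sums, without touching messages.
    long total() {
        long total = 0;
        for (long blockSum : sumsByBlockStart.values()) {
            total += blockSum;
        }
        return total;
    }

    // Number of blocks that currently hold a pre-computed sum.
    int blockCount() {
        return sumsByBlockStart.size();
    }
}
```

At query time, the cost of total() depends on the number of blocks rather than the number of messages, which is where the acceleration in this sketch comes from.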

For more information on configuring these for your data please see the guide here
