Architecture
Streambased is a high-performance, SQL-oriented query engine designed specifically for Kafka. It outperforms existing solutions such as ksqlDB and Apache Flink, especially under workloads where Kafka is traditionally less efficient. By building directly on Kafka's native architecture and adding techniques such as indexing, predicate pushdown, and pre-aggregation, Streambased delivers fast, low-latency queries on both historical and real-time data. Its performance is comparable to modern data lakes, without the complexity or overhead of traditional data transformation.
Unlike other SQL-on-Kafka platforms, Streambased does not require additional tooling or complex access management infrastructure. It is fully compatible with the Kafka ecosystem, utilizing Kafka's security, schema registry, and authorization components. This integration reduces both complexity and cost, offering a lightweight, scalable solution that allows businesses to seamlessly query Kafka data without needing separate data processing systems or extensive engineering resources.
Real-time Data Access - Streambased enables real-time querying of data in Kafka, allowing analysts to access and analyze data as it streams in, without waiting for batch processes or ETL pipelines.
No ETL Required - Our solution removes the need for complex ETL operations by providing direct access to Kafka data, simplifying your data workflow and accelerating time to insight.
Integration with SQL Tools - Streambased integrates effortlessly with SQL-based analytical tools via JDBC/ODBC, enabling analysts to query Kafka data as they would with traditional databases.
Optimized Performance - Streambased utilizes indexing and query optimization techniques (such as predicate pushdown and pre-aggregation) to accelerate queries on Kafka, delivering up to 30x faster performance for key queries compared to traditional SQL-on-Kafka systems.
The architectural components and interactions within Streambased. The numbered arrows show a typical query flow.
Streambased is a parallel query execution engine that runs separately from the underlying Kafka cluster that provides the input data. It is designed to scale horizontally, limited only by the underlying Kafka cluster. Streambased interacts with the Kafka cluster using only the public Kafka API, which means the compute engine (Streambased) is completely decoupled from the storage engine (Kafka).
A Streambased deployment is made up of two services. The Streambased Server accepts queries from client processes and orchestrates their execution across the cluster. It also executes individual query fragments on behalf of other Streambased Servers. The other service is the Index Server, which collects index information from the underlying data and serves it to requesting Streambased Servers. Index data is typically 50-100x smaller than the data it represents.
When a new query is accepted, it is first planned and optimised by a single Streambased Server. This server becomes the coordinator for the query and assumes responsibility for orchestrating its execution among the other Streambased Servers in the cluster (1).
From here the query is parsed and split into a number of fragments (2 and 3) that can be executed concurrently. Each fragment is distributed to a node for execution (4), and these nodes consult Streambased index services to optimise their execution (5). When all fragments are complete, the results are returned to the query coordinator (6) and, when requested, back to the client (7).
Note: a separate background process (8) reads directly from Kafka and computes Streambased indexes, which are later served to accelerate queries.
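The query flow above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Streambased's implementation: the names `IndexServer`, `plan_fragments`, `execute_fragment`, and `run_query`, the one-fragment-per-partition split, the in-memory `TOPIC`, and the use of threads in place of cluster nodes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a Kafka topic: partition -> messages.
TOPIC = {
    0: [{"accountId": 1, "amount": 10}, {"accountId": 2, "amount": 5}],
    1: [{"accountId": 3, "amount": 7}, {"accountId": 4, "amount": 1}],
    2: [{"accountId": 1, "amount": 20}],
}


class IndexServer:
    """Toy index: for each partition, the set of accountId values it holds."""

    def __init__(self, topic):
        # Step 8: in the real system a background process builds the index.
        self.values = {p: {m["accountId"] for m in msgs} for p, msgs in topic.items()}

    def may_contain(self, partition, account_id):
        return account_id in self.values[partition]


def plan_fragments(topic):
    # Steps 2-3: split the query into fragments (assumed: one per partition).
    return sorted(topic)


def execute_fragment(partition, account_id, index):
    # Step 5: consult the index before reading any data.
    if not index.may_contain(partition, account_id):
        return []
    return [m for m in TOPIC[partition] if m["accountId"] == account_id]


def run_query(account_id, index):
    # Step 1: this function plays the role of the coordinator.
    fragments = plan_fragments(TOPIC)
    # Step 4: distribute fragments to workers (threads stand in for nodes).
    with ThreadPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(lambda p: execute_fragment(p, account_id, index), fragments))
    # Steps 6-7: gather partial results and hand them back to the client.
    return [row for part in partials for row in part]


index = IndexServer(TOPIC)
rows = run_query(1, index)
```

In this sketch the fragment for partition 1 returns immediately because the index shows it holds no matching messages, which is the kind of saving the index services are meant to provide.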
A.S.K. maps common streaming concepts to database-specific terms. This mapping means that event streaming data from Apache Kafka can be represented seamlessly in analytical tools. The table below shows these mappings:
| Kafka Unit | A.S.K. Unit | Description |
| --- | --- | --- |
| Cluster | Database Schema | Each Kafka cluster is represented as a database schema to indicate the separation of resources between them. |
| Topic | Table | As logical collections of data points, topics and tables are equivalent. |
| Message | Row | An individual message within a topic can be viewed as a row within a table when combined with a schema to indicate a consistent structure. |
| Message field | Column | A field within an event streaming message represents a single attribute of a data point, a concept similar to a column within a table row in A.S.K. |
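As a rough illustration of the mapping above, the following Python sketch turns messages into rows. It assumes JSON-encoded messages and a hand-written column list. The cluster name `prod_cluster`, the topic name `deposits`, and both helper functions are hypothetical and are not part of A.S.K.

```python
import json


def to_table_name(cluster, topic):
    # Cluster -> database schema, topic -> table.
    return f"{cluster}.{topic}"


def messages_to_rows(raw_messages, columns):
    # Message -> row, message field -> column; absent fields become None.
    rows = []
    for raw in raw_messages:
        record = json.loads(raw)
        rows.append(tuple(record.get(col) for col in columns))
    return rows


columns = ["accountId", "depositAmount"]
raw = ['{"accountId": 1, "depositAmount": 25.0}', '{"accountId": 2}']

table = to_table_name("prod_cluster", "deposits")
rows = messages_to_rows(raw, columns)
```

The fixed column list plays the part of the schema that gives messages a consistent structure.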
Consumers of event streaming systems and analytical tools built for data lakes have very different optimal performance characteristics. Event streaming systems expect to iterate regularly through a small volume of messages, process the message content, and then move on to the next batch immediately. Analytical workloads typically operate in a much less regular pattern over higher volumes of data.
To address this disparity, A.S.K. employs a set of indexing techniques to selectively read only the data required by the query. This includes the following (non-exhaustive) set of techniques:
Predicates in the incoming SQL are extracted, recorded, and then applied during execution using Streambased indexes. For instance, given the query:
Streambased would identify the predicate associated with the accountId column, allowing the read tasks spawned from the query to consult any indexes that exist for that column and vastly reduce the amount of data required to be read for the query.
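A minimal sketch of this idea in Python follows. The index shape is an assumption (a min/max range per chunk of the topic), since the actual layout of Streambased indexes is not described here. The SQL string, the `transactions` table name, the chunking, and all function names are likewise hypothetical stand-ins.

```python
import re

# Hypothetical topic split into offset ranges ("chunks") of messages.
CHUNKS = [
    [{"accountId": 1}, {"accountId": 7}],
    [{"accountId": 40}, {"accountId": 42}, {"accountId": 45}],
    [{"accountId": 90}, {"accountId": 99}],
]


def build_index(chunks, column):
    # Assumed index shape: (min, max) of the column for each chunk.
    return [(min(m[column] for m in c), max(m[column] for m in c)) for c in chunks]


def extract_equality_predicate(sql):
    # Tiny parser for "WHERE <column> = <integer>"; real SQL parsing is far richer.
    match = re.search(r"WHERE\s+(\w+)\s*=\s*(\d+)", sql, re.IGNORECASE)
    if match is None:
        raise ValueError("no supported predicate found")
    return match.group(1), int(match.group(2))


def scan_with_pushdown(sql, chunks, indexes):
    column, value = extract_equality_predicate(sql)
    ranges = indexes.get(column)  # None when no index exists for the column
    rows, chunks_read = [], 0
    for i, chunk in enumerate(chunks):
        if ranges is not None:
            low, high = ranges[i]
            if not (low <= value <= high):
                continue  # the index proves this chunk cannot match
        chunks_read += 1
        rows.extend(m for m in chunk if m[column] == value)
    return rows, chunks_read


sql = "SELECT * FROM transactions WHERE accountId = 42"  # hypothetical query
indexes = {"accountId": build_index(CHUNKS, "accountId")}
rows, chunks_read = scan_with_pushdown(sql, CHUNKS, indexes)
```

With the index in place only one of the three chunks is read. Without an index for the column, the same function falls back to reading every chunk.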
Pre-aggregation is the process of pre-computing aggregates (such as max, min, sum, and count) for configured sets of grouping columns and aggregate columns. The Streambased engine then parses incoming queries for elements that may benefit from such computations and applies the pre-computed values. For instance, the query:
Streambased would identify accountId as a grouping column and depositAmount as an aggregate column. The execution engine would then match the query to an eligible stored element and use the pre-computed values from the Streambased index rather than computing them by reading from Kafka, greatly improving query performance.
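The following Python sketch illustrates the principle, assuming a query of the form `SELECT accountId, SUM(depositAmount) FROM deposits GROUP BY accountId` (a hypothetical stand-in). The stored layout, the `PREAGG` lookup table, and the function names are assumptions made for illustration, not the Streambased index format.

```python
from collections import defaultdict

# Hypothetical messages from a deposits topic.
MESSAGES = [
    {"accountId": 1, "depositAmount": 10.0},
    {"accountId": 2, "depositAmount": 5.0},
    {"accountId": 1, "depositAmount": 2.5},
]


def aggregate(messages, group_col, agg_col):
    # Compute sum/count/min/max of agg_col for every value of group_col.
    stats = defaultdict(lambda: {"sum": 0.0, "count": 0, "min": None, "max": None})
    for m in messages:
        s, v = stats[m[group_col]], m[agg_col]
        s["sum"] += v
        s["count"] += 1
        s["min"] = v if s["min"] is None else min(s["min"], v)
        s["max"] = v if s["max"] is None else max(s["max"], v)
    return dict(stats)


# Built ahead of query time for the configured (grouping, aggregate) pair.
PREAGG = {("accountId", "depositAmount"): aggregate(MESSAGES, "accountId", "depositAmount")}


def answer(group_col, agg_col, func):
    stored = PREAGG.get((group_col, agg_col))
    source = "preaggregate"
    if stored is None:
        # No eligible stored element: fall back to scanning the messages.
        stored, source = aggregate(MESSAGES, group_col, agg_col), "scan"
    return {key: s[func] for key, s in stored.items()}, source


result, source = answer("accountId", "depositAmount", "sum")
```

Here the sum per account comes from the stored values, and a query on an unconfigured column pair falls back to a full scan.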
Streambased is a distributed system that is capable of joining data from multiple Kafka topics (and clusters) in order to achieve analytical results. These joins can involve data movement between nodes. In order to optimise this movement Streambased assesses the volume and shape of fragments to be joined and will move data between nodes only when necessary and according to a number of policies built into the Streambased engine. The result is acceleration beyond the read phase of complex queries.
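The policies themselves are not described here, so the following Python sketch only shows the general shape such a decision could take. The three strategies, the `broadcast_limit` threshold, and the function name are all assumptions borrowed from common distributed join practice, not Streambased's actual rules.

```python
def choose_join_strategy(left_rows, right_rows, same_partitioning, broadcast_limit=1000):
    """Pick a data-movement strategy for joining two fragments (hypothetical policy)."""
    if same_partitioning:
        # Matching keys already live on the same node: no movement needed.
        return "local"
    if min(left_rows, right_rows) <= broadcast_limit:
        # One side is small: copy it to every node holding the larger side.
        return "broadcast_smaller_side"
    # Both sides are large: redistribute both by join key.
    return "repartition_both"


strategies = [
    choose_join_strategy(5_000_000, 2_000_000, same_partitioning=True),
    choose_join_strategy(5_000_000, 200, same_partitioning=False),
    choose_join_strategy(5_000_000, 2_000_000, same_partitioning=False),
]
```

The point of the sketch is the ordering: the cheapest option (no movement) is checked first, and full redistribution is the last resort.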