Streambased Acceleration
At the core of Streambased is an indexing and acceleration technology that delivers 30-100x improvements on analytical queries. Streambased indexes Kafka data in several ways to serve a number of common query patterns; the most popular are described below.
Filtering
The most common analytical query is one that selects a subset of data from a larger set. A set of criteria is supplied to the query (e.g. via the SQL WHERE clause) and records that do not match these criteria are excluded from the results.
Most Kafka approaches to this problem follow the “read and drop” pattern: all records are read and the ones that don’t match the criteria are dropped from the results. This is extremely inefficient. Can we do better?
There is an indexing structure ideally suited to this problem: the bloom filter. Bloom filters test whether a particular value is present in a set. They provide a fast and space-efficient way to answer the following questions:
Does the set definitely not contain the value probed (answered with 100% certainty)?
Could the set contain the value (allowing for some false positives)?
Imagine a bloom filter over a selection of Kafka offsets (say offsets 0-1000 for partition 0). We can test this filter to see whether it contains our search values and, if the test shows it does not, skip reading that range entirely.
Applying this technique to a simple predicate query massively reduces the amount of data that must be read. In the extreme case where only one Kafka message matches our criteria, the bloom filters reduce the read requirement to just the messages covered by the single filter that matches.
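The sketch below illustrates the idea. It is not Streambased’s actual implementation: the block size, hash scheme, and the "country" field are hypothetical, and the toy bloom filter stands in for whatever index structure is really used.

```python
# Illustrative sketch only: one toy bloom filter per block of Kafka offsets,
# used to skip blocks that definitely contain no matching records.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits & (1 << pos) for pos in self._positions(value))


def build_block_index(records, block_size=1000):
    """Build one bloom filter per block of offsets for a hypothetical field."""
    index = {}  # (start_offset, end_offset) -> BloomFilter
    for start in range(0, len(records), block_size):
        block = records[start:start + block_size]
        bf = BloomFilter()
        for rec in block:
            bf.add(rec["country"])
        index[(start, start + len(block) - 1)] = bf
    return index


def query_with_index(records, index, country):
    """SELECT * WHERE country = ? -- read only the blocks that might match."""
    results = []
    for (start, end), bf in index.items():
        if not bf.might_contain(country):
            continue  # definitely no match in this offset range: skip the read
        results.extend(r for r in records[start:end + 1] if r["country"] == country)
    return results
```

If only one block’s filter reports a possible match, only that block’s records are ever read, which is the extreme case described above.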

Pre-aggregation
The next most common query submitted over a data set is an aggregate query. These typically answer questions such as “what is the count of my customers from each country I sell to?”. They contain a set of grouping criteria (country) and an aggregate function (count) and produce a result set with a row for each group/aggregate pair.
In many data platforms, hints are stored alongside the data to speed up queries of this type but, as with filtering, Kafka has none of these. Luckily, we can apply a similar principle to the one used for filtering above to close the gap.
Streambased allows you to specify a set of grouping fields and a set of aggregate fields that will be used in common queries of this type. Streambased indexing then pre-computes sum, count, min and max aggregates for these combinations over blocks of offsets, just as with filtering. The result is, for example, a pre-computed count for offsets 0-1000, 1001-2000, and so on.
When a query of this type is run, Streambased can use the pre-computed values rather than computing the values by reading records, removing both the expensive read and compute stages of the query.
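Here is a minimal sketch of that idea. Again, this is not Streambased’s implementation: the grouping field ("country"), aggregate field ("amount"), and block size are hypothetical, chosen only to show how per-block aggregates can answer a GROUP BY query without reading the underlying records.

```python
# Illustrative sketch only: pre-computed sum/count/min/max per block of offsets,
# then a GROUP BY answered from those pre-computed values alone.
from collections import defaultdict

def build_aggregate_index(records, group_field="country", agg_field="amount",
                          block_size=1000):
    """For each block of offsets, pre-compute sum/count/min/max per group value."""
    index = []  # one entry per block: {group_value: {"sum", "count", "min", "max"}}
    for start in range(0, len(records), block_size):
        block_aggs = {}
        for rec in records[start:start + block_size]:
            key, val = rec[group_field], rec[agg_field]
            agg = block_aggs.setdefault(key, {"sum": 0, "count": 0,
                                              "min": val, "max": val})
            agg["sum"] += val
            agg["count"] += 1
            agg["min"] = min(agg["min"], val)
            agg["max"] = max(agg["max"], val)
        index.append(block_aggs)
    return index


def count_by_group(index):
    """SELECT country, COUNT(*) ... GROUP BY country -- no record reads needed."""
    totals = defaultdict(int)
    for block_aggs in index:
        for key, agg in block_aggs.items():
            totals[key] += agg["count"]
    return dict(totals)
```

Because each block’s aggregates are already materialised, the query only merges small pre-computed summaries instead of scanning every record.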

The unindexed head
Streambased prides itself on using all of the very latest data in its queries; however, the techniques above can lag behind the latest data written to Kafka. Imagine a situation where 10010 messages have been written to Kafka but only 10000 have been indexed: it’s no good skipping those last 10 offsets in our query.
Streambased addresses this by reading the unindexed head (the last 10 messages) and combining it with the indexed data pre-computed above. By combining in this way you can be certain that your query includes everything available to it. One nice side effect is that, should something go wrong and index information not be available, Streambased query performance degrades gracefully according to how much index information is available, but the result is always correct. A query with only half its data indexed will be faster than a query with no data indexed and slower than a query with all data indexed; all three will, however, produce the same result.
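A rough sketch of the combination step, building on the hypothetical aggregate index above: offsets below `indexed_up_to` are answered from pre-computed counts, and only the unindexed head is read directly. The names and the split point are illustrative assumptions, not Streambased internals.

```python
# Illustrative sketch only: merge pre-computed results with a direct read of
# the unindexed head. Assumes the aggregate index covers offsets
# [0, indexed_up_to) and newer records must still be read from Kafka.
from collections import defaultdict

def count_by_group_with_head(records, index, indexed_up_to):
    """Combine pre-computed counts with a direct read of the unindexed head."""
    totals = defaultdict(int)

    # Fast path: counts for offsets [0, indexed_up_to) come from the index.
    for block_aggs in index:
        for key, agg in block_aggs.items():
            totals[key] += agg["count"]

    # Slow path: read and count only the unindexed head (e.g. the last 10 messages).
    for rec in records[indexed_up_to:]:
        totals[rec["country"]] += 1

    return dict(totals)
```

If no index is available, the slow path simply covers every offset, which is why results stay correct even as performance degrades.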