Unifying Stream Processing and Interactive Queries in Apache Kafka

Interactive Queries allows you to get more than just processing from streaming. It allows you to treat the stream processing layer as a lightweight, embedded database and directly query the state of your stream processing application, without needing to materialize that state to external databases or storage. Apache Kafka maintains and manages that state and guarantees high availability and fault tolerance. As such, this new feature enables the hyper-convergence of processing and storage into one easy-to-use application that uses the Apache Kafka’s Streams API.

交互式查询(Interactive Queries)并不仅仅是对流数据进行处理(stream processing),它相当于将流处理这一层看做是一个轻量级的、内置的数据库、直接查询流处理应用程序的状态,它不需要将状态(数据)物化(materialize)到外部数据库或存储系统中。Kafka会自己维护和管理状态,并且保证了高可用、容错。这个新特性使得处理逻辑变得高度聚合,并且可以使用Kafka的Streams API存储(状态)到一个易用的应用程序中。

The idea behind Interactive Queries is not new; a similar concept actually originated in traditional databases where it’s often known as “materialized views.” Though materialized views are very useful, the way they are implemented in databases is not suitable for modern application development – they force you to write your code all in SQL and deploy it into the database server. With our database background combined with our stream processing experience, we visited a key question: can the concept behind materialized views be applied to a modern stream processing engine to create a powerful and general purpose construct for building stateful applications and microservices? In this blog post, we show how Apache Kafka, through Interactive Queries, helps us do exactly that.

交互式查询背后的思想并不新鲜,在传统的数据库中相似的概念是“物化视图”,尽管物化视图非常有用,但是数据库的这种实现方式对现代的应用程序开发并不适合(因为他们会强制你在SQL中编写所有的代码,并部署到服务端)。结合数据库的背景知识和流处理的体验,我们在思考一个关键的问题:物化视图是否如果应用到现代的流处理引擎,就可以构建出一个强壮的、通用的,有状态的应用程序(或者微服务),本篇博文我们会向你展示Apache Kafka如何通过交互式查询帮助我们实现了这个目标。

When we set out to design the stream processing API for Apache Kafka – Kafka Streams – a key motivation was to rethink the existing solution space for stream processing. Here, our vision has been to move stream processing out of the big data niche and make it available as a mainstream application development model. For us, the key to executing that vision is to radically simplify how users can process data at any scale – small, medium, large – and in fact, one of our mantras is “Build Applications, Not Clusters!” In the past, we wrote about three ways Apache Kafka simplifies the stream processing architecture – by eliminating the need for a separate cluster, having good abstractions for streams and tables and keeping the overall architecture simple. Interactive Queries is another feature to enable this vision.

在为Kafka设计流式处理API时(即Kafka Streams),一个重要的动机是重新思考已有的流处理系统的局限性,我们的焦点已经从大数据领域转到了流处理,并且将其作为可用的主流的应用程序开发模型。执行这个愿景的关键是从根本上简化用户如何处理数据的扩展性问题(不管是小批量的数据、中等规模的、大规模的数据),实际上我们的一个口号是:“构建应用程序,不要集群!”。之前的博文中我们写了Kafka简化流处理的三种模式:不需要单独的集群、对Streams和Table进行良好的抽象、保持整体架构的简洁。现在,交互式查询是这个愿景的另一个特性。

In this blog post, we’ll start by digging deeper into the motivation behind Interactive Queries through a concrete example that outlines its applicability. Then we will describe how Interactive Queries works under the hood and provide a summary of related resources for further reading.



Let’s use an end-to-end example to pick up where we left off in our previous article on why stream processing applications need state. In that article, we described some simple stateful operations, e.g., if you are grouping data by some field and counting, then the state you maintain would be the counts that have accumulated so far. Or if you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.


Now as a driving example in this blog, consider a financial institution, like a wealth management firm or a hedge fund, that maintains positions in assets held by the firm and/or its client investors. Maintaining positions means that the bank needs to keep track of the risk associated with those particular assets. The bank continuously collects business events and other data that could potentially influence the risk associated with a given position. This data includes market data fluctuations on the price of the asset, foreign exchange rates, research, or even news information that could influence the reputation of people involved with the asset. Any time this data changes, the risk position needs to be recalculated in order to keep a real-time view of the risk associated with each individual asset as well as on entire portfolios of investments.

Real-time risk management is an example of a stateful application. At a minimum, state is needed to keep track of the latest position for every asset. State is also needed inside the stream processing engine to keep track of various aggregate statistics, like the number of times an asset is traded in a day and the average bid/ask spread. The collected state needs to be continuously updated and queried.



在没有Kafka Streams之前,典型的架构图如下:
arch before


  1. 一个额外的Hadoop集群重复处理数据
  2. 在流处理层也要维护存储,因为流处理需要查询和聚合
  3. 流处理作业和批处理作业的输出都需要维护存储和数据库
  4. 写放大无法避免,

  1. 1. 示例:实时风险管理
  2. 2. 现在的做法