Kafka技术内幕

《Kafka技术内幕》
Turing Community page |
ChinaPub purchase link |
JD.com purchase link |


Update:

2017-11-22: An internal talk I gave at my company: Building a Streaming Data Processing Platform with Kafka

About This Book:

This book is based primarily on the Kafka 0.10 source code and analyzes the implementation details of Kafka's internal components through illustrated walkthroughs, with nearly 400 original figures. For some of the newer stream-processing features, it also analyzes the relevant 0.11 source code. The main content of each chapter is as follows.

  • Chapter 1 first introduces the three components that make Kafka a streaming data platform: the messaging system, the storage system, and the stream processing system, along with the three models behind Kafka's basic concepts: the partition model, the consumption model, and the distributed model. It then covers several of Kafka's important design ideas, and finally shows how to simulate both standalone and distributed deployments on a single machine, and how to set up a source-code development environment.
  • Chapter 2 starts from a producer example and introduces the two message-sending styles of the new producer. The producer client uses the record accumulator and the sender thread to group and buffer record batches, creates produce requests for target nodes, and sends them to different brokers. The chapter then covers the network-related NIO operations: the Kafka channel, the selector, and polling. It also covers the old Scala producer, which sends requests over a blocking channel, and finally explains how the server handles client requests with the Reactor pattern.
  • Chapter 3 first introduces basic consumer concepts, then starts from a consumer example to introduce the ZooKeeper-based high-level consumer API; understanding the high-level API mainly means understanding the consumer thread model and how variables are passed around. It then covers the two ways a consumer commits partition offsets. Finally, it walks through a low-level API example, where developers must implement fairly complex logic themselves to keep the consuming program robust and stable.
  • Chapter 4 introduces the new consumer. Unlike the old consumer, it removes the ZooKeeper dependency, unifies the old high-level and low-level APIs, and provides two consumption modes: subscribe and assign. The new consumer introduces a subscription state to manage subscription information and uses a fetcher to pull messages. Instead of fetcher threads, it pulls messages with a polling loop, which performs better than the old consumer. The chapter also shows how the consumer uses callbacks, handlers, listeners, adapters, the composite pattern, and chained calls to implement different kinds of asynchronous requests. Finally, it covers the new consumer's heartbeat task, offset commits, and how to achieve the three message-processing semantics.
  • Chapter 5 introduces the coordinator implementation behind the new consumer, centered on the "join group" and "sync group" protocols. Each consumer has a client-side coordinator, and the server has a per-group coordinator that handles the requests of all consumer clients in the group. When a consumer group triggers a rebalance, the server-side coordinator records changes to the group metadata and uses a state machine to guarantee valid group state transitions. The chapter uses many example scenarios to help readers understand the consumer-group implementation.
  • Chapter 6 introduces Kafka's storage layer, including common log operations such as log reads and writes, log management, and log compaction. On the server side, the replica manager handles produce and fetch requests from clients. The chapter then covers replication-related concepts such as partitions, replicas, the high watermark, and the replication point. Finally, it covers the delayed-operation interface and the delayed-operation cache: if the server cannot respond to a client immediately, it caches the delayed operation until the request completes or times out.
  • Chapter 7 introduces the Kafka controller, the core of the server side. The controller manages the partition state machine and the replica state machine, as well as several listeners, such as broker startup and shutdown, topic deletion, and partition reassignment. One important responsibility of the controller is electing partition leaders; based on the requests the controller sends down, each broker decides whether to become the leader or a follower of a partition. The chapter also analyzes the difference between local and remote replicas, and the role of the metadata cache.
  • Chapter 8 first introduces two cluster replication tools: Kafka's built-in MirrorMaker and Uber's open-source uReplicator. It then introduces the connector framework provided by newer Kafka versions and shows how to develop a custom connector. Finally, it covers the concrete implementation of the connector architecture, mainly the data model, the connector model, and the worker model.
  • Chapter 9 introduces the two Kafka Streams APIs: the low-level Processor API and the high-level DSL. The chapter focuses on the stream-processing threading model, including stream instances, stream threads, and stream tasks. It also introduces local state stores, which mainly serve as the data source for standby-task recovery. The high-level DSL includes two components, KStream and KTable, both of which define common stream-processing operators, such as stateless operations (filter, map, etc.) and stateful operations (joins, windows, etc.).
  • Chapter 10 introduces some advanced Kafka features, such as client quotas, the new message format, and transactions.
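The per-key partition model from Chapter 1 can be sketched in a few lines of plain Java, with no Kafka dependencies. Note this is only an illustration of the idea: Kafka's real default partitioner uses murmur2 hashing, not `String.hashCode()`.

```java
public class KeyPartitioner {
    // Records with the same non-null key always map to the same partition,
    // which is what preserves per-key ordering in Kafka's partition model.
    public static int partition(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

The same key always yields the same partition, while different keys spread across the partition space.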

The example code for this book is on my GitHub page at https://github.com/zqhxuyuan/kafka-book. In addition, for reasons of length, the appendices are published on my personal blog. Errors are inevitable; if you find anything wrong while reading this book, please message me on Weibo at http://weibo.com/xuyuantree, and I will periodically update the errata on my blog.

English Introduction

《Apache Kafka Internals》

This book is mostly based on Kafka 0.10, with some 0.11 material for streaming. It uses nearly 400 figures to analyze Kafka's internal implementation, moving from the clients to the coordinator and from the storage layer to the controller, and also covering Kafka Connect and Kafka Streams. Here is an overview of each chapter:

Chapter 1: As a streaming platform, Kafka is composed of a messaging system, a storage system, and a stream processing system. Three models underlie Kafka's basic concepts: the partition model, the consumer model, and the distributed model. We also introduce some of Kafka's important design ideas, such as filesystem persistence, data transfer, producers and consumers, and replication and high availability.

Chapter 2: Starting from a producer example, we show how the client sends messages. The workflow includes the record accumulator, the sender thread, grouping messages into batches, creating requests, and finally sending them to the different target brokers. We then introduce the Kafka channel and selector, and how the server uses the NIO Reactor pattern to handle client requests.
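The record accumulator's role can be illustrated with a toy batching structure in plain Java. This is a deliberate simplification: Kafka's actual `RecordAccumulator` also handles memory pooling, linger time, retries, and per-broker grouping.

```java
import java.util.*;

public class MiniAccumulator {
    private final Map<Integer, List<String>> batches = new HashMap<>();
    private final int batchSize;

    public MiniAccumulator(int batchSize) { this.batchSize = batchSize; }

    // Append a record to its partition's batch; return true once that
    // batch is full and ready for the sender thread.
    public boolean append(int partition, String record) {
        List<String> batch = batches.computeIfAbsent(partition, p -> new ArrayList<>());
        batch.add(record);
        return batch.size() >= batchSize;
    }

    // Drain all full batches, grouped by partition, as a sender thread would
    // before turning them into produce requests for the target brokers.
    public Map<Integer, List<String>> drain() {
        Map<Integer, List<String>> ready = new HashMap<>();
        Iterator<Map.Entry<Integer, List<String>>> it = batches.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, List<String>> e = it.next();
            if (e.getValue().size() >= batchSize) {
                ready.put(e.getKey(), e.getValue());
                it.remove();
            }
        }
        return ready;
    }
}
```

Batching per partition is what lets the sender amortize network round trips into fewer, larger requests.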

Chapter 3: Starting from an old high-level consumer example, we move into the ZooKeeper-based API. The key to the high-level consumer is its consumer thread model. We then introduce the two ways to commit consumer offsets: to ZooKeeper or to an internal topic. After that, we illustrate how to write a low-level consumer that processes messages with stability and robustness.

Chapter 4: The new consumer client uses a subscription state and a polling fetcher instead of fetcher threads. We also show how the consumer uses callbacks, handlers, listeners, adapters, and chained calls to implement different asynchronous request modes. Finally, we introduce the heartbeat, offset commits, and the three processing semantics: at-most-once, at-least-once, and exactly-once.
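The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. A small crash-and-restart simulation in plain Java (no Kafka dependencies) makes this concrete:

```java
import java.util.*;

public class DeliverySemantics {
    // Simulate one consumer run that crashes while handling record `crashAt`,
    // then a restart that resumes from the last committed offset.
    // commitFirst = true  -> commit before processing (at-most-once)
    // commitFirst = false -> process before committing (at-least-once)
    public static List<String> run(List<String> records, boolean commitFirst, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        for (int i = 0; i < records.size(); i++) {
            if (commitFirst) {
                committed = i + 1;                 // offset committed first...
                if (i == crashAt) break;           // ...so a crash loses the record
                processed.add(records.get(i));
            } else {
                processed.add(records.get(i));     // record processed first...
                if (i == crashAt) break;           // ...so a crash duplicates it
                committed = i + 1;
            }
        }
        // Restart: resume from the committed offset with no further crashes.
        for (int i = committed; i < records.size(); i++) {
            processed.add(records.get(i));
        }
        return processed;
    }
}
```

Committing first drops record "b" (at-most-once); processing first replays it after the restart (at-least-once). Exactly-once requires making processing and commit atomic, which is beyond this sketch.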

Chapter 5: The new consumer communicates with the server-side coordinator through the ConsumerCoordinator; two request/response pairs are mainly involved: join-group and sync-group. This process is also called a consumer group rebalance. We discuss how the server-side coordinator uses a state machine to guarantee valid group state transitions, such as PreparingRebalance, AwaitingSync, and Stable. The chapter also presents several scenarios to help readers understand how consumer groups behave in production environments.
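The group state transitions can be sketched as a tiny state machine. The states and transitions below follow the 0.10 coordinator's model, but this is a simplified illustration, not the server's actual code:

```java
import java.util.*;

public class GroupStateMachine {
    public enum State { EMPTY, PREPARING_REBALANCE, AWAITING_SYNC, STABLE, DEAD }

    // Allowed transitions, e.g. a rebalance must pass through
    // PREPARING_REBALANCE -> AWAITING_SYNC before the group is STABLE again.
    private static final Map<State, EnumSet<State>> VALID = new EnumMap<>(State.class);
    static {
        VALID.put(State.EMPTY, EnumSet.of(State.PREPARING_REBALANCE, State.DEAD));
        VALID.put(State.PREPARING_REBALANCE, EnumSet.of(State.AWAITING_SYNC, State.EMPTY, State.DEAD));
        VALID.put(State.AWAITING_SYNC, EnumSet.of(State.STABLE, State.PREPARING_REBALANCE, State.DEAD));
        VALID.put(State.STABLE, EnumSet.of(State.PREPARING_REBALANCE, State.DEAD));
        VALID.put(State.DEAD, EnumSet.noneOf(State.class));
    }

    private State state = State.EMPTY;

    public State state() { return state; }

    public void transitionTo(State target) {
        if (!VALID.get(state).contains(target))
            throw new IllegalStateException(state + " -> " + target + " is not a valid transition");
        state = target;
    }
}
```

Rejecting invalid transitions is what lets the coordinator reason safely about concurrent join, sync, heartbeat, and leave requests.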

Chapter 6: Kafka's storage layer covers log reads and writes, the log manager, and log compaction. On the server side, the ReplicaManager is responsible for handling client requests. We then introduce replication concepts such as partitions, replicas, the high watermark (HW), and the log end offset (LEO). Finally, we introduce delayed operations and the delayed-operation purgatory: if the server cannot respond to a client immediately, it caches the request and sends the response some time later.
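The relationship between LEO and HW can be shown in a few lines. This is a simplification (the real ReplicaManager tracks far more per-replica state), but the core rule holds: the high watermark is the minimum log end offset across the in-sync replicas.

```java
import java.util.*;

public class PartitionReplicas {
    // Log end offset (LEO) of each in-sync replica, keyed by broker id.
    private final Map<Integer, Long> leo = new HashMap<>();

    public void updateLeo(int brokerId, long offset) { leo.put(brokerId, offset); }

    // Only messages below the high watermark are replicated to every
    // in-sync replica, and only those are visible to consumers.
    public long highWatermark() { return Collections.min(leo.values()); }
}
```

A slow follower thus holds back the HW, which is why consumers never see messages that could be lost on leader failover.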

Chapter 7: The Kafka controller is in charge of the partition state machine, the replica state machine, and several listeners, such as broker startup/shutdown, topic deletion, and partition reassignment. The controller's main duty is electing each partition's leader and sending LeaderAndIsr requests down to the brokers; each target broker then decides whether to become the partition's leader or a follower. Furthermore, we explain the difference between local and remote replicas, and the role of the metadata cache.
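The basic leader election rule can be sketched as "the first replica in the assigned list that is both alive and in the ISR". This mirrors the offline-partition leader selector's core rule only; it deliberately omits unclean leader election and the other selector strategies:

```java
import java.util.*;

public class LeaderElection {
    // Pick the first assigned replica that is alive and in sync.
    // The assigned-replica order acts as the preference order, which is
    // what spreads leadership evenly across brokers.
    public static Optional<Integer> electLeader(List<Integer> assignedReplicas,
                                                Set<Integer> isr,
                                                Set<Integer> aliveBrokers) {
        for (int replica : assignedReplicas) {
            if (aliveBrokers.contains(replica) && isr.contains(replica)) {
                return Optional.of(replica);
            }
        }
        return Optional.empty(); // no eligible leader: partition goes offline
    }
}
```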

Chapter 8: First we introduce two kinds of cluster replication tools: Kafka's built-in MirrorMaker and Uber's open-source uReplicator, and show how Apache Helix is used to build uReplicator. Next we introduce the built-in Kafka Connect framework and how to develop a custom connector plugin. We then dive into the connector architecture, concentrating on the data model, the connector model, and the worker model.

Chapter 9: We introduce the two Kafka Streams APIs: the low-level Processor API and the high-level DSL. This chapter focuses on the streaming thread model, including stream instances, threads, and tasks. We also introduce the local state stores used by standby tasks for recovery. After that, we introduce the two abstract components of the high-level DSL, KStream and KTable; both are built on the low-level processors and support common operators as well as advanced functions such as windows and joins.
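The KStream/KTable distinction reduces to "every record" versus "latest value per key". A toy illustration in plain Java (not the Kafka Streams API, just its update semantics):

```java
import java.util.*;

public class StreamVsTable {
    // A stream keeps every record as an independent event; a table view
    // collapses the same records into the latest value per key, the way
    // a KTable materializes a changelog stream.
    public static Map<String, Integer> toTable(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : stream) {
            table.put(record.getKey(), record.getValue()); // newer value wins
        }
        return table;
    }
}
```

The stream `(alice,1),(bob,2),(alice,3)` has three events, but its table view holds only `{alice=3, bob=2}`.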

Chapter 10: We introduce some advanced features, such as client quotas, the new message format in 0.11, and transaction support.
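For quotas, the broker delays a client's response in proportion to how far its observed rate exceeds the configured quota. A simplified version of that computation (the real implementation measures rates over windowed metrics):

```java
public class QuotaThrottle {
    // throttleMs = (observed - quota) / quota * windowMs: delaying the
    // response this long brings the client's average rate back to the quota.
    public static long throttleTimeMs(double observedBytesPerSec,
                                      double quotaBytesPerSec,
                                      long windowMs) {
        if (observedBytesPerSec <= quotaBytesPerSec) return 0;
        return (long) ((observedBytesPerSec - quotaBytesPerSec)
                / quotaBytesPerSec * windowMs);
    }
}
```

A client producing at 2 MB/s against a 1 MB/s quota over a 1-second window is delayed by one extra second, halving its effective rate.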

Table of Contents


Errata

TODO


Below are some blog posts I wrote earlier. The published book has of course changed substantially, so these posts are for reference only.

Introduction

Source Code Analysis Roundup

New Consumer

Kafka Connect

Kafka Streams


文章目录
  1. 1. 本书介绍:
  2. 2. English Introduce
  3. 3. 目录
  4. 4. 勘误表
  5. 5. Introduce
  6. 6. 源码分析汇总
  7. 7. 新消费者
  8. 8. Kafka Connect
  9. 9. Kafka Streams