
Decoding the slogan

Apache Kafka (hereafter Kafka) was originally a distributed messaging system developed at LinkedIn, and it is now an Apache top-level project. The Kafka community is very active, and Kafka has become the most widely used messaging system in this field. Over its versions, Kafka's slogan has changed from "a high-throughput, distributed messaging system" to "a distributed streaming platform".

For Kafka, I'm going to start from the basic introduction and work down to its underlying principles and source code. I suggest you read patiently from the beginning; I believe you will gain something.

As a streaming data platform, the most important thing is that Kafka has the following characteristics.

Message system:

There are two main messaging models: the queue and publish-subscribe. Kafka unifies both through the consumer group: under the queue model, Kafka evenly distributes the processing across the consumer members of a single consumer group; under the publish-subscribe model, each consumer group receives its own copy of every message.
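As a minimal sketch of how the group.id setting selects between the two models (the broker address, topic, and group id below are placeholders): consumers that share one group id split the topic's partitions like a queue, while consumers given distinct group ids each receive every message, like publish-subscribe.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All consumers started with this same group id share the topic's
        // partitions between them (queue model); give each consumer its own
        // group id instead and every one receives all messages (publish-subscribe).
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(), record.value());
            }
        }
    }
}
```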

Below, we analyze several of Kafka's basic concepts from different angles and try to answer the questions they raise.

After producers publish messages to the Kafka cluster, consumers consume them under one of two models: the push model and the pull model. In a message system based on the push model, the message broker records each consumer's consumption state: after pushing a message to a consumer, it marks the message as consumed.

However, this approach cannot guarantee the processing semantics of messages well. For example, if the consumer process crashes after the broker has sent a message, or the message is never received due to a network problem, the message may be lost (the broker has already marked it as consumed, but the consumer never actually processed it). To guarantee the processing semantics, the broker would have to set a message's status to "sent" after sending it and update it to "consumed" only after receiving an acknowledgment from the consumer; but that requires the broker to record the consumption state of every message, which is also undesirable.
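Kafka avoids this by pulling: the broker only stores the log, and the consumer itself decides when a message counts as consumed by committing its offset after processing. A minimal at-least-once sketch (topic and group names are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "at-least-once-group");
        props.put("enable.auto.commit", "false"); // commit only after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // crash here and the uncommitted records are redelivered
                }
                consumer.commitSync(); // mark as consumed only after successful processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```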

Each topic's multiple partition logs are distributed across the Kafka cluster. At the same time, for fault tolerance, each partition is replicated to multiple broker nodes: one replica acts as the leader (primary) replica, and the replicas on the other nodes act as follower (backup) replicas.

The leader replica handles all client read and write operations, while follower replicas only synchronize data from the leader. When the leader replica fails, one of the follower replicas is elected as the new leader. Because only the leader replica of each partition accepts reads and writes, each server acts as the leader for some partitions and as a follower for others, so the Kafka cluster as a whole balances the client load across all servers.
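As a hedged sketch of how this replication setting is exposed to clients (the topic name and counts below are arbitrary), a topic can be created with several partitions and a replication factor, giving each partition one leader replica plus followers on other brokers:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 3 partitions spread over the cluster, each with 2 replicas:
            // one leader serving reads/writes, one follower syncing from it.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get(); // wait for completion
        }
    }
}
```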

A message system usually consists of producers, consumers, and message brokers. Producers write messages to the broker, and consumers read messages from the broker. From the broker's point of view, producers and consumers are both clients: they send client requests to the server, the server stores or retrieves messages accordingly, and finally the server returns the response to the client.

The new producer API uses a KafkaProducer object to represent a producer client process. When a producer wants to send a message, it does not send it to the server directly; instead, the message is first placed in a client-side queue, and a message-sending thread then sends messages from that queue. The record collector (RecordAccumulator) is responsible for caching messages produced by the producer client in batches, and the sending thread (Sender) is responsible for reading batches from the record collector and sending them over the network to the server. To keep the client's network requests responsive, Kafka uses a selector (Selector) to read and write on network connections and lets the network client (NetworkClient) handle the client's network requests.
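Putting these pieces together, here is a minimal producer sketch (the broker address and topic name are placeholders): send() only appends the record to the RecordAccumulator and returns a future, while the background Sender thread performs the actual network I/O.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: the record is buffered in the
            // RecordAccumulator and the Sender thread ships it later.
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("demo-topic", "key", "hello kafka"))
                    .get(); // block until the broker acknowledges
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```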

When messages are appended to the record collector, they are grouped by partition and placed in the batches collection; each partition's queue holds the records to be sent to the node hosting that partition. The Sender thread traverses each partition in batches, obtains the leader replica node for the partition, and takes the batched records out of the partition's queue to send.

The message sending thread has two ways to send: directly per partition, or grouped by the partitions' target nodes. Suppose the partitions are spread across two servers, each server holding some of them. If the sending thread traverses each partition of batches and sends its messages directly to that partition's leader replica node, there is one request per partition. If it instead first groups the partitions by their leader replica nodes, putting all partitions belonging to the same node together, there are only two requests in total (one per server), which greatly reduces the number of network requests.
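A hedged sketch of the grouping idea (the Node, TopicPartition, Batch types and the leaderOf map are simplified stand-ins invented for illustration, not Kafka's internal classes):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByNodeSketch {
    // Hypothetical stand-ins for Kafka's internal types.
    record Node(int id) {}
    record TopicPartition(String topic, int partition) {}
    record Batch(TopicPartition tp, List<String> records) {}

    /** Bucket ready batches by the leader node of their partition. */
    static Map<Node, List<Batch>> groupByLeader(List<Batch> ready,
                                                Map<TopicPartition, Node> leaderOf) {
        Map<Node, List<Batch>> byNode = new HashMap<>();
        for (Batch batch : ready) {
            Node leader = leaderOf.get(batch.tp());
            byNode.computeIfAbsent(leader, n -> new ArrayList<>()).add(batch);
        }
        return byNode; // one produce request per node instead of one per partition
    }
}
```

With three partitions whose leaders live on two nodes, per-partition sending issues three requests, while the grouped form issues only two.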

A message system consists of producers, a storage system, and consumers. The previous chapter analyzed how producers send messages to the server; this chapter analyzes how consumers read the messages written by producers from the server's storage system. First, we need some basic knowledge about consumers.

As a distributed message system, Kafka supports multiple producers and multiple consumers: producers can publish messages to different partitions on different nodes in the cluster, and consumers can likewise consume messages from multiple partitions on multiple nodes. When writing messages, multiple producers may write to the same partition. But if multiple consumers were to read the same partition at the same time, then to ensure that different data from the log file went to different consumers, the log file would have to be controlled at the partition level with locks, synchronization, and similar mechanisms.

Conversely, if we adopt the convention that "the same partition can only be processed by one consumer", no lock synchronization is needed, which improves the consumers' processing capacity, and it does not violate the processing semantics of messages: what used to be handled by multiple consumers is simply handled by one consumer. In the simplest deployment mode of the message system, the producers' data sources are diverse, and they all write into the Kafka cluster; when the cluster's messages are processed, many consumers share the work, and the processing logic of these consumers is identical.

Because a rebalance redistributes partitions and changes their owners, every consumer must stop its existing pulling work before the partitions are redistributed. At the same time, since the owner information was recorded in ZooKeeper when partitions were assigned to consumers, the old owner nodes in ZooKeeper should be deleted first. A partition can be reassigned only after its owner has been released and the pull thread associated with it has stopped.

If this ownership information is not released before the partitions are redistributed, the same partition may be owned by multiple consumers after the rebalance. For example, suppose partition P1 was originally owned by consumer C1 (the consumer names here are illustrative). If C1's pulling work is not stopped and its owner node in ZooKeeper is not released, P1 may be assigned to another consumer C2 after the rebalance, so both C1 and C2 would own P1, which clearly violates Kafka's constraint that "one partition can only be assigned to one consumer". The rebalancing operation therefore proceeds along these steps: stop the pull threads, release the partition owner nodes in ZooKeeper, recompute the assignment, record the new owners in ZooKeeper, and resume pulling.

If the coordinator node fails, the server has its own fault-tolerance mechanism to elect a new coordinator node to manage all the consumers in the consumer group. The consumer client has no say in this work; all it can do is wait a while and then ask the server whether a new coordinator node has been elected. Once a consumer finds that a new coordinator node is now managing it, it connects to this new coordinator node. Because the coordinator node is newly elected by the server, every consumer must reconnect to it.
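A hedged sketch of this wait-and-retry behavior (the findCoordinator helper and the backoff interval are assumptions for illustration; the real Java client hides this loop inside its coordinator logic, with a FindCoordinator request underneath):

```java
import java.util.Optional;

public class CoordinatorLookupSketch {
    /** Hypothetical: asks any broker which node now coordinates the group. */
    static Optional<String> findCoordinator(String groupId) {
        return Optional.empty(); // placeholder; a real client sends a FindCoordinator request
    }

    static String waitForCoordinator(String groupId) throws InterruptedException {
        while (true) {
            Optional<String> coordinator = findCoordinator(groupId);
            if (coordinator.isPresent()) {
                return coordinator.get(); // reconnect to the newly elected coordinator
            }
            Thread.sleep(100); // back off, then ask the server again
        }
    }
}
```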

When a consumer rejoins a consumer group, the rebalance affects the consumer's pulling work both before and after it is assigned partitions. Before sending the "join group request", the consumer should stop pulling messages; only after receiving its partition assignment in the "join group response" may it resume pulling. The client can also register a custom "consumer rebalance listener" that runs before and after it joins the group, so that the partition change can be handled properly, as shown in the sketch below.
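The "consumer rebalance listener" corresponds to the Java client's ConsumerRebalanceListener interface; a minimal sketch follows (committing offsets on revocation is just one reasonable choice, and the topic name is a placeholder):

```java
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceListenerDemo {
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("demo-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before a rebalance takes our partitions away:
                // a common choice is to commit progress for the old assignment.
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called after the new assignment arrives, before pulling resumes;
                // e.g. seek to externally stored offsets here.
                System.out.println("assigned: " + partitions);
            }
        });
    }
}
```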