How to design a real-time analysis platform for big data?

As CreditEase's analytics product customized on top of Vertica, PetaBase-V provides real-time analysis services for big data and can use MPP to linearly scale a cluster's computing power and data processing capacity. PetaBase-V is built on columnar database technology, which gives it high performance, high scalability, high compression, and high robustness, and addresses performance problems such as slow report computation and slow detail-level data queries.

The big data real-time analysis platform (hereinafter referred to as PB-S) aims to provide end-to-end real-time data processing capability (millisecond/second/minute latency). It can extract real-time data from multiple data sources and serve multiple data application scenarios with real-time data consumption. As part of a modern data warehouse, PB-S supports real-time, virtualization, popularization, and collaboration capabilities, which lowers the threshold for developing real-time data applications, speeds up iteration, improves quality, makes operation more stable, simplifies operations and maintenance, and strengthens overall capability.

Overall design idea

We abstracted user needs into four unified platform layers:

Unified data acquisition platform

Unified streaming processing platform

Unified computing service platform

Unified data visualization platform

At the same time, the design stays open at the storage layer: users can choose different storage engines to meet the needs of specific projects without breaking the overall architecture, and can even combine multiple heterogeneous stores within one pipeline. The four abstraction layers are explained below.

1) Unified data acquisition platform

The unified data acquisition platform supports both full extraction and incremental extraction from different data sources. Incremental extraction from business databases reads the database logs, reducing the read pressure on the source systems. The platform also processes the extracted data in a unified way and publishes it to the data bus in a unified format. Here we chose a customized, standardized message format, UMS (Unified Message Schema), as the data-layer protocol between the unified data acquisition platform and the unified stream processing platform.

UMS carries its own namespace information and schema information, making it a self-locating, self-describing message protocol. The advantages of this are:

The architecture does not have to rely on an external metadata management platform;

Messages are decoupled from the physical medium (here, the Kafka topic, the Spark Streaming stream, etc.), so multiple message streams can run in parallel and message streams can drift freely across physical media.

The platform also supports multi-tenancy and configurable, simple data-cleansing capabilities.
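The exact UMS specification is not reproduced here; the following is a minimal Python sketch of what a self-describing message with namespace and schema information might look like, and how a consumer could interpret it without an external metadata service. The field names (`namespace`, `schema`, `payload`) and the example source are illustrative assumptions, not the actual UMS layout.

```python
import json

# A hypothetical self-describing message: it carries its own namespace
# (which source/table it belongs to) and schema (field names and types),
# so a consumer needs no external metadata platform to interpret it.
ums_like_message = json.dumps({
    "namespace": "mysql.orders_db.orders",            # self-locating: source + table
    "schema": [                                        # self-describing: fields + types
        {"name": "order_id",   "type": "long"},
        {"name": "amount",     "type": "decimal"},
        {"name": "updated_at", "type": "datetime"},
    ],
    "payload": [                                       # rows encoded as tuples
        [1001, "19.90", "2019-01-01 12:00:00"],
        [1002, "35.50", "2019-01-01 12:00:01"],
    ],
})

def decode(message: str):
    """Turn a self-describing message into a list of dicts using its own schema."""
    doc = json.loads(message)
    names = [f["name"] for f in doc["schema"]]
    return doc["namespace"], [dict(zip(names, row)) for row in doc["payload"]]

if __name__ == "__main__":
    namespace, rows = decode(ums_like_message)
    print(namespace, rows[0])
```

Because every message carries its schema, the same consumer code works no matter which topic or stream the message travels over, which is what makes the "free drift" across physical media possible.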

2) Unified stream processing platform

The unified stream processing platform consumes messages from the data bus and supports both UMS-protocol messages and plain JSON messages. It also provides the following capabilities:

Supports visual, configuration-based, and SQL-based development, lowering the threshold for developing, deploying, and managing streaming logic.

Supports configuration-driven idempotent sinking into multiple heterogeneous target stores, ensuring eventual consistency of the data (see the sketch after this list).

Supports multi-tenancy, with project-level isolation of computing resources, table resources, and user resources.
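As a hedged illustration of the idempotent-sink idea above, the sketch below applies a change record to a target store only when the incoming record's version (for example a log offset or update timestamp) is at least as new as what is already stored, so replays and retries converge to the same final state. The in-memory dict stands in for an arbitrary heterogeneous target store; in practice this comparison would be pushed into the target's upsert/merge statement.

```python
# Minimal sketch of an idempotent, configuration-style sink.
# target_table stands in for any heterogeneous target store (RDBMS, KV store, etc.).
target_table = {}  # primary_key -> {"version": ..., "data": ...}

def idempotent_upsert(record: dict, key_field: str = "id", version_field: str = "offset"):
    """Apply a change record only if it is not older than the stored version.

    Replaying the same stream (at-least-once delivery) then still converges
    to the same final state, which is what eventual consistency requires.
    """
    key = record[key_field]
    existing = target_table.get(key)
    if existing is None or record[version_field] >= existing["version"]:
        target_table[key] = {"version": record[version_field], "data": record}

# Re-delivered records do not corrupt the final state:
for rec in [
    {"id": 1, "offset": 10, "amount": 5},
    {"id": 1, "offset": 12, "amount": 8},
    {"id": 1, "offset": 10, "amount": 5},   # duplicate replay, ignored
]:
    idempotent_upsert(rec)

print(target_table[1]["data"]["amount"])  # -> 8
```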

3) Unified computing service platform

The unified computing service platform is an implementation of data virtualization / data federation. Internally it supports computation push-down to, and pull-up mixed computation across, multiple heterogeneous data sources; externally it exposes a unified service interface (JDBC/REST) and a unified query language (SQL). Because the platform provides a single service entry point, modules such as unified metadata management, data quality management, data security auditing, and data security policy can be built on top of it. The platform also supports multi-tenancy.
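To make the "unified interface" idea concrete, here is a hedged sketch of a client sending plain SQL over REST to such a computing service, where the one statement joins two heterogeneous sources and the federation layer decides what to push down versus pull up. The endpoint URL, payload layout, and federated table names are hypothetical, not the platform's actual API.

```python
import json
import urllib.request

# Hypothetical REST endpoint of the unified computing service platform.
SERVICE_URL = "http://compute-service.example.com/api/v1/query"

# One SQL statement joining two heterogeneous sources; the federation layer
# pushes filters down to each source and mixes the rest of the computation.
sql = """
SELECT o.order_id, o.amount, u.city
FROM   mysql_orders.orders AS o
JOIN   hive_dw.dim_user    AS u ON o.user_id = u.user_id
WHERE  o.order_date = '2019-01-01'
"""

def run_query(sql_text: str):
    body = json.dumps({"sql": sql_text}).encode("utf-8")
    req = urllib.request.Request(
        SERVICE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:   # rows returned as JSON
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(run_query(sql))
```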

4) Unified data visualization platform

The unified data visualization platform, combined with multi-tenancy and a complete user/permission system, supports division of labor and collaboration among data practitioners across departments, so that through close cooperation in the visualization environment each can play to their strengths and complete the "last mile" of data platform applications.

The overall module architecture above adopts a unified abstract design with open storage options to improve flexibility and adaptability to requirements. The PB-S design embodies the real-time, virtualization, popularization, and collaboration capabilities of a modern data warehouse and covers the end-to-end OLPP data flow link.

Specific problems and solutions

Next, based on the overall architecture of PB-S, we discuss, from several dimensions, the problems this design must face and how it addresses them.

1) Functional considerations

The main question here is: can a real-time pipeline handle all complex ETL logic?

As we know, a stream computing engine like Storm or Flink processes data record by record; the Spark Streaming engine processes each mini-batch; and an offline batch job processes a full day's data at a time. So the processing scope is one dimension of the data (the range dimension).

In addition, stream processing is oriented toward incremental data; if the data source is a relational database, incremental data usually means incremental change records (revisions). Batch processing, by contrast, is oriented toward snapshot data. So the form in which data is presented is another dimension (the change dimension).

The changes to a single piece of data can be projected and aggregated into a single snapshot, so the change dimension can be folded into the range dimension. The essential difference between stream processing and batch processing therefore lies in the data range: the unit of stream processing is a "limited range", while the unit of batch processing is the "full table range". "Full table range" data supports all SQL operators, whereas "limited range" data supports only some of them.
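A small sketch of that projection: given a stream of change records (revisions) keyed by primary key, replaying them in order and keeping the latest state per key folds the change dimension into a snapshot of the current table state. The record fields used here are illustrative.

```python
# Fold a stream of change records (the change dimension) into a snapshot.
changes = [
    {"id": 1, "op": "insert", "ts": 1, "balance": 100},
    {"id": 2, "op": "insert", "ts": 2, "balance": 50},
    {"id": 1, "op": "update", "ts": 3, "balance": 80},
    {"id": 2, "op": "delete", "ts": 4, "balance": None},
]

snapshot = {}
for change in sorted(changes, key=lambda c: c["ts"]):   # replay in change order
    if change["op"] == "delete":
        snapshot.pop(change["id"], None)
    else:
        snapshot[change["id"]] = {"id": change["id"], "balance": change["balance"]}

# snapshot now behaves like "full table range" data for this key set.
print(snapshot)   # -> {1: {'id': 1, 'balance': 80}}
```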

Complex ETL is not a single operator; it usually consists of multiple operators. As the discussion above shows, simple stream processing alone cannot support all complex ETL logic well. How, then, can a real-time pipeline support more complex ETL operators while preserving timeliness? This requires the ability to convert back and forth between "limited range" and "full table range" processing.

Imagine that the stream processing platform performs suitable processing on the stream and sinks the results into different heterogeneous stores in real time, while the computing service platform periodically (the interval can be set to a few minutes or less) runs mixed batch computation across those multi-source heterogeneous stores and publishes each batch result back to the data bus so it continues to flow downstream. In this way, the stream processing platform and the computing service platform form a closed computing loop: each handles the operators it is good at, and the data is transformed operator by operator as streams are triggered at different frequencies. In theory, this architecture pattern can support all ETL.
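The closed loop described above can be sketched as a periodic driver: the stream layer keeps sinking incremental results into heterogeneous stores, and a scheduled mixed-batch query publishes its result back to the data bus. Everything below (the query function, topic name, and interval) is hypothetical scaffolding that only shows the control flow, not a real implementation.

```python
import time

def mixed_batch_query():
    """Placeholder for a federated SQL query across the heterogeneous stores
    that the stream layer keeps up to date (e.g., via the computing service)."""
    return [{"metric": "orders_last_5min", "value": 42}]

def publish_to_bus(topic: str, rows):
    """Placeholder for producing the batch result back onto the data bus (e.g., a Kafka topic)."""
    print(f"publish {len(rows)} rows to {topic}")

def run_closed_loop(interval_seconds: float = 300, rounds: int = 3):
    # Each round: batch-compute over the freshest sunk data, then hand the
    # result back to the stream layer, closing the stream <-> batch loop.
    for _ in range(rounds):
        rows = mixed_batch_query()
        publish_to_bus("result.orders_metrics", rows)
        time.sleep(interval_seconds)

if __name__ == "__main__":
    run_closed_loop(interval_seconds=1, rounds=3)   # shortened interval for the demo
```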

2) Quality considerations

The discussion above also leads to the two mainstream real-time data processing architectures: the Lambda architecture and the Kappa architecture. There is plenty of material online introducing them, so I will not go into detail here. Each has its own advantages and disadvantages, but both support eventual consistency of the data and thus guarantee data quality to some extent. How the Lambda and Kappa architectures can borrow each other's strengths to form a hybrid architecture will be discussed in detail in another article.

Of course, data quality is itself a very large topic. Supporting re-computation and re-injection alone cannot solve every data quality problem; it only provides an engineering scheme for backfilling data at the technical-architecture level. Big data quality will be discussed separately as its own topic.

3) Stability considerations

This topic covers, but is not limited to, the following points; a brief solution idea is given for each:

High availability (HA)

Select highly available components across the entire real-time pipeline to ensure end-to-end high availability in principle; support data backup and replay for key data links; support a dual-run merge mechanism for key business links.

SLA guarantee

On top of high availability of the cluster and the real-time pipeline, support dynamic capacity expansion and automatic drift (migration) of data processing flows.

Elasticity and anti-fragility

Elastic scaling of resources based on rules and algorithms;

Fault handling via an event-triggered action engine.

Monitoring and early warning

Multi-faceted monitoring and alerting across cluster infrastructure, physical pipelines, and data logic.

Automatic operation and maintenance

Capture and archive missed data, handle abnormal situations, and retry automatically on a regular schedule to repair problem data (see the sketch below).
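One common way to realize the "regular automatic retry" point is a capture-archive-retry loop with a bounded number of attempts. The sketch below is a generic Python illustration under that assumption, not the platform's actual mechanism; `process` is a hypothetical stand-in for the real processing step.

```python
import time

failed_archive = []   # records that could not be processed are archived here

def process(record):
    """Placeholder for the real processing step; raises on failure."""
    if record.get("bad"):
        raise ValueError("cannot process record")

def process_with_retry(record, max_attempts: int = 3, backoff_seconds: float = 1.0):
    """Retry a failing record a few times, then archive it for later repair."""
    for attempt in range(1, max_attempts + 1):
        try:
            process(record)
            return True
        except Exception:
            if attempt == max_attempts:
                failed_archive.append(record)   # captured for offline repair / re-injection
                return False
            time.sleep(backoff_seconds * attempt)   # simple linear backoff

if __name__ == "__main__":
    process_with_retry({"id": 1})
    process_with_retry({"id": 2, "bad": True}, backoff_seconds=0.01)
    print(len(failed_archive))   # -> 1
```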

Upstream metadata change resistance

Require upstream business databases to make backward-compatible metadata (schema) changes;

The real-time pipeline processes only explicitly declared fields (see the sketch below).
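A hedged sketch of "process only explicitly declared fields": the pipeline projects each incoming record onto the fields it has declared, so a compatible upstream change (for example, a newly added column) passes through harmlessly instead of breaking the job. The field names are illustrative.

```python
# Fields the pipeline has explicitly declared that it processes.
DECLARED_FIELDS = ["order_id", "amount", "updated_at"]

def project_declared(record: dict) -> dict:
    """Keep only the declared fields; ignore any columns added upstream later."""
    return {name: record.get(name) for name in DECLARED_FIELDS}

# Upstream added a new column "coupon_code"; the pipeline is unaffected.
incoming = {"order_id": 1001, "amount": 19.9, "updated_at": "2019-01-01", "coupon_code": "A1"}
print(project_declared(incoming))   # -> only the three declared fields
```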

4) Cost considerations

This topic covers, but is not limited to, the following points; a brief solution idea is given for each:

Human cost

Reduce labor costs by supporting the popularization of data application development.

Resource cost

Support dynamic resource utilization to reduce the waste caused by static resource allocation.

Operation and maintenance costs

Reduce operation and maintenance costs by supporting automatic operation and maintenance, high availability, and elasticity/anti-fragility mechanisms.

Trial and error cost

Reduce trial-and-error costs by supporting agile development and rapid iteration.

5) Agility considerations

Agile big data is a body of theory and methodology that has been described in previous articles. From the perspective of data use, agility here means: configuration-driven, SQL-based, and popularized.

6) Management considerations

Data management is also a very large topic; here we focus on two aspects: metadata management and data security management. In a modern environment with multiple warehouses and multiple data stores, managing metadata and data security in a unified way is a challenging task. For both aspects we provide built-in support on each platform along the real-time pipeline, while also supporting an external unified metadata management platform and unified data security policies.

The above is the design scheme of big data real-time analysis platform PB-S.