How to monitor operation and maintenance?

A unified monitoring platform is, in the end, still a monitoring system, so the basic monitoring capabilities are essential. Going back to the essence of monitoring, let's first lay out the monitoring system as a whole:

① The essence of a monitoring system is to ensure business stability by finding, resolving, and preventing faults.

② A monitoring system generally includes six modules: data collection, data detection, alarm management, fault management, view management, and monitoring management. Data collection, data detection, and alarm handling form the minimum closed loop of monitoring, but to build a monitoring system that really works, the fault-management closed loop, view management, monitoring management, and the other modules are also indispensable.

First, data collection.

1. Collection mode

Data collection methods are generally divided into agent-based and agentless modes.

The agent-based mode includes plug-in collection, script collection, log collection, process collection, APM probes, and so on.

The agentless mode includes collection over general protocols, web probing (synthetic dial tests), API interfaces, and so on.

2. Data type

There are three types of monitoring data: metrics (indicators), logs, and traces.

Metric data consists of numerical monitoring items, identified mainly by their dimensions.

Log data is text data; monitoring it mainly means searching for keyword information.

Trace data follows a request as it flows through the call chain, so you can check whether the time spent at each step is normal.
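The three data types can be pictured as three small record shapes. The sketch below is only an illustration: the class names, field names, and the `slow_spans` helper are assumptions made for this example, not the schema of any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A numerical monitoring item identified by its dimensions."""
    name: str
    value: float
    timestamp: float
    dimensions: dict = field(default_factory=dict)


@dataclass
class LogLine:
    """Text data that is searched for keywords."""
    timestamp: float
    message: str


@dataclass
class Span:
    """One step of a traced request (hypothetical minimal shape)."""
    trace_id: str
    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def slow_spans(spans, limit_seconds):
    """Return the names of the steps that took longer than the limit."""
    return [s.name for s in spans if s.duration > limit_seconds]
```

A metric would typically carry dimensions such as `{"host": "web-01", "mount": "/data"}`, while a trace is a list of `Span` records sharing one `trace_id`.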

3. Acquisition frequency

There are three types of collection frequency: second-level, minute-level, and random. The most common collection frequency is minute-level.

4. Access and transmission

Access and transmission can be classified either by who initiates the transfer or by the transmission link.

By initiator, there is active collection (pull) and passive reception (push).

By transmission link, there is direct connection and proxy transmission.

Proxy transmission not only solves the problem of sending monitoring data across networks, but also eases the transmission bottleneck caused by too many monitored nodes, because the proxies split the data flow.

5. Data storage

For the monitoring system, there are mainly three kinds of storage to choose from.

① Relational database

Examples: MySQL, MSSQL, DB2. Typical monitoring systems: Zabbix, SCOM, Tivoli.

Because of the limits of relational databases themselves, they struggle with massive monitoring workloads and hit performance bottlenecks, so they are commonly used only in traditional monitoring systems.

② Time series database

A database designed for the monitoring scenario, good at storing and computing metric data. Examples: InfluxDB, OpenTSDB (based on HBase), Prometheus. Typical monitoring systems: the TICK stack, Open-Falcon, Prometheus.

③ Full-text search database

This type of database is mainly used for log storage and is well suited to data retrieval. Example: Elasticsearch.

Second, data detection.

1. Data processing

① Data cleaning

Take log data as an example: because logs are unstructured and have low information density, the useful data has to be extracted from them.
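A minimal sketch of log cleaning with a regular expression follows. The log layout (`<date> <time> <LEVEL> <text>`) and the function names are assumptions chosen for illustration; real logs need a pattern that matches their actual format.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <free text>".
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def clean_log_line(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def keep_levels(lines, levels=("ERROR", "WARN")):
    """Drop unparseable lines and keep only records at the given levels."""
    records = (clean_log_line(line) for line in lines)
    return [r for r in records if r is not None and r["level"] in levels]
```

Lines that do not match the pattern are discarded here; whether to discard or to store them separately is a design choice that depends on how much the raw text matters.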

② Data calculation

Much raw performance data cannot be used directly to judge whether something is abnormal. For example, suppose the collected data is the total disk capacity and the disk space used. To detect the disk usage rate, you need to apply simple arithmetic to the existing metrics to obtain it.
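The disk example comes down to one division. The sketch below assumes the two collected values are the total capacity and the used space in the same unit (bytes here); the function name is made up for this example.

```python
def disk_usage_percent(total_bytes, used_bytes):
    """Derive the usage rate (in percent) from total capacity and used space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100.0 / total_bytes
```

For instance, 400 GB used out of 500 GB gives a usage rate of 80 percent, which can then be compared against a threshold.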

③ Data enrichment

Data enrichment means attaching labels to the data, such as the host and the machine room it belongs to, which makes aggregate calculation easier.
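As a rough illustration of why labels help aggregation, the sketch below attaches a machine-room label to each sample and then averages by that label. The `HOST_ROOM` mapping, host names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory mapping each host to its machine room.
HOST_ROOM = {"web-01": "room-a", "web-02": "room-a", "db-01": "room-b"}


def enrich(sample, host_room=HOST_ROOM):
    """Return a copy of the sample with a 'room' label attached."""
    labeled = dict(sample)
    labeled["room"] = host_room.get(sample["host"], "unknown")
    return labeled


def average_by(samples, label):
    """Average the 'value' field of the samples, grouped by the given label."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[label]].append(s["value"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```

Once every sample carries the label, the same `average_by` call works for any dimension (room, business line, cluster) without touching the collection side.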

④ Metric derivation

Metric derivation means computing new metrics from existing ones.

2. Detection algorithm

There are fixed rules and machine learning algorithms. Fixed rules are the more common kind, such as static thresholds, year-on-year comparison, and user-defined rules, while machine learning mainly covers dynamic baselines, spike detection, metric prediction, multi-metric correlation detection, and so on.

Whether fixed rules or machine learning are used, there are corresponding judgment rules.
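Two of the fixed rules mentioned above, a static threshold and a comparison against the same point in a previous period, can be sketched as follows. The function names and the ratio-based definition of "deviation" are assumptions for illustration; real systems define the comparison in different ways.

```python
def exceeds_static_threshold(value, threshold):
    """Static threshold rule: abnormal when the value is above the threshold."""
    return value > threshold


def deviates_from_previous_period(current, previous, max_change_ratio):
    """Period comparison rule.

    Abnormal when the relative change versus the value observed at the same
    point of the previous period exceeds max_change_ratio (0.3 means 30%).
    """
    if previous == 0:
        # No baseline to divide by: treat any non-zero value as a change.
        return current != 0
    return abs(current - previous) / abs(previous) > max_change_ratio
```

A static threshold suits metrics with a hard limit such as disk usage, while the period comparison suits metrics with a daily or weekly rhythm such as request volume.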

Third, alarm management.

1. Alarm enrichment

Alarm enrichment prepares for the subsequent analysis of alarm events, which needs auxiliary information to decide how to handle, analyze, and notify.

Generally speaking, alarm enrichment uses rules to link data sources such as the CMDB, the knowledge base, and job history, adding fields and related information to the alarm. Manual labeling is another way to enrich alarms, but it is hard to put into practice because of the high labor cost.
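A minimal sketch of rule-based enrichment is shown below, assuming the CMDB and the knowledge base can be read as simple lookups. The dictionaries, field names (`owner`, `business`, `suggestion`), and sample entries are all hypothetical.

```python
# Hypothetical extracts of a CMDB (keyed by host) and a knowledge base
# (keyed by alarm name); real systems would query these over an API.
CMDB = {"web-01": {"owner": "alice", "business": "checkout"}}
KNOWLEDGE_BASE = {"disk_full": "Clear old logs under the application log directory."}


def enrich_alarm(alarm, cmdb=CMDB, knowledge_base=KNOWLEDGE_BASE):
    """Return a copy of the alarm with CMDB fields and a handling hint added."""
    # The alarm's own fields win if a CMDB field has the same name.
    enriched = {**cmdb.get(alarm["host"], {}), **alarm}
    hint = knowledge_base.get(alarm["name"])
    if hint is not None:
        enriched["suggestion"] = hint
    return enriched
```

With the owner and business attached, later steps such as notification routing and aggregation can work from the alarm record alone.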

2. Alarm convergence

There are three approaches to alarm convergence: suppression, shielding, and aggregation.

① Suppression

Suppression means holding back repeated alarms for the same problem. Common suppression schemes include anti-shake (debounce) suppression, dependency suppression, time-based suppression, combined-condition suppression, high-availability suppression, and so on.
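Anti-shake suppression can be sketched as "alarm only after several consecutive abnormal checks, and only once per incident". The class below is one possible reading of that idea; the class name and the default of three checks are assumptions.

```python
class DebounceSuppressor:
    """Send an alarm only after `required` consecutive abnormal checks."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0       # consecutive abnormal checks seen so far
        self.firing = False   # True while the current incident has been alarmed

    def observe(self, abnormal):
        """Feed one check result; return True only when a new alarm should be sent."""
        if not abnormal:
            # A normal check ends the incident and resets the counter.
            self.streak = 0
            self.firing = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.firing:
            self.firing = True
            return True
        return False
```

A single abnormal sample followed by recovery never reaches the required streak, so brief jitter produces no alarm, and a long incident produces exactly one.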

② Shielding

Shielding covers predictable situations, such as change and maintenance windows and fixed periodic tasks. These are things that are already known and expected to happen.
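Shielding for a planned maintenance period can be sketched as a time-window check. The `(host, start, end)` tuple layout and the function name are assumptions for this example.

```python
from datetime import datetime


def is_shielded(alarm_time, host, windows):
    """Return True if the alarm falls inside a planned window for that host.

    `windows` is a list of (host, start, end) tuples; the end is exclusive.
    """
    return any(
        w_host == host and start <= alarm_time < end
        for w_host, start, end in windows
    )
```

Alarms that match a window are typically still recorded but not notified, so the history stays complete.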

③ Aggregation

Aggregation merges similar or identical alarms, because they may all reflect the same underlying phenomenon. For example, when business traffic increases, the CPU, memory, disk IO, and network IO of the hosts carrying the business all rise sharply. Grouping these performance alarms together makes alarm analysis and handling easier.
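One simple way to aggregate is to group alarms from the same host that arrive within the same time window. The sketch below assumes each alarm is a dict with `host`, `metric`, and `timestamp` (in seconds) fields; the field names and the five-minute window are illustrative choices.

```python
from collections import defaultdict


def aggregate_alarms(alarms, window_seconds=300):
    """Merge alarms that share a host and fall into the same time bucket."""
    groups = defaultdict(list)
    for alarm in alarms:
        bucket = int(alarm["timestamp"] // window_seconds)
        groups[(alarm["host"], bucket)].append(alarm["metric"])
    return [
        {"host": host, "bucket": bucket, "metrics": sorted(metrics)}
        for (host, bucket), metrics in sorted(groups.items())
    ]
```

Fixed buckets are the simplest choice; two related alarms that straddle a bucket boundary end up in separate groups, which a sliding window would avoid at the cost of more bookkeeping.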

3. Alarm notification

(1) Notifying people

People can be reached through the regular notification channels.

That way, even when no one is watching the screen, alarms still reach the staff through WeChat, SMS, and email.

(2) Notifying systems

Generally, alarms are pushed to a third-party system through an API to facilitate subsequent event handling.

In addition, custom channel extensions need to be supported (for example, an enterprise with its own IM system can integrate it itself).

Fourth, fault management.

Alarm events must be handled in a closed loop, otherwise monitoring is meaningless.

Manual handling is the most common: on-call duty, work orders, fault escalation, and so on.

Experience accumulation: faults that were handled manually can be recorded in a knowledge base for reference in later fault handling.

Automatic handling: by extracting and codifying the handling procedure for certain alarms, fault self-healing is achieved for specific scenarios. For example, some useless logs are cleared when a disk space alarm fires.

Intelligent analysis: AI algorithms such as fault correlation analysis, localization, and prediction improve the efficiency of fault location and handling.
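The self-healing example mentioned above (clearing logs on a disk space alarm) can be sketched as a registry that maps an alarm name to a fixed procedure. Everything here is hypothetical: the alarm name `disk_space`, the `.log` suffix rule, and the age limit are assumptions, and a real procedure would need safeguards before deleting anything.

```python
import os
import time


def clear_old_logs(log_dir, max_age_seconds, now=None):
    """Delete *.log files in log_dir older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not (name.endswith(".log") and os.path.isfile(path)):
            continue
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed


# Hypothetical registry: alarm name -> codified handling procedure.
SELF_HEALING = {"disk_space": clear_old_logs}


def handle_alarm(alarm_name, **kwargs):
    """Run the registered procedure, or return None to fall back to manual handling."""
    action = SELF_HEALING.get(alarm_name)
    if action is None:
        return None
    return action(**kwargs)
```

Alarms without a registered procedure return `None`, which is the point where the manual process (on-call duty, work orders) would take over.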

Fifth, view management.

View management is also a value-added function, mainly to meet people's psychological needs, so it has to serve many roles (leaders, administrators, on-duty staff, etc.).

Big screen: for leaders, provides a global overview.

Topology: for operation and maintenance personnel, provides views of alarm correlation and impact scope.

Dashboard: for operation and maintenance personnel, provides customized views of the metrics they care about.

Report: for operation and maintenance personnel and leaders, provides statistical summaries such as weekly and daily reports.

Retrieval: for operation and maintenance personnel, used for all kinds of data retrieval in fault analysis scenarios.

Sixth, monitoring management.

Monitoring management is the biggest challenge in rolling out monitoring in an enterprise. The first five modules are service functions that the monitoring system provides, while monitoring management governs the monitoring system itself and focuses on the features that matter in actual adoption. It mainly covers the following aspects:

Configuration: simple, batch and automatic

Coverage: a measure of monitoring level

Metric library: the specification of monitoring metrics

Mobile: handle problems anytime, anywhere

Permission: access control

Audit: managing compliance

API: the largest source of operation and maintenance data, used for data consumption.

Self-monitoring: the guarantee of self-stability

To deliver the six basic monitoring capability modules above, we can design the unified monitoring platform according to the following architecture.

It is mainly divided into three layers: the access layer, the capability layer, and the function layer.

The access layer mainly considers the access of all kinds of data. In addition to the acquisition and access of its own agents and plug-ins, it is also necessary to support the data access of third-party monitoring sources in order to make a complete unified monitoring platform.

The capability layer mainly covers the basic, general capabilities of monitoring, including the data collection module, data storage module, data processing module, data detection module, and AI analysis module.

The function layer needs to stay close to the user's usage scenarios. It mainly includes management and display functions, and its scenarios can be enriched continuously during construction.

In addition, to take data correlation into account and lay a foundation for future data analysis, monitoring needs to be closely linked with the CMDB, and all monitored objects should be managed in the CMDB. Configuration-driven monitoring can also serve as the guiding concept: monitoring is brought online and offline automatically, and the person in charge is identified automatically for alarm notifications, which simplifies the maintenance and management of monitoring.

To make a unified monitoring platform work in an enterprise, corresponding management systems are also needed, the most important of which is the metric management system.

The core idea of the metric management system:

The monitoring metric system takes the CMDB as its skeleton and the monitoring metrics as its meridians, organically integrating the data of the whole unified monitoring platform.

Lifecycle management of metrics, supplemented by metric management standards, ensures the long-term orderly operation of the monitoring platform.

From the perspective of enterprise business applications, monitored objects are generally divided into six layers, which can be adjusted to the enterprise's own situation:

Infrastructure layer

Hardware device layer

Operating system layer

Component service layer

Application performance layer

Business operation layer