
On-call mechanism for effective operation and maintenance

[Editor's note] Chen Bolong, the author of this article, is the founder of OneAlert, a cloud alerting platform, and has worked in IT operations management and cloud computing for over 10 years.

The development of Internet technology is inseparable from operations support. No program is bug-free and no system is problem-free. Problems and failures are not what is frightening; what is frightening is being unable to handle them in an orderly way.

How to handle emergencies effectively has become the key to operations (especially for operations leads). I have seen the operations practices of many companies, from start-ups to small and medium-sized companies to large enterprises, and have summarized here some on-call mechanisms that most of them use to handle emergencies in an orderly manner.

The mechanism is organized around people, processes and tools, and draws on ITIL's management approach. If you are interested, ITIL is worth consulting, especially the service operation part of ITIL V3.

Most companies use monitoring tools such as zabbix, nagios and open-falcon to monitor hardware, networks and applications. This can lead to fragmented monitoring.

Alarm centralization means gathering all alarm events found by production monitoring in one place, so that we only need to watch one platform and can easily tell whether problems are the same or related.

If only one monitoring tool is in use, centralization is less essential, and orderly handling becomes the core issue. Dedicated operations teams range from 3-5 people to dozens or hundreds, so the support process and response mechanism need to be worked out.
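As a minimal sketch of alarm centralization, the Python below maps alarms from different tools onto one common record so they can be viewed together. The payload field names in `FIELD_MAPS` are hypothetical; real payloads depend on how each tool's notification script or webhook is configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str    # monitoring tool that raised the alarm
    host: str
    check: str
    severity: int  # assumed convention: 1 is the most severe
    message: str


# Hypothetical per-tool field names, for illustration only.
FIELD_MAPS = {
    "zabbix": {"host": "hostname", "check": "trigger",
               "severity": "level", "message": "text"},
    "nagios": {"host": "host_name", "check": "service",
               "severity": "state", "message": "output"},
}


def normalize(source, payload):
    """Convert one tool-specific payload into the common Alarm record."""
    fields = FIELD_MAPS[source]
    return Alarm(
        source=source,
        host=payload[fields["host"]],
        check=payload[fields["check"]],
        severity=int(payload[fields["severity"]]),
        message=payload[fields["message"]],
    )


def group_by_host(alarms):
    """Group alarms by host so related problems show up side by side."""
    grouped = {}
    for alarm in alarms:
        grouped.setdefault(alarm.host, []).append(alarm)
    return grouped
```

Once every alarm has the same shape, grouping by host (or by application) makes it easier to see when two tools are reporting the same underlying problem.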

With finer-grained management, the business is split up to form a matrix. For example, first-line and second-line support are each divided by specialty, such as a team responsible for the network and teams responsible for different applications.
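One way to picture such a matrix is a lookup table keyed by support line and specialty. This is only a sketch: the line names, specialties and assignee names are all made up for illustration, and handing over to the next line is an assumed behaviour.

```python
# Support lines in escalation order (assumed names).
LINES = ("first_line", "second_line")

# Hypothetical on-call matrix: (line, specialty) -> who is responsible.
ON_CALL = {
    ("first_line", "network"): "noc-duty",
    ("second_line", "network"): "network-team",
    ("first_line", "payments-app"): "noc-duty",
    ("second_line", "payments-app"): "payments-dev",
}


def assignee(line, specialty):
    """Look up who handles a given specialty on a given support line."""
    return ON_CALL[(line, specialty)]


def escalate(line, specialty):
    """Return the next line's assignee, or None if already at the last line."""
    index = LINES.index(line)
    if index + 1 >= len(LINES):
        return None
    return ON_CALL[(LINES[index + 1], specialty)]
```

Keeping the matrix as data rather than logic means that changing who covers a specialty is a one-line edit.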

In addition, alarm severity should be taken into account and alarms treated differently by level. Stricter teams generally define response levels of 1-3 or 1-5.
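A response level only becomes useful once it is tied to an expected reaction. The sketch below assumes three levels, with 1 the most severe, and attaches an acknowledgement deadline to each; the minute values are illustrative and not taken from any particular company.

```python
from datetime import datetime, timedelta

# Assumed acknowledgement targets per level (1 = most severe), in minutes.
ACK_DEADLINE_MINUTES = {1: 5, 2: 15, 3: 60}


def ack_deadline(level, raised_at):
    """Latest time by which an alarm of this level should be acknowledged."""
    return raised_at + timedelta(minutes=ACK_DEADLINE_MINUTES[level])


def is_overdue(level, raised_at, now):
    """True if the alarm has passed its acknowledgement deadline."""
    return now > ack_deadline(level, raised_at)
```

An overdue check like this is a natural trigger for handing the alarm to the next person or line.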

So here is the question: the planning and design look good, but how do we put them into practice? Monitoring tools such as zabbix, nagios and open-falcon focus on finding problems, whereas the support process is about handling problems, which is a management concern. At present there are few suitable tools on the market for that.

I have come across an Internet finance company that designed a very standardized process and a P0-P5 emergency response plan, covering networks, cloud platforms and nearly 50 application R&D teams.

Task escalation

Dispatching management

No matter how good the technology and design are, it is very frustrating if the notification never arrives and the problem is not handled in time. Solving this last-mile problem comes down to reliable notification.

Notification should also vary by severity level and time period. For example, phone calls may be unnecessary during daytime working hours, while severe alarms at night are sent by phone.
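The time-and-level rule can be sketched as a small function that picks notification channels. The working hours, the three-level scale (1 = most severe) and the channel names are assumptions made for this example.

```python
from datetime import time

# Assumed working hours.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def channels(level, at):
    """Choose notification channels from alarm level and time of day."""
    working = WORK_START <= at < WORK_END
    if level == 1:
        # Severe: phone only outside working hours, when nobody is at a desk.
        return ["sms", "email"] if working else ["phone", "sms", "email"]
    if level == 2:
        return ["email"] if working else ["sms", "email"]
    # Low severity never interrupts anyone.
    return ["email"]
```

For instance, `channels(1, time(2, 30))` includes a phone call, while the same alarm at 10:00 does not.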

There is still one problem. When alarm volume is large, especially during an alarm storm, mailboxes and phones are easily flooded with messages, so let's talk about avoiding alarm storms.

This is a large problem and still an open one in the industry; most monitoring tools address only part of it.

Here is what we have tried:

Machine learning alarm merging

When there are many alarms, follow-up handling and tracking often depend on other teams (other departments or external companies). But monitoring alarms are too fine-grained, and many alarms may describe the same thing. In the alarm storm above, for example, an application failure triggered a large number of exceptions and then a chain reaction. In fact it is a single issue, and only one thing needs to be handled.
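To make the "many alarms, one thing" idea concrete, here is a deliberately simple rule-based sketch: alarms for the same application that arrive close together are folded into one incident. This is not the machine-learning merging mentioned above; the tuple layout and the 300-second window are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    app: str
    first_seen: float
    last_seen: float
    alarms: list = field(default_factory=list)


def merge_alarms(alarms, window_seconds=300):
    """Merge (timestamp, app, message) alarms into incidents.

    An alarm joins the latest incident for its app if it arrives within
    window_seconds of that incident's most recent alarm; otherwise it
    opens a new incident.
    """
    incidents = []
    latest_by_app = {}
    for ts, app, message in sorted(alarms):
        current = latest_by_app.get(app)
        if current is not None and ts - current.last_seen <= window_seconds:
            current.alarms.append(message)
            current.last_seen = ts
        else:
            current = Incident(app=app, first_seen=ts, last_seen=ts,
                               alarms=[message])
            incidents.append(current)
            latest_by_app[app] = current
    return incidents
```

With this rule, a burst of exceptions from one failing application produces a single incident to notify on and track, rather than one message per alarm.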

Generally, front-line staff notify the responsible person directly by email or phone, but that is hard to track and analyze afterwards, so an incident management mechanism is set up.

The standard ITIL incident process is a valuable reference, and interested readers can consult it. Requirements for incident tickets:

Incident table

Priority is determined by a matrix that crosses scope of impact with urgency.
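Such a matrix can be written down directly as a lookup table. The three-step scales and the resulting 1-5 priorities below follow a commonly seen ITIL-style layout and are an assumption; the actual cell values should be whatever the team agrees on.

```python
# Assumed impact x urgency matrix; 1 is the highest priority.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("high", "low"): 3,
    ("medium", "high"): 2,
    ("medium", "medium"): 3,
    ("medium", "low"): 4,
    ("low", "high"): 3,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}


def priority(impact, urgency):
    """Look up incident priority from scope of impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

Writing the matrix as explicit data keeps the rule reviewable by people who do not read code.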

Once the on-call mechanism is in place, analysis of alarm and incident data can build a team culture driven by data and metrics, which I hope to share when I have the opportunity.

OneAlert, a product of OneAPM, is the first SaaS-based cloud alarm platform in China. It integrates mainstream monitoring and support systems from China and abroad to centralize the handling of all IT events on one platform and improve IT reliability. To read more technical articles, please visit the official OneAPM technical blog.

This article is reposted from the official OneAPM blog.