Why does server downtime usually occur at the lowest utilization rate in the early morning?

A device at our workplace went down one night. It was a stacked device with no backup, and every downlink was connected to the primary unit. Sure enough, the primary unit's power module failed in the early hours of that night. Can you see the pattern? I would also like to know why it had to fail in the early morning!

So the most you can say about accidents is that they are unpredictable!

Cutovers at night, however, are normal. It is common sense to schedule necessary work that may affect the business for the time when the fewest users are online.

First of all, I am honored to answer this question. Let's look into the problem and discuss it together.

I would like to share my personal views on this issue. I hope my answer is helpful and that you like it.

As the saying goes, a dark and windy night is the time for murder and robbery. This is when ordinary people rest, and hackers choose to be active at this time. Both intrusion attacks and DDoS attacks can bring a server down.

If you have a better answer to this question, please leave a comment so we can discuss it together.

Finally, I wish you all a happy life, happy work and good health every day, a harmonious family, and a prosperous business every year. Thank you!

16 reliable answers from experienced programmers.

There are several main reasons.

First of all, it is true that server downtime usually occurs in the early morning, when utilization is lowest, but that utilization figure only reflects user activity.

In fact, the server is very busy in the early morning. Busy with what? Mainly scheduled tasks and database backups. Many time-consuming operations, such as report statistics, are scheduled for the middle of the night so that they do not affect normal business during the day. As a result the servers run under high load at this time and are prone to failures.
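To make the point about overlapping night jobs concrete, here is a minimal sketch that counts how many scheduled jobs run at the same moment. The schedule below is hypothetical (start and end times are minutes after midnight), and `peak_concurrency` is an illustrative helper, not part of any scheduler.

```python
from typing import List, Tuple


def peak_concurrency(jobs: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (max jobs running at once, minute when that peak is first reached).

    Each job is (start_minute, end_minute) within one day; the end is exclusive.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job ends
    # Sort by time; at the same minute, process ends (-1) before starts (+1)
    # so back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    running = best = best_at = 0
    for minute, delta in events:
        running += delta
        if running > best:
            best, best_at = running, minute
    return best, best_at


# Hypothetical nightly schedule: backup 02:00-04:00, reports 02:30-03:30,
# data refresh 03:00-05:00. All three overlap from 03:00.
clustered = [(120, 240), (150, 210), (180, 300)]

# The same work staggered so that jobs run one after another.
staggered = [(0, 120), (120, 180), (180, 300)]

if __name__ == "__main__":
    print(peak_concurrency(clustered))
    print(peak_concurrency(staggered))
```

If the peak is high, staggering the start times is one option, assuming the jobs do not depend on each other finishing by a fixed time.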

Similarly, new code releases and feature changes are also scheduled for the nighttime lull in business. No matter how thorough the pre-release testing is, some bugs inevitably remain hidden. By the early morning these bugs (an infinite loop, for example) have been running for a while, and with nobody watching they can trigger all kinds of failures.

If the release is quick, that is fine. When the update is large, the programmers work until midnight. By then they are very tired and more likely to make mistakes in the rush.

Problems such as infinite loops and memory leaks take some time to show up. During the day, with real-time monitoring, the probability of a spontaneous failure is relatively small. Even if there is a fault, it can be repaired quickly enough that users do not notice.

As the saying goes, a dark and windy night is the time for murder and robbery. This is when ordinary people rest, and hackers choose to be active at this time. Both intrusion attacks and DDoS attacks can bring a server down.

Ji Ke has worked in embedded software development for many years. Recently, because the company needs back-end development work, we often choose to upgrade in the early hours of the morning. Large-scale data processing is scheduled for the same period, and that is also when the servers go down most often. Everyone starts tinkering when usage is lowest, and the more tinkering there is, the more likely server problems become. Because we make Internet of Things devices, we have run into several downtime scenarios in our work. Processing a large amount of data drives CPU usage up sharply for a period of time, which causes problems in the data-receiving module and in system monitoring, so the status of many devices can no longer be detected.

Operating on the database too frequently reduces efficiency, which is another important factor in system performance. A server is, after all, built like an ordinary computer, and its main resources are CPU and memory. Exhausting either one can bring the system down. If the CPU is saturated, the system responds extremely slowly and may hang after a long time. If memory is exhausted, the system crashes outright and cannot keep running; in effect, the kernel shuts things down.

To summarize the common causes of server downtime:

1. Disk space is full. Programmers are now used to writing logs at runtime. If this goes on for a long time with no cleanup mechanism, problems will arise sooner or later. This error usually occurs during normal operation. If you use a cloud server, the provider usually sends a text message before the system crashes, warning you that the system is on the verge of failure.

2. Concurrency performance problems. If many people operate on the same database or data block at the same time, the system can hang. This is contention for CPU resources, and it can be addressed by upgrading the hardware and optimizing the code. If the data volume is large, consider distributed management.

3. Data is damaged or destroyed, causing a system crash. It is therefore common practice to configure backup disks; if there is a problem, the backup disk takes over. Our company's servers now run on Alibaba Cloud, and stability is much better than before. We switched to a telecom cloud for a while in between. Although Tencent Cloud is cheaper, in the end I could not help switching straight to Alibaba Cloud, and I do not want to switch back. Data stability always comes first.

4. Avoidable operator error. Downtime is often caused by mistakes made by programmers or operations staff, sometimes leading to large-scale server outages. This has happened at many cloud service providers, and at root it is a management problem. Any detail of back-end administration can be the trigger.
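For the disk-space case in item 1, a cleanup mechanism for logs can be as simple as size-based rotation. The sketch below uses Python's standard `logging.handlers.RotatingFileHandler` and `shutil.disk_usage`. The helper names, the file name `app.log`, and the size limits are assumptions chosen for illustration only.

```python
import logging
import os
import shutil
from logging.handlers import RotatingFileHandler
from typing import List, Tuple


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def make_rotating_logger(log_dir: str, max_bytes: int = 1024,
                         backups: int = 2) -> logging.Logger:
    """Create a logger whose files on disk are capped by rotation.

    At most `backups + 1` files are kept; the oldest is deleted on rollover.
    """
    # One logger per directory, so repeated calls with different directories
    # do not stack handlers on a shared logger.
    logger = logging.getLogger("rotating-demo-" + log_dir)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = RotatingFileHandler(
        os.path.join(log_dir, "app.log"),
        maxBytes=max_bytes,
        backupCount=backups,
    )
    logger.addHandler(handler)
    return logger


def log_footprint(log_dir: str) -> Tuple[List[str], int]:
    """Return the log file names in `log_dir` and their combined size in bytes."""
    files = sorted(os.listdir(log_dir))
    total = sum(os.path.getsize(os.path.join(log_dir, f)) for f in files)
    return files, total
```

A monitoring script could call `disk_used_fraction` periodically and alert above a chosen threshold, for example 0.9, before the disk actually fills. The threshold is a judgment call, not something the answer above specifies.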

Several clues for diagnosing server downtime:

1. Check the server for memory leaks. Sometimes the machine runs normally after a restart and then becomes very slow after a while. Nine times out of ten, that is a memory problem.

2. Check whether hackers are responsible. Critical, important data is also what interests hackers most. Generally speaking, though, the probability is not very high.

3. Check whether it is caused by a database deadlock, too much traffic, or too many connections.
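For the third clue, one common way to keep "too many connections" from taking the database down is to cap how many callers can hold a connection at once. The sketch below is a stand-in built on `threading.BoundedSemaphore`. `BoundedPool` is a hypothetical class for illustration, and `work` stands for whatever query the caller would run; a real database driver's pool would replace it.

```python
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BoundedPool:
    """Allow at most `max_connections` callers to run database work at once."""

    def __init__(self, max_connections: int, acquire_timeout: float) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = acquire_timeout
        self._lock = threading.Lock()   # protects the counters below
        self.in_use = 0
        self.peak = 0
        self.rejected = 0

    def run(self, work: Callable[[], T]) -> Optional[T]:
        """Run `work` while holding a slot; return None if no slot frees up in time."""
        if not self._slots.acquire(timeout=self._timeout):
            # Fail fast instead of queueing forever and hanging the caller.
            with self._lock:
                self.rejected += 1
            return None
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        try:
            return work()
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()


def simulate(pool: BoundedPool, callers: int, seconds_per_call: float) -> None:
    """Start `callers` threads that each run one simulated query through the pool."""
    threads = [
        threading.Thread(target=pool.run, args=(lambda: time.sleep(seconds_per_call),))
        for _ in range(callers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The timeout is the important design choice here: a caller that cannot get a slot gives up and is counted, rather than piling up behind a stuck query.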

Once the server goes down, it will cause countless complaints from users. No matter what happens, stability always comes first. Unless a major feature upgrade has been 100% verified, do not release it, or the consequences will be unimaginable.

I hope this helps.

A general answer from a Huawei maintenance engineer:

1. Business: scheduled system tasks, such as nightly statistical reports, task refreshes, data refreshes, data backups, and so on. These all run in the early hours of the morning. At that time CPU, memory, space (disk/database), and I/O (disk reads and writes) usage can all be very high, so downtime or resource exhaustion may occur.

2. Operations: operations such as cutover, upgrade, patching, or rectification may trigger downtime. In many cases, processes, services, or systems need to be restarted.

3. Bugs: whether in the Linux system or in the business system, there may be bugs that cause a system crash or server downtime. This can also happen during the day.

4. Hardware problems: hardware such as boards and disks ages gradually over its service life. For example, the disks in a disk array are prone to failure.

5. Traffic bursts: a sudden surge produces a large amount of data, causing transmission and traffic congestion. The disk space or the database tablespace may also fill up and cause problems. Anything can go wrong.

My understanding may be shallow in places, because this kind of problem runs deep and there is more to think about. Treat this as an attempt, written up as notes.

Downtime generally falls into five situations:

1. The program crashed because of a problem.

2. CPU/GPU and memory are full.

3. Hard disk space is full.

4. The database tablespace is full.

5. The room temperature is too high.

The above are problems I have personally encountered in operations and maintenance, summarized as an answer.

Although few users use the system in the early hours of the morning, the server may have a lot of work to do at this time.

Let me share a server downtime story that I saw a long time ago, shared by a peer. Some experiences are amazing. Treat it as a joke (for convenience, I'll tell it in the first person).

Our client is a hospital, and the server room is in the hospital building. Recently the server in the server room kept going down. The company's engineers went several times and found nothing wrong. Eventually the company ran out of ideas and decided to have an engineer stay in the server room overnight to see what happened there in the middle of the night, reasoning that even if the cause could not be found, the server could at least be restarted as soon as it stopped.

Then I found out why. At three or four o'clock in the morning, the door of the server room opened and a young nurse on the night shift came in. She looked around and said, "Nobody's here. Isn't it a waste of electricity to leave the air conditioner on?" Then she turned off the air conditioner in the server room, and the temperature rose...

Server downtime means that the server cannot run normally for some reason, which disconnects the network and makes it unusable. Server downtime usually occurs in the early hours of the morning. Why? For example, our company produces high-tech Internet equipment. To avoid affecting normal production, system upgrades are usually done in the early hours of the morning, and a lot of data processing is also carried out then. The server is prone to problems at this time. Specifically, the reasons are as follows:

1. When the system is upgraded or a large amount of data is processed, the hard disk can fill up. If no one clears disk space in time, the server stalls and goes down.

2. If multiple devices are running and using the same database at once, the system can hang because they compete for CPU resources. Server load soars. A surge in website visits, an infected program, or many applications consuming server resources at once have the same effect, and in the end the server crashes and stops responding.

3. Because fewer maintenance staff are on duty in the early hours of the morning, environmental factors such as power failure or high temperature can lead to a server crash. This is rare, however, because the server room has a generator to prevent data loss from power failures, and the temperature is kept constant by a climate control system.

4. To save on server costs, some enterprises rent low-spec servers and run heavy workloads on them. The servers are overloaded, and frequent downtime is the predictable result.

Generally speaking, server downtime has a lot to do with memory. Some servers slow down after running for a while, which is basically a memory problem. Check for memory leaks.
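One rough way to check for a leak is to sample the process's memory at regular intervals and see whether it keeps climbing. The sketch below fits a least-squares line to the samples and flags steady growth. The sample values and the 1 MB per sample threshold are made up for illustration. How the samples are collected (for example from a monitoring agent) is left out.

```python
from typing import Sequence


def growth_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of memory usage against sample index.

    A clearly positive slope over a long run suggests memory that is never freed.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(samples: Sequence[float], min_slope: float) -> bool:
    """True if usage grows by at least `min_slope` units per sample on average."""
    return growth_per_sample(samples) >= min_slope


# Hypothetical readings in MB, taken at a fixed interval.
leaking = [500 + 4 * i for i in range(10)]           # climbs 4 MB every sample
steady = [500, 503, 499, 501, 500, 502, 498, 501]    # fluctuates around 500 MB

if __name__ == "__main__":
    print(looks_like_leak(leaking, min_slope=1.0))
    print(looks_like_leak(steady, min_slope=1.0))
```

A trend like this only narrows things down. Caches that fill up and then level off also grow at first, so the samples should cover a long enough period.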

A series of problems follows when a server goes down, and the losses can be immeasurable. Perform regular maintenance and pay attention to early-morning usage to avoid downtime. At all times, stable operation of the server matters most.

What does server downtime mean? The "down" in the everyday word "downtime" is simply the English word "down", meaning that the server or service is currently unresponsive or offline.

Server downtime can be divided into human-controlled downtime and uncontrollable downtime. What is the difference between them? In detail:

1. Human-controlled shutdown

Running a server for a long time can bring some (non-fatal) problems, and when we need to upgrade or maintain the server's software or hardware, we may need to stop or restart it. Downtime in this case is controllable and within our plan.

2. Uncontrolled shutdown

There are many such causes: a sudden blue screen on the server, an abnormal crash of a service, a sudden power outage, or a lost network connection. In these cases the service (or server) cannot operate normally, and the cause is beyond our control.

In day-to-day operations work, when we plan maintenance downtime we usually schedule it for the middle of the night. Why? There are several main reasons:

1. To reduce the impact on users.

Almost everyone is resting in the early morning, and there are far fewer users than during the day. Downtime caused by system and hardware maintenance at this time therefore has little impact on users, and at worst it affects only a small number of them.

2. To have enough time to deal with faults.

If maintenance is carried out in the early hours of the morning, the technicians have enough time (for example, 00:00-05:00) to deal with any problem that arises. If the maintenance were done during the day instead, complaints would pour in after just 1 hour of service (or equipment) downtime, which is very stressful.
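A deployment script can enforce such a window with a simple guard. The sketch below assumes the 00:00-05:00 window from the example above as its default. The function name is hypothetical, and it also handles windows that cross midnight.

```python
from datetime import datetime, time


def in_maintenance_window(now: datetime,
                          start: time = time(0, 0),
                          end: time = time(5, 0)) -> bool:
    """True if `now` falls inside the window; `start` is inclusive, `end` exclusive."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight, for example 23:00-04:00.
    return current >= start or current < end


if __name__ == "__main__":
    if in_maintenance_window(datetime.now()):
        print("Inside the maintenance window: safe to start planned work.")
    else:
        print("Outside the maintenance window: postpone planned work.")
```

The check uses the local time of whatever `datetime` is passed in, so servers in several time zones would need an agreed reference zone.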

In fact, the principle is simple. During the day we are busy with many things, like porters who keep carrying goods into the warehouse. Only when all the goods have been delivered can we start sorting them and tidying the warehouse.

Likewise, during the day the server is in "porter" mode, processing data in real time. Only after the real-time processing is finished does it have the opportunity or capacity to consolidate and organize the data. That is why server downtime usually occurs during the period of lowest utilization. That's it.