Block device timeout handling can hang I/O for up to 6 minutes — who can tolerate this bug in a general-purpose OS?
毕须说  2024-07-30 16:16   published in China

Recently I visited a financial customer who reported that, after a domestic database went live, the business was suspended for more than 6 minutes because of a single NVMe SSD timeout. The logs show that the NVMe SSD hung and the I/O entered the Linux block-device timeout handling path. In a general-purpose OS, that handling works roughly as follows:

1) after an I/O is issued, the default timeout is 30 seconds; if the I/O has not returned by then, timeout handling starts;

2) an Abort is issued for the I/O, telling the lower-level driver and the disk to discard it; the abort itself has a 30-second timeout;

3) if the abort fails, a controller reset is issued, i.e. a soft reset of the SSD controller; this step has a 12-second timeout.

After the above steps succeed, the I/O is re-queued and sent to the disk again. If it times out again, the whole timeout procedure restarts, with a maximum of 5 retries. In other words, an I/O can hang here for up to (30 + 30 + 12) seconds × 5 retries = 360 seconds, about 6 minutes. During this period the I/O cannot be returned to the upper-layer database, so the entire RAID group hangs and the business hangs with it.
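To make the arithmetic concrete, here is a minimal Python sketch of the worst-case stall a single I/O can suffer under this retry policy. The constants are the values quoted above, treated as illustrative defaults rather than authoritative kernel settings for every driver.

```python
# Worst-case stall of one I/O under the timeout -> abort -> controller-reset
# retry cycle described above. Constants are the values quoted in the article,
# not guaranteed kernel defaults for every driver.

IO_TIMEOUT_S = 30          # wait before the I/O is declared timed out
ABORT_TIMEOUT_S = 30       # time allowed for the abort to complete
CTRL_RESET_TIMEOUT_S = 12  # time allowed for the controller soft reset
MAX_RETRIES = 5            # times the whole cycle may repeat

def worst_case_stall_seconds() -> int:
    per_cycle = IO_TIMEOUT_S + ABORT_TIMEOUT_S + CTRL_RESET_TIMEOUT_S
    return per_cycle * MAX_RETRIES

if __name__ == "__main__":
    total = worst_case_stall_seconds()
    print(f"worst case: {total} s ~= {total / 60:.0f} minutes")  # 360 s ~= 6 minutes
```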

This is, in fact, a serious flaw in Linux block-device timeout handling. I remember this same I/O timeout problem occurring frequently in the storage back-end system in 2009. I did not expect a 14-year-old problem to resurface today, now that distributed architectures are so popular: a huge number of services run on local server disks and carry heavy I/O traffic, so if one SSD timeout or bug can hang an entire business, who can tolerate that?

In fact, when this bug was discovered 14 years ago, the storage system was reworked to improve reliability:

1) Host I/O first arrives at the storage system's front-end LUN cache. Writes are mirrored in cache and acknowledged to the host, and read hits are served directly from cache; only the I/O that misses the cache actually goes down to the disks.

2) Most importantly, the storage system detects slow disks. If a disk's I/O latency exceeds a few seconds, the array switches to RAID degraded read/write, reconstructing the data from the other RAID member disks and returning it to the upper layer, so the service recovers before the fault is even diagnosed (see the sketch after this list).

3) Meanwhile, the background error-handling process continues:

3.1) when the disk's I/O timeout reaches Y seconds (shortened from the original 30 seconds), timeout error handling begins;

3.2) the I/O is still aborted first (with a 12-second timeout);

3.3) if the abort fails, a soft reset of the SSD controller is performed;

3.4) if that still fails, the disk is power-cycled (hardware reset). After a successful reset, commands are issued to check whether I/O latency is back to normal; if so, the disk is brought back into the system, otherwise it is isolated and taken offline.
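As an illustration of the degraded read in point 2), here is a minimal Python sketch of RAID-5-style reconstruction: when one member disk stalls, its strip can be recomputed by XOR-ing the strips of the surviving members and the parity. The stripe layout and names are illustrative, not the actual array's implementation.

```python
# Minimal illustration of a RAID-5-style degraded read: the data of a
# stalled member disk is reconstructed by XOR-ing the corresponding
# strips of all surviving members (including the parity strip).

from functools import reduce

def xor_strips(strips: list[bytes]) -> bytes:
    """XOR equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips)

def degraded_read(stripe: list[bytes], failed_index: int) -> bytes:
    """Recompute the strip of the failed/stalled disk from the survivors."""
    survivors = [s for i, s in enumerate(stripe) if i != failed_index]
    return xor_strips(survivors)

if __name__ == "__main__":
    d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
    parity = xor_strips([d0, d1, d2])                   # parity strip of the stripe
    stripe = [d0, d1, d2, parity]
    assert degraded_read(stripe, failed_index=1) == d1  # d1's disk hung; data still served
    print("degraded read recovered the stalled disk's data")
```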

All of this error handling happens in the background; front-end I/O is served through degraded RAID read/write the whole time. The I/O is restored first and the fault repaired afterwards, rather than repairing the fault before restoring the I/O as a general-purpose OS does, so the business never hangs.
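The background escalation in steps 3.1-3.4 amounts to a small state machine: abort, then controller soft reset, then power-cycle, then a latency check that decides between reinstating and isolating the disk. Below is a minimal Python sketch of that flow; the function names and timeouts are hypothetical placeholders, not the storage system's real interfaces.

```python
# Illustrative sketch of the background escalation in steps 3.1-3.4.
# Each try_* callback stands in for a real driver/firmware operation and
# returns True on success; the names and timeouts are hypothetical.

def handle_stuck_io(disk, try_abort, try_soft_reset, try_power_cycle, latency_ok) -> str:
    """Escalate until the disk recovers or is isolated. Front-end I/O is
    already being served via degraded RAID, so this never blocks the host."""
    if try_abort(disk, timeout_s=12):
        return "recovered-after-abort"
    if not try_soft_reset(disk):          # SSD controller soft reset
        if not try_power_cycle(disk):     # last resort: power-cycle the slot
            return "isolated"             # could not reset: take the disk offline
    # a reset succeeded: verify latency before putting the disk back in service
    return "reinstated" if latency_ok(disk) else "isolated"

if __name__ == "__main__":
    # toy run: abort fails, soft reset succeeds, latency is back to normal
    result = handle_stuck_io(
        "nvme0n1",
        try_abort=lambda d, timeout_s: False,
        try_soft_reset=lambda d: True,
        try_power_cycle=lambda d: True,
        latency_ok=lambda d: True,
    )
    print(result)  # -> "reinstated"
```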

Today's distributed systems rely on large numbers of not-so-reliable local server disks, yet this error-handling mechanism has never been reworked: a single hung disk takes a long time to recover and the service is interrupted. Many small vendors have immature server SSD development processes, and their manufacturing and incoming-material quality control are not up to standard, so bugs are to be expected. Who is responsible for this problem: the database vendor, or the server, OS, or disk vendor?

Perhaps the database vendor will say: even if the lower layer hangs for a few minutes, the database's own health-check mechanism fails over to the slave node within tens of seconds, so there is no problem. But if the master node's cached data has not been flushed to disk, doesn't that mean data loss, or even data inconsistency? That is even more fatal.

The right way to handle a hang in the lower layer is to recover the business first and then quickly isolate the fault, so that the fault converges in a closed loop, instead of a single disk fault spreading into a node-level fault and possibly bringing down the whole cluster. If the database ran on LUNs from external enterprise storage, faults like this would be handled cleanly, with higher reliability, higher utilization, and better latency. Why use unreliable server disks instead? This is also the core logic of country A's architecture: professional people do professional things, and professional domains need professional companies to keep mastering them.

 

Source: Bi xunshuo
