With the rapid development of the digital economy and the adoption of new technologies such as autonomous driving, big data, artificial intelligence, and high-performance computing, data volumes are growing explosively, and data formats and access patterns are becoming more diverse. This places higher demands on storage capacity and performance. Scale-out distributed storage is gradually becoming the preferred foundation for massive and diverse data applications thanks to its high scalability, multi-protocol access, high performance, high reliability, and open ecosystem. At the same time, as data volumes grow, energy-efficient storage has become a key requirement. Compared with traditional mechanical hard disks, flash memory has clear advantages: at the same capacity, flash can save more than 65 percent of the energy consumed by mechanical hard disks, and it delivers higher performance. Flash can therefore improve application performance while reducing overall storage energy consumption, helping to accelerate the development of the data economy.
Although flash memory has many advantages, moving to all-flash storage is still a big challenge. For example, compared with mechanical hard disks, flash memory is still more expensive and has a limited service life. Let's take a look at how OceanStor Pacific scale-out storage meets this challenge.
Basic concepts and technical difficulties
Basic working principle of SSD
To better understand SSD lifespan, here are some basic principles of SSDs. NAND flash is generally used as the storage medium in an SSD. The two main concepts in NAND flash are the block and the page. A block consists of several pages; the block is the smallest unit for NAND flash erase operations, and the page is the smallest unit for read and write operations. Data can be written directly to a page that has not yet been written. However, a page that already holds data cannot be overwritten in place: it must be erased first, and erasure is performed at block granularity. Repeatedly erasing the same block over a long period not only causes write performance problems but also greatly shortens the service life of the SSD. To solve this problem, SSDs implement a wear leveling mechanism, which relies on reserving a portion of the SSD's capacity as over-provisioning (OP) space and uses remapping to defer erasure and spread wear evenly across blocks. When a page needs to be rewritten, the new data is written to a different location, and the old page is marked as garbage data, to be reclaimed periodically by a background garbage collection (GC) task. Generally, SSDs reserve additional OP space to balance the lifetime and performance of the disk.
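The page and block rules above can be illustrated with a toy model. This is a minimal sketch, not how any real SSD firmware works: the class `ToyFlash`, its fields, and the "most stale pages first" victim choice are all assumptions made for illustration, and the model deliberately omits wear leveling.

```python
STALE = "stale"  # marker for a page whose data has been superseded


class ToyFlash:
    """Toy NAND model: a page is programmed once; only whole blocks are erased."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        # Each page holds None (erased), a logical page number, or STALE.
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.mapping = {}                      # logical page -> (block, page)
        self.erase_counts = [0] * num_blocks
        self.host_writes = 0                   # writes requested by the host
        self.nand_writes = 0                   # pages actually programmed

    def _free_page(self):
        for b, block in enumerate(self.blocks):
            for p, state in enumerate(block):
                if state is None:
                    return b, p
        return None

    def _program(self, lpn):
        b, p = self._free_page()
        self.blocks[b][p] = lpn
        self.mapping[lpn] = (b, p)
        self.nand_writes += 1

    def _garbage_collect(self):
        # Pick the block with the most stale pages, save its live pages,
        # erase the whole block, then re-program the live pages.
        victim = max(range(len(self.blocks)),
                     key=lambda b: self.blocks[b].count(STALE))
        if self.blocks[victim].count(STALE) == 0:
            raise RuntimeError("device full: nothing to reclaim")
        survivors = [s for s in self.blocks[victim]
                     if s is not None and s != STALE]
        self.blocks[victim] = [None] * len(self.blocks[victim])
        self.erase_counts[victim] += 1
        for lpn in survivors:
            self._program(lpn)

    def write(self, lpn):
        self.host_writes += 1
        if lpn in self.mapping:                # rewrite: old copy becomes garbage
            b, p = self.mapping[lpn]
            self.blocks[b][p] = STALE
        if self._free_page() is None:
            self._garbage_collect()
        self._program(lpn)


flash = ToyFlash(num_blocks=4, pages_per_block=4)
for lpn in range(12):          # fill 12 of 16 pages; the rest acts as spare space
    flash.write(lpn)
for _ in range(20):            # repeatedly rewrite a few hot logical pages
    for lpn in (0, 5, 10):
        flash.write(lpn)

write_amplification = flash.nand_writes / flash.host_writes
```

Inspecting `write_amplification` and `flash.erase_counts` after the run shows the two effects the text describes: garbage collection programs more pages than the host asked for, and without wear leveling the erases pile up on a small number of blocks.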
Append mechanism
The previous section briefly introduced the wear leveling mechanism of SSDs. Wear leveling reduces write amplification and extends disk life to a certain extent. However, if the upper-layer storage software still uses the traditional write-in-place mode, GC inside the disk is aggravated, which hurts both performance and service life. Using ROW (Redirect on Write), in which every write is an append, better suppresses in-disk GC and reduces write amplification. As shown in the following figure, each overwrite is written to a new location instead of overwriting the old one. The advantage is that GC inside the disk is suppressed as much as possible. The disadvantage is that the storage system itself must add a garbage collection mechanism. Balancing these two layers of GC is a great challenge for upper-layer storage systems, especially large-scale scale-out systems.
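A minimal sketch of the append-only idea, assuming a single in-memory log and index. The names `AppendOnlyStore`, `log`, `index`, and `garbage` are hypothetical and chosen only for illustration; a real storage system would append to on-disk segments.

```python
class AppendOnlyStore:
    """Redirect-on-write sketch: an overwrite appends; old data is never touched."""

    def __init__(self):
        self.log = []       # append-only sequence of (key, value) records
        self.index = {}     # key -> offset of the latest record in the log
        self.garbage = 0    # superseded records awaiting system-level GC

    def write(self, key, value):
        if key in self.index:
            self.garbage += 1              # old record becomes garbage in place
        self.index[key] = len(self.log)    # redirect the key to the new location
        self.log.append((key, value))

    def read(self, key):
        return self.log[self.index[key]][1]


store = AppendOnlyStore()
store.write("a", 1)
store.write("b", 2)
store.write("a", 3)     # overwrite of "a" lands at a new offset
```

After the three writes the log still contains the original record for `"a"`; only the index moved. That leftover record is exactly the garbage the storage system must later reclaim with its own GC.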
NVMe SSD
Most mechanical hard disks use SATA or SAS interfaces. With SATA and the AHCI protocol, the maximum queue depth is 32, which means that up to 32 commands can be queued and processed at a time. In the early days, the performance bottleneck of storage systems was the mechanical disk rather than the protocol or access interface. As SSD media technology developed (NAND flash has much higher performance than traditional mechanical disks), disks became faster and faster, and the bottleneck shifted to the access protocol and interface. NVMe was created to address this. NVMe stands for Non-Volatile Memory Express. The NVMe standard is designed for PCIe SSDs, whose PCIe lanes connect directly to the CPU, whereas the traditional path reaches the CPU through the southbridge controller. NVMe also raises the maximum queue depth from 32 to about 64,000 commands per queue. This huge performance improvement shifts the I/O bottleneck upward from the disk: traditional storage software stacks can no longer make full use of NVMe SSD performance.
OceanStor Pacific Solution
The previous section briefly described the basic principles of SSDs and the challenges they pose for storage systems. Maximizing SSD performance while also addressing SSD lifespan and cost is a big challenge for storage systems. OceanStor Pacific makes several attempts to address it.
Data and control separation architecture
On the traditional I/O path, data passes through many processing steps between the application and the disk medium that actually stores it. Along this path, the protocol wall, the I/O wall, the memory wall, and the computing power wall together form a huge bottleneck for storage performance, so the performance advantages of NVMe SSDs cannot be fully realized. OceanStor Pacific adopts an architecture that separates the data path from the control path. Data is transferred directly from the NIC to the disk, which eliminates intermediate multi-component and multi-protocol processing, avoids CPU involvement along the way, and removes multiple memory copies on the I/O path, creating a highway for data. Metadata uses a global sharding strategy that lets all nodes process metadata evenly, avoiding a metadata bottleneck and supporting scale-out expansion to massive amounts of data.
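The article does not say how metadata shards are assigned to nodes. As one common way to get an even spread, the sketch below hashes each metadata key to a fixed shard and maps shards to nodes round-robin. The shard count, node names, key format, and the functions `shard_of` and `node_of` are all assumptions for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 1024   # assumed fixed number of metadata shards


def shard_of(key: str) -> int:
    """Map a metadata key to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def node_of(shard: int, nodes: list) -> str:
    """Assign shards to nodes round-robin (hypothetical placement rule)."""
    return nodes[shard % len(nodes)]


nodes = ["node-1", "node-2", "node-3", "node-4"]
load = Counter(node_of(shard_of(f"/dir/file-{i}"), nodes) for i in range(10_000))
```

With a uniform hash, `load` ends up close to one quarter of the keys per node, so no single node becomes the metadata bottleneck. Keeping the shard count fixed and moving whole shards is one way to add nodes without rehashing every key.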
Based on the SSD principles described above, OceanStor Pacific uses the append mechanism for data persistence: every write is allocated new SSD space, and old data must be garbage collected by the storage system. OceanStor Pacific uses a Global GC mechanism backed by intelligent algorithms to accurately identify the amount of garbage across the whole system and suppress in-disk GC without affecting front-end services. As a result, the disk only needs to reserve a small amount of OP space and can release the rest to users, which extends disk service life and further reduces disk cost.
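The article does not describe the Global GC algorithm itself. To make the idea of "identifying garbage globally" concrete, here is a generic greedy heuristic, not OceanStor Pacific's actual method: look at segments across all nodes and reclaim the ones with the highest garbage ratio first, since they free the most space for the least data movement. `Segment`, `pick_gc_victims`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    node: str
    seg_id: int
    total_bytes: int
    garbage_bytes: int

    @property
    def garbage_ratio(self) -> float:
        return self.garbage_bytes / self.total_bytes

    @property
    def live_bytes(self) -> int:
        return self.total_bytes - self.garbage_bytes


def pick_gc_victims(segments, bytes_to_reclaim):
    """Greedy selection: highest garbage ratio first, across all nodes."""
    victims, reclaimed = [], 0
    for seg in sorted(segments, key=lambda s: s.garbage_ratio, reverse=True):
        if reclaimed >= bytes_to_reclaim or seg.garbage_bytes == 0:
            break
        victims.append(seg)
        reclaimed += seg.garbage_bytes
    return victims


segments = [
    Segment("node-A", 1, total_bytes=100, garbage_bytes=90),
    Segment("node-A", 2, total_bytes=100, garbage_bytes=10),
    Segment("node-B", 1, total_bytes=100, garbage_bytes=60),
    Segment("node-B", 2, total_bytes=100, garbage_bytes=0),
]
victims = pick_gc_victims(segments, bytes_to_reclaim=120)
moved = sum(seg.live_bytes for seg in victims)   # live data that must be relocated
```

Here the two dirtiest segments are chosen regardless of which node they live on, and only their small live remainder has to be rewritten.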
LSM index
The ROW mechanism described above solves the overwrite problem for data, but it incurs a large amount of metadata management overhead, and heavy metadata writes still affect SSD life. To reduce the number of times metadata is flushed to disk and improve the efficiency of metadata writes, OceanStor Pacific uses an LSM tree (Log-Structured Merge Tree) to manage metadata. It converts a large number of metadata modifications into large sequential writes, making full use of SSD write performance. As shown in the following figure, metadata inserts and modifications first enter the in-memory memtable, and are then aggregated into large blocks and serialized to the SSD. Levels are merged one after another, L0 -> L1 -> ... -> L_last. Each time a level is merged into the next, a new tree is generated; after the merge completes, the metadata blocks of the current level are released as a whole. For the SSD, this amounts to writing large contiguous blocks and releasing large contiguous blocks, with allocation and release at whole-block granularity, which greatly reduces garbage collection.
In short, OceanStor Pacific uses the LSM tree to merge many small metadata updates into large sequential writes. Metadata is compressed before it is actually flushed to disk to reduce the number of I/O operations, which turns an IOPS bottleneck into a throughput bottleneck and improves write performance.
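The memtable-then-merge flow can be sketched in a few lines. This is a simplified two-level illustration under stated assumptions, not the product's implementation: `TinyLSM`, its `memtable_limit`, and the in-memory "runs" standing in for on-SSD blocks are all hypothetical.

```python
class TinyLSM:
    """Two-level LSM sketch: small updates batch in memory, then flush as sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}           # in-memory buffer for recent updates
        self.levels = [[], []]       # levels[0]: runs, newest last; levels[1]: merged run
        self.sequential_flushes = 0  # number of large sequential writes issued

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One large sorted write replaces many small random updates.
        self.levels[0].append(sorted(self.memtable.items()))
        self.memtable = {}
        self.sequential_flushes += 1

    def compact(self):
        # Merge L0 into L1 as a new run, then release the old runs wholesale.
        merged = dict(self.levels[1][0]) if self.levels[1] else {}
        for run in self.levels[0]:   # oldest first, so newer values win
            merged.update(run)
        self.levels[1] = [sorted(merged.items())]
        self.levels[0] = []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in list(reversed(self.levels[0])) + self.levels[1]:
            entries = dict(run)
            if key in entries:
                return entries[key]
        return None


lsm = TinyLSM(memtable_limit=4)
for i in range(10):
    lsm.put(f"inode-{i % 6}", i)   # ten small metadata updates over six keys
flushes_before_compact = lsm.sequential_flushes
lsm.compact()
```

Ten small updates reach storage as a handful of large sorted writes, and compaction drops whole L0 runs at once, which matches the large-block write and large-block release pattern described above.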
As mentioned above, efficient GC in the storage system suppresses in-disk GC, which not only improves performance but also lets the disk release more reserved OP space to users. However, common storage systems cannot coordinate closely with standard SSD controllers. For example, the on-disk wear leveling mechanism described earlier is largely invisible to the upper-layer storage system. As a result, when multiple mixed workloads run at the same time, their I/O competes for the SSD, and GC at the upper and lower layers still runs concurrently. The disk cannot deliver its full performance, and the SSD must also reserve more OP space for garbage collection, which greatly reduces the cost-effectiveness of the system. OceanStor Pacific adopts self-developed palm-sized and half-palm SSDs. The storage system and the self-developed SSD controller chip work together through FlashLink technology, which combines disk and controller. The self-developed disk provides a native collaboration interface and supports fine-grained interaction with the storage software. It allows priorities to be set by service type and separates hot data from cold data. Data with a high probability of becoming garbage is defined as hot data, and data that is modified less frequently is defined as cold data. Combined with the disk's multi-stream technology, hot and cold data are stored in different blocks, which increases the probability that all the data in a block becomes invalid at the same time. This reduces the amount of data moved during garbage collection, makes full use of SSD performance, effectively suppresses garbage collection inside the disk, and frees up more OP space for users.
Data reduction
The technologies above focus mainly on disk-controller collaboration: suppressing or avoiding in-disk GC, releasing more in-disk OP space to users, and reducing SSD cost while delivering SSD performance and extending disk service life. However, for SSDs, the most fundamental way to extend disk life is to reduce the number of I/Os written to disk. To this end, OceanStor Pacific products provide adaptive global deduplication in both the foreground and the background, along with compression based on the self-developed HZ10 efficient algorithm, to minimize the number of I/Os written to disk.
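How deduplication and compression cut the number of disk writes can be shown with a small sketch. It assumes fixed-size chunks fingerprinted with SHA-256, and it uses `zlib` purely as a stand-in because the HZ10 algorithm is proprietary and not described here. `ReducingStore` and its fields are hypothetical names.

```python
import hashlib
import zlib


class ReducingStore:
    """Inline dedup + compression sketch: identical chunks are stored only once."""

    def __init__(self):
        self.chunks = {}       # fingerprint -> compressed chunk
        self.disk_writes = 0   # chunks actually written to the medium

    def write(self, chunk: bytes) -> str:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in self.chunks:         # new content: compress and store
            self.chunks[fingerprint] = zlib.compress(chunk)
            self.disk_writes += 1
        return fingerprint                          # duplicates only add a reference

    def read(self, fingerprint: str) -> bytes:
        return zlib.decompress(self.chunks[fingerprint])


store = ReducingStore()
block_a = b"A" * 4096
block_b = bytes(range(256)) * 16
refs = [store.write(chunk) for chunk in (block_a, block_b, block_a, block_a)]
```

Four host writes result in two writes to the medium, and each stored chunk is smaller than its original, so both the I/O count and the bytes written go down.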
Summary
OceanStor Pacific products maximize SSD performance and balance it against disk lifespan through the data and control separation architecture, append-only writes, Global GC, LSM tree-based indexing, and online data reduction. Going forward, OceanStor Pacific will continue to explore lower-cost SSDs (QLC) and deeper disk-controller collaboration with cutting-edge techniques to further balance SSD performance and lifetime, achieve the goal of replacing HDDs in all scenarios, and promote the green and sustainable development of the digital economy.