This article is reprinted from the "Data Storage Zhang" public account.
As SSD prices continue to decline, SSDs account for a growing share of storage products. Many vendors have introduced all-flash arrays, such as EMC PowerStore, Huawei Dorado, and Pure Storage. Unlike mechanical hard disks, SSDs are not limited to the traditional disk form factor. They come in many form factors, such as PCIe add-in cards, M.2 modules, and disk-style drives, as shown in the figure.
We will not dwell on the appearance. Today we go inside the SSD and look at its internal hardware and software architecture in more detail, along with the characteristics of how it reads and writes data.
Internal structure of SSD
The internal structure of an SSD is shown in the figure. It includes three core components: the controller, the cache, and the flash chips. The controller is similar to a computer's CPU, the cache is similar to its memory, and the flash chips are the persistent storage medium, similar to the platters of a mechanical hard disk.
In addition to the above core components, an interface is required to establish a connection with the host, which is called the host interface. Interfaces have different specifications, such as SATA, SAS, and NVMe.
Flash controller
In fact, an SSD is itself a miniature computer. The controller can be thought of as its CPU: it is the heart and brain of the SSD. The controller connects the host to the other components inside the SSD. When the host sends data to the SSD, the flash controller directs the data flow to ensure reliable storage and retrieval. It also runs the SSD firmware and performs background tasks such as managing the flash file system, wear leveling, error correction, TRIM handling, and garbage collection.
Volatile memory
This component is similar to the memory in a computer. It serves as the SSD's own memory and holds data temporarily. Not all SSDs have it, but enterprise SSDs usually do. Because it is volatile, it needs power to retain information. The firmware in the controller decides when data is flushed from the volatile (non-persistent) memory to the non-volatile (persistent) flash memory. In an unexpected power failure, data in the cache may be lost or corrupted unless an effective power-loss protection mechanism is in place.
Non-volatile flash memory
Data is permanently stored in, and retrieved from, NAND flash memory chips. They are non-volatile because they retain data even when the SSD has no power.
Logical structure of SSD
Even a very small flash chip is complicated inside. As shown in the figure, if we zoom in on a flash chip we can see its main internal components (this is the logical composition). Understanding this part helps a great deal in understanding the SSD access characteristics described below.
The chips we see are actually flash memory packages; what really does the work is the silicon inside. Each package contains one or more silicon chips, each called a Die. Each Die has several Planes, each Plane has many Blocks, and each Block contains many Pages. The Block is the smallest unit for erasing data, while the Page is the smallest unit for reading and writing.
Take a 128Gb (gigabit) flash die produced by Micron as an example: each Die has 2 Planes, each Plane has 1024 Blocks, and each Block has 512 Pages. The Page size is 16KB. From these figures we can calculate that the Block size is 16KB * 512 = 8MB and the Plane size is 8MB * 1024 = 8GB, so the Die size is 8GB * 2 = 16GB, that is, 128Gb.
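The arithmetic above can be checked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, the sizes are treated as binary units (1KB = 1024 bytes), and the geometry is the one quoted in the example, which varies between vendors and models.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Geometry quoted in the example above (vendor- and model-specific).
PAGE_SIZE = 16 * KIB
PAGES_PER_BLOCK = 512
BLOCKS_PER_PLANE = 1024
PLANES_PER_DIE = 2

block_size = PAGE_SIZE * PAGES_PER_BLOCK      # smallest erase unit
plane_size = block_size * BLOCKS_PER_PLANE
die_size = plane_size * PLANES_PER_DIE

print(block_size // MIB, "MB per block")      # 8
print(plane_size // GIB, "GB per plane")      # 8
print(die_size // GIB, "GB per die =", die_size * 8 // GIB, "Gbit")  # 16 GB = 128 Gbit
```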
So the minimum write unit of this chip is 16KB, while the minimum erase unit is 8MB. These numbers deserve close attention in later software design.
SSD access features
As mentioned earlier, the minimum unit for SSD reads and writes is the Page, and the minimum unit for erasing is the Block. For example, when an application writes data, the minimum granularity is 16KB: even if only one byte is written, it occupies 16KB of space on the SSD. If you then want to write one more byte at an adjacent location, you cannot rewrite the original page. You must choose a new page, and the original page is marked as released. Released pages cannot be rewritten directly, even after every page in a Block has been released; the entire 8MB Block must be erased before its pages can be written again. The page and block sizes above depend on the device specification and differ between vendors' products.
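The page and block rules can be captured in a toy model. The sketch below is not how any real controller is implemented; the class and method names (PageState, Block, program, release, erase) are invented for illustration, and the block is shrunk to 4 pages to keep the example short.

```python
from enum import Enum


class PageState(Enum):
    FREE = "free"          # erased, can be programmed once
    VALID = "valid"        # holds live data
    RELEASED = "released"  # stale data, waiting for a block erase


class Block:
    def __init__(self, pages_per_block=512):
        self.pages = [PageState.FREE] * pages_per_block
        self.erase_count = 0

    def program(self, page_no):
        # A page can only be written while it is in the erased (free) state.
        if self.pages[page_no] is not PageState.FREE:
            raise ValueError("page must be erased before it can be written again")
        self.pages[page_no] = PageState.VALID

    def release(self, page_no):
        self.pages[page_no] = PageState.RELEASED

    def erase(self):
        # Erase always covers the whole block, never a single page.
        self.pages = [PageState.FREE] * len(self.pages)
        self.erase_count += 1


block = Block(pages_per_block=4)
block.program(0)
block.release(0)      # "overwrite": the old copy is released...
block.program(1)      # ...and the new copy goes to a fresh page
try:
    block.program(0)  # a released page cannot be reused directly
except ValueError as exc:
    print("rejected:", exc)
block.release(1)
block.erase()         # only a whole-block erase makes page 0 writable again
block.program(0)
```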
The read/write characteristics of an SSD are quite different from those of a mechanical disk. For mechanical disks, the minimum access granularity is the sector (512B), and data can be overwritten in place. An SSD, however, cannot modify data in place, and its minimum write granularity is much larger than that of a mechanical disk.
These two characteristics cause problems when accessing an SSD. Take the 16KB write granularity as an example. If the business I/O is very small, say about 1KB, and random, then because the offsets are far apart and the data is small, each 1KB of data occupies a 16KB page, wasting a large amount of space.
As shown in the figure, the first I/O uses one page to store its data, and the second I/O has to find a new page. By analogy, when the entire block (8MB) is full, only 512KB of data (8MB / 16) has actually been written.
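The 512KB figure can be reproduced with a small calculation. This is a sketch under the same assumptions as the text: 16KB pages, 512 pages per block, and every small I/O landing on its own page. The function name block_utilization is made up for illustration.

```python
PAGE_SIZE = 16 * 1024
PAGES_PER_BLOCK = 512


def block_utilization(io_size, page_size=PAGE_SIZE, pages_per_block=PAGES_PER_BLOCK):
    """Return (useful_bytes, consumed_bytes, ratio) once a block is filled,
    assuming every I/O of io_size bytes occupies its own page(s)."""
    pages_per_io = -(-io_size // page_size)        # ceiling division
    ios_per_block = pages_per_block // pages_per_io
    useful = ios_per_block * io_size
    consumed = ios_per_block * pages_per_io * page_size
    return useful, consumed, useful / consumed


useful, consumed, ratio = block_utilization(1024)  # 1KB random writes
print(useful // 1024, "KB of user data in", consumed // (1024 * 1024), "MB of flash")
print(f"utilization: {ratio:.2%}")
```

With 1KB writes only one sixteenth of the block holds user data; with 16KB writes the same function reports full utilization.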
As mentioned above, an SSD cannot modify data in place. If the modified content is smaller than the page size, write amplification occurs. If the block still has free pages, a new page is allocated to hold the data when it is modified. As shown in the figure, the green box is the original data. When the data is modified, a new page is allocated to store the modified data, and the original page is marked as released.
In this process, because the modified data may be very small (for example, 1KB), the original page must first be read, the new data merged into it, and the result written to the new location, so that the rest of the original page's data is not lost.
It is precisely because of these read/write characteristics that SSDs suffer from read and write amplification. Take write amplification as an example. If you write 10 bytes of data at the user level, a whole Page, that is, 16KB of data, has to be written to the SSD. If you need to modify the data, the original page has to be read first before the data can be written to a new location.
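The read, merge, and write steps and the resulting amplification can be sketched as follows. The helper names (write_amplification, modify_page) are hypothetical, a 16KB page is assumed, and a real controller does this in firmware with mapping tables rather than Python byte strings.

```python
PAGE_SIZE = 16 * 1024


def write_amplification(user_bytes, page_size=PAGE_SIZE):
    """Bytes written to flash divided by bytes the user asked to write."""
    pages = -(-user_bytes // page_size)  # ceiling division: whole pages only
    return pages * page_size / user_bytes


def modify_page(old_page, offset, new_data):
    """Read-merge-write: build the full page image that goes to a NEW page."""
    if offset < 0 or offset + len(new_data) > len(old_page):
        raise ValueError("modification must fit inside one page")
    return old_page[:offset] + new_data + old_page[offset + len(new_data):]


print(write_amplification(10))         # a 10-byte write still costs one whole page
print(write_amplification(PAGE_SIZE))  # a full-page write has no amplification

old_page = bytes(PAGE_SIZE)            # page as read from the original location
new_page = modify_page(old_page, 100, b"x" * 1024)  # change 1KB inside it
print(len(new_page))                   # a full 16KB page is written for a 1KB change
```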
Another cause of write amplification is space reclamation, or garbage collection (GC). Because data on an SSD cannot be modified in place, a page that has been used and then released is not immediately available again. The entire block must be erased before it can be reused. The problem is that it is hard to guarantee that all pages in a block have been released, so the pages that still hold valid data must be migrated to another block before the block is erased.
As shown in the figure, when most pages of two blocks in the SSD have been released (this is a simplified example; the real situation is much more complicated), the SSD controller performs garbage collection (GC). The valid data in the two blocks is moved to a new block, and once the move is complete the controller erases the old blocks. Because some data always has to be migrated when space is reclaimed, this produces additional reads and writes.
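The garbage collection step described above can be illustrated with a toy simulation. Everything here is an assumption made for illustration: blocks are plain Python lists of 4 pages, FREE and RELEASED are marker strings, and the function garbage_collect is a hypothetical name. Real controllers choose victim blocks with far more elaborate policies.

```python
FREE = "<free>"          # erased page, ready to be programmed
RELEASED = "<released>"  # stale page, reclaimable only by erasing its block


def garbage_collect(victims, destination):
    """Move valid pages from the victim blocks into destination, then erase
    the victims. Returns the number of pages copied, i.e. the extra flash
    writes caused by GC."""
    copied = 0
    for block in victims:
        for data in block:
            if data not in (FREE, RELEASED):
                slot = destination.index(FREE)  # raises ValueError if destination is full
                destination[slot] = data
                copied += 1
        block[:] = [FREE] * len(block)  # erase works on the whole block at once
    return copied


block_a = ["A", RELEASED, RELEASED, RELEASED]
block_b = [RELEASED, "B", RELEASED, "C"]
new_block = [FREE] * 4

host_writes = len(block_a) + len(block_b)  # 8 pages were programmed for the host
gc_writes = garbage_collect([block_a, block_b], new_block)

print(new_block)
print("pages copied by GC:", gc_writes)
print("write amplification:", (host_writes + gc_writes) / host_writes)
```

In this example the host caused 8 page writes and GC added 3 more, so the flash saw 11 page writes for 8 pages of host data.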
Given these characteristics, applications need to take the whole picture into account when accessing an SSD. For example, when writing data, try to fill a whole page and avoid modifying data in place. In addition, an SSD can only be erased a limited number of times: once a Block reaches its rated number of erase cycles, it can no longer be used to store data. For these reasons, existing file system and RAID designs are not well suited to SSDs. Take Ext4 as an example: when a file is accessed, information in the inode such as the access time and file size is updated frequently, which accelerates SSD wear.
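One way an application might "fill a whole page" is to buffer small records and hand them to storage only in page-sized chunks. The sketch below is a minimal illustration, not a recommended implementation: the class PageBuffer and its sink callback are hypothetical, the 16KB page size is taken from the earlier example (the real page size is usually not visible to applications), and durability concerns such as losing buffered data on a crash are ignored.

```python
class PageBuffer:
    """Collect small writes and emit them in page-sized chunks."""

    def __init__(self, sink, page_size=16 * 1024):
        self.sink = sink            # callable that receives one full page of bytes
        self.page_size = page_size
        self.buf = bytearray()

    def write(self, data):
        self.buf += data
        # Emit as many complete pages as the buffer currently holds.
        while len(self.buf) >= self.page_size:
            self.sink(bytes(self.buf[:self.page_size]))
            del self.buf[:self.page_size]

    def flush(self):
        # Pad the final partial page with zero bytes so it is still page-sized.
        if self.buf:
            self.sink(bytes(self.buf).ljust(self.page_size, b"\0"))
            self.buf.clear()


pages = []                          # stands in for the storage device
writer = PageBuffer(pages.append)
for _ in range(40):                 # forty 1KB records
    writer.write(b"x" * 1024)
writer.flush()
print(len(pages), "page writes instead of 40")
```

Forty 1KB records become three page-sized writes here, instead of forty separate small writes that would each occupy a page.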
Today we mainly introduced the internal structure of the SSD and write amplification. Understanding these characteristics is very helpful when using SSDs or developing SSD-based software. Later we will show, with examples, how to deal with them at the software level.