Pod drift is one of the most important capabilities of Kubernetes: it underpins flexible scheduling and high availability. In this article, we start from Pod drift and analyze how to choose storage for containerized applications.
First, a brief explanation of what Pod drift is. In Kubernetes, microservices are built from Pods. A Pod contains one or more containers and is the smallest unit of resource management in Kubernetes. Whether you use a ReplicaSet, a Deployment, or a StatefulSet, these Pod controllers maintain the configured number of Pod replicas. When a Pod on a node becomes unreachable, the controller detects this and repeatedly waits for it to respond. If the Pod does not respond within a certain period, the controller evicts it and tries to recreate it. If the Pod is unreachable because the node itself has failed, the Pod is recreated on another healthy node. This process is called Pod drift. The Pod drift mechanism ensures that after a single node fails, the cluster automatically recovers the affected workloads and maintains high service availability.
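The waiting period described above is governed by the Pod's tolerations for node-failure taints. The following is a minimal sketch, assuming the default taint-based eviction behavior; the workload name and image are placeholders, and the 300-second Kubernetes default for tolerationSeconds is shortened so that drift starts sooner after a node becomes unreachable:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      # When a node stops reporting, the node controller taints it and evicts
      # Pods after tolerationSeconds; lowering this value shortens the wait
      # before the controller recreates the Pod on a healthy node.
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      containers:
      - name: web
        image: nginx:1.25       # placeholder image
```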
Kubernetes was originally designed for applications that do not require persistent data: by default, a Pod's data is discarded when the Pod is deleted, and access to that data after the Pod drifts is not considered. As containerized applications have evolved, however, persistent data storage has become a new requirement, and keeping data accessible after a Pod drifts is one of the key problems Kubernetes has to solve. Next, we analyze which type of container storage best matches the data-access requirements after drift, for stateless and for stateful applications respectively.
Let's look at stateless applications first. Deployment is the most commonly used controller for stateless applications. It is designed on the assumption that Pods are interchangeable, with no differences or priorities between them, and that no session state needs to be saved. A Deployment therefore does not care whether different Pods see consistent data, and if no scheduling affinity is specified it can recreate Pods on any node. This does not mean that a Deployment never needs a persistent volume (PV), though. Take Jenkins, commonly used for development and testing, as an example: Jenkins is usually deployed with a Deployment, yet its jobs, builds, accounts, and other information are stored as files, so Jenkins still needs a PV for data persistence.
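As an illustration, here is a minimal sketch of a Jenkins Deployment that mounts a PersistentVolumeClaim for its home directory; the claim name, storage size, and other details are assumptions for the example rather than a prescribed configuration:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-home            # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins
spec:
  replicas: 1
  selector:
    matchLabels: {app: jenkins}
  template:
    metadata:
      labels: {app: jenkins}
    spec:
      containers:
      - name: jenkins
        image: jenkins/jenkins:lts
        volumeMounts:
        - name: home
          mountPath: /var/jenkins_home   # jobs, builds, and accounts live here
      volumes:
      - name: home
        persistentVolumeClaim:
          claimName: jenkins-home
```

With persistence in place, the next question is which type of storage should back the PV. Let's compare the common options one by one.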
• Using the server's local disk: if HostPath is used as the persistent directory, then after the node fails all Pods deployed on it drift to other nodes, but the data on the original node is not present on the destination node, so it cannot be accessed after the drift. If the PV type is Local PV, the Pod cannot drift at all, because a Local PV is bound to a specific node and the new node may not have the path defined in the template. Both scenarios show that the server's local disk is not the best choice under a Deployment.
• Using external block storage, such as an iSCSI SAN: because a LUN is mapped to a host one-to-one, Kubernetes uses the AttachDetachController to ensure that a PV is read and written by only one host; the controller establishes the Attach relationship between the external block-storage PV and the host and prevents other hosts from using that PV. When a node fails, the Pod needs to drift to another node, but because the PV is still attached to the failed node, the controller must wait for a timeout (MaxWaitForUnmountDuration, 6 minutes by default) before the PV can be detached from the faulty node, which greatly slows down Pod drift. Therefore, block storage is not a good choice for PVs under a Deployment.
• Using external file storage (such as NFS): file storage is shareable and supports the RWX (ReadWriteMany) access mode in Kubernetes. It does not need to be attached to a single host and can be mounted on multiple hosts directly. If a node fails and the Pod drifts to another node, the CSI driver can mount the PV on the new node immediately; there is no waiting on the PV during Pod drift, and the drift can complete within about one minute (depending on the relevant parameter configuration), greatly improving service reliability. Therefore, under a Deployment, NAS is the most Pod-drift-friendly choice for PVs (see the sketch after this list).
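To make the NFS option concrete, the following is a minimal sketch of a statically provisioned NFS PersistentVolume and a ReadWriteMany claim bound to it; the server address and export path are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteMany"]   # NFS allows simultaneous mounts on many nodes
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.0.10           # placeholder NFS/NAS address
    path: /exports/app-data        # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
  storageClassName: ""             # bind directly to the static PV above
  volumeName: nfs-pv
```

Because the volume is ReadWriteMany and requires no attach step, a rescheduled Pod can mount it on the new node right away.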
Next, let's discuss stateful applications. StatefulSet is the controller commonly used for stateful applications. Unlike a Deployment, its Pods have a primary-secondary relationship and different Pods store different data. For example, containerized MySQL is generally deployed with a StatefulSet.
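A minimal StatefulSet sketch with volumeClaimTemplates is shown below; it assumes a headless Service named mysql and a StorageClass named standard already exist, and the image and sizes are placeholders. Each replica gets its own PVC derived from the Pod name (data-mysql-0, data-mysql-1, ...), which is how "different Pods store different data" is expressed in practice:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql              # headless Service providing stable Pod DNS names
  replicas: 3
  selector:
    matchLabels: {app: mysql}
  template:
    metadata:
      labels: {app: mysql}
    spec:
      containers:
      - name: mysql
        image: mysql:8.0          # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:           # one PVC per Pod: data-mysql-0, data-mysql-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard  # placeholder StorageClass
      resources:
        requests:
          storage: 100Gi
```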
Fault-handling mechanism of StatefulSet: the primary and secondary Pods of a StatefulSet do not share data; a PV is created for each Pod and is uniquely identified by the Pod name. When a node fails and loses contact with the cluster's management nodes, the StatefulSet tries to reconnect and waits for a period of time. If the connection still cannot be re-established, the Pods on that node are marked for deletion but are not deleted immediately; they are only removed after the node comes back online, and the cluster does not create replacement Pods until then. Why is StatefulSet designed this way? Because losing contact with the management node does not necessarily mean the Pod has stopped running; rashly creating a new Pod and mapping the PV to it could cause read/write conflicts or even data corruption in the application. In this situation, if you need Pod drift to recover the service, you must remove the faulty node manually (or via automation scripts) to trigger the drift. How do different storage types affect Pod drift in this scenario?
• Using the server's local disk: because the data is stored only on the current node, the Pod cannot drift. Take MySQL as an example: it usually runs in a one-master-multiple-slave configuration. If the master node fails, then after the master-slave switchover completes, manual intervention is required to create a new instance and wait for it to finish a full data rebuild before it can rejoin the cluster as a slave. A full rebuild can take hours, and the cluster runs in a degraded, reduced-protection state during that time. For stateful applications, therefore, the server's local disk is not a good choice.
• Using external block storage: as mentioned earlier, when a single node fails the controller must wait 6 minutes for the timeout before the volume can be re-attached to the node the Pod is rescheduled to. For a StatefulSet, even if the faulty node is evicted manually, it takes at least about 7 minutes from eviction until the Pod is fully running on the new node.
• Using external file storage: when a node fails and is evicted manually, the Pod drifts to a new node and the PV is mounted on the new host automatically. The drift completes in about 1 minute, a large improvement over block storage. For stateful containers, using NAS as the storage layer lets Pods drift quickly and improves availability (a StorageClass sketch follows after this list).
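If NAS is chosen, the per-Pod PVs from the StatefulSet's volumeClaimTemplates can be provisioned dynamically from an NFS-backed StorageClass. The following is a minimal sketch, assuming the open-source csi-driver-nfs add-on is installed; the class name, NAS address, and export path are placeholders:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi                    # placeholder class name
provisioner: nfs.csi.k8s.io        # assumes the csi-driver-nfs add-on is installed
parameters:
  server: 192.168.0.10             # placeholder NAS address
  share: /exports/k8s              # placeholder export path
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1
```

Referencing this class from storageClassName in the volumeClaimTemplates shown earlier gives each replica its own NFS-backed PV that can be remounted immediately on whichever node the Pod drifts to.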
From the above comparison, we can see that NAS is the best choice for fast Pod drift for both stateless and stateful applications. In essence, a shared file system is the only way to achieve cross-node data sharing. VMware developed VMFS so that virtual machines on multiple nodes could share one file system, and built HA, FT, and other high-availability capabilities on top of it. In the Kubernetes world, the shared file system likewise needs to provide cross-node access. Enterprise NAS provides a high-performance, highly available, and highly secure shared file system, making it the best choice for container persistent storage.