Pod failover is one of Kubernetes' most important capabilities, enabling flexible scheduling and high availability. Let's look at how to select container storage through the lens of pod failover.
To start, let's briefly explain what pod failover is. In Kubernetes, microservices are built from pods. Each pod comprises one or more containers and is the smallest unit of Kubernetes resource management. Kubernetes provides a variety of controllers, such as ReplicaSet, Deployment, and StatefulSet, all of which maintain a specified number of pod replicas. If a pod on a node becomes inaccessible, the controller keeps retrying and waiting for a response. If there is still no response after a certain period, the controller evicts the pod and attempts to start a new one. If the inaccessibility is caused by a node failure, the pod is restarted on another healthy node. This process is called pod failover. It ensures automatic recovery after a single-node failure and keeps services highly available.
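The waiting period before eviction is worth a closer look. In current Kubernetes versions it is governed by taint-based eviction: pods receive default 300-second tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, and these can be shortened per workload to make failover start sooner. The following is a minimal sketch; the nginx image and the toleration values are illustrative assumptions only.

```yaml
# Minimal sketch, assuming an nginx image; toleration values are illustrative.
# By default, Kubernetes gives pods 300-second tolerations for the not-ready
# and unreachable node taints; shortening them makes eviction (and therefore
# failover) start sooner after a node failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60   # evict 60s after the node becomes unreachable
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
```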
In its early days, Kubernetes was designed for applications that did not require persistent data, and data was simply discarded by default when a pod was deleted, so data access after pod failover did not need to be considered. However, as containerized applications have evolved, persistent data storage has become indispensable, and ensuring that data remains accessible after pod failover has become a pivotal issue for Kubernetes. Next, let's explore which container storage option best serves data access after pod failover, for both stateless and stateful applications.
Let's begin with stateless applications, which are typically deployed using the Deployment controller. A Deployment assumes that its pods are identical and interchangeable, with no priority among them and no session state to maintain, so it does not concern itself with data consistency across individual pods. Without specific scheduling affinity configurations, a Deployment can start a pod on any node. However, Deployments still require persistent volumes (PVs). Take Jenkins, a tool commonly used for development and testing, as an example. Jenkins is typically deployed via a Deployment, but Jenkins jobs, builds, and accounts are stored as files and therefore need a PV for data persistence, as in the minimal sketch below.
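The sketch below shows a Jenkins Deployment that persists its home directory on a PVC. It is a minimal illustration only: the PVC name, size, and image tag are assumptions, and in practice you would also configure a StorageClass, resource limits, and a security context for write access.

```yaml
# Minimal sketch; PVC name, size, and image tag are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-home
spec:
  accessModes: ["ReadWriteOnce"]   # ReadWriteMany is possible with shared file storage
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins
    spec:
      containers:
      - name: jenkins
        image: jenkins/jenkins:lts
        volumeMounts:
        - name: home
          mountPath: /var/jenkins_home   # jobs, builds, and accounts live here
      volumes:
      - name: home
        persistentVolumeClaim:
          claimName: jenkins-home
```

With a workload like this in mind, let's compare some container storage options.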
Local disks of servers: If a hostPath PV is used as the directory for data persistence and a node becomes faulty, all pods deployed on that node fail over to other nodes. However, the destination node does not have the original node's data, so the data is inaccessible after the failover. If a local PV is used instead, the destination node may not have the path defined in the template, which can prevent the pod from failing over at all. In both cases, server local disks are not an ideal choice for Deployments.
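To make the node binding concrete, here is a sketch of a local PV; the path, node name, and storage class are illustrative. The nodeAffinity section pins the volume, and therefore any pod using it, to a single node, which is exactly why failover cannot bring the data along.

```yaml
# Sketch only: path, node name, and storage class are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage   # typically a class with volumeBindingMode: WaitForFirstConsumer
  local:
    path: /mnt/disks/vol1           # data lives only in this directory on one node
  nodeAffinity:                     # required for local PVs: pins the volume to one node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
```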
External block storage (such as iSCSI SAN): Because LUNs are typically mapped one-to-one to hosts, Kubernetes uses the Attach/Detach controller to ensure that only one node reads and writes a given PV at a time. The Attach/Detach controller records an attachment relationship between a block-storage PV and a node to prevent other nodes from using that PV. If a node becomes faulty, its pods need to fail over to another node. However, because the PV is still attached to the faulty node, the controller must wait out a 6-minute timeout (the MaxWaitForUnmountDuration setting, which is fixed at 6 minutes and cannot be modified) before detaching the PV from the faulty node, which markedly slows down pod failover. Hence, block storage is not an ideal choice for Deployments either.
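For reference, a statically provisioned iSCSI PV might look like the sketch below; the portal address, IQN, and size are placeholders, and in production a vendor CSI driver with dynamic provisioning is more common. The ReadWriteOnce access mode reflects the single-node attachment semantics described above.

```yaml
# Sketch only: portal address, IQN, and size are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: iscsi-pv
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]     # block LUN: attached to one node at a time
  iscsi:
    targetPortal: 10.0.0.10:3260     # illustrative iSCSI portal
    iqn: iqn.2023-01.com.example:storage.lun1
    lun: 0
    fsType: ext4
    readOnly: false
```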
External file storage (such as NFS): Because file storage supports shares, a PV backed by it can be configured with the ReadWriteMany (RWX) access mode in Kubernetes and mounted on multiple hosts. If a node becomes faulty, its pods can fail over to another node, and the container storage interface (CSI) driver simply mounts the PV on the destination node. There is no need to wait for PV detachment during failover, so failover can complete within about 1 minute (depending on the related parameter settings), greatly improving service reliability. Therefore, NAS is the friendliest choice for pod failover with Deployments.
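A statically provisioned NFS PV and a matching RWX claim could look like the sketch below; the server address, export path, and sizes are illustrative.

```yaml
# Sketch only: server address, export path, and sizes are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteMany"]   # shared: mountable on many nodes at once
  nfs:
    server: 10.0.0.20              # NFS (NAS) server
    path: /export/share1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""             # bind to the statically provisioned PV above
  resources:
    requests:
      storage: 100Gi
```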
Now, let's examine stateful applications, which are commonly deployed using the StatefulSet controller. Unlike a Deployment, a StatefulSet typically involves master-slave relationships between pods, and each pod holds its own data. For instance, containerized MySQL is typically deployed using a StatefulSet, as in the sketch below.
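Here is a minimal MySQL StatefulSet sketch. The image, password handling, and sizes are placeholders; the point is the volumeClaimTemplates section, from which the StatefulSet creates one PVC per pod (data-mysql-0, data-mysql-1, and so on), so each pod gets its own PV identified by the pod name.

```yaml
# Sketch only: image, password handling, and sizes are placeholders;
# a headless Service named "mysql" is assumed to exist.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: changeme            # use a Secret in practice
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:              # one PVC per pod: data-mysql-0, data-mysql-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
```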
Let's take a look at how a StatefulSet handles failures. The master and slave pods of a StatefulSet do not share data; instead, a unique PV is assigned to each pod and identified by the pod name. If a node becomes faulty and is disconnected from the cluster's management node, the StatefulSet first attempts to reconnect and waits for a period of time. If the connection still fails, the pod on the faulty node is marked for deletion but is not deleted immediately; it is only deleted after the node reconnects, and the cluster does not start a replacement pod before then. Why is the StatefulSet designed this way? Because losing contact with the management node does not necessarily mean the pod has stopped running, and starting a new pod and mapping the same PV to it could cause application read/write conflicts or even data corruption. As a result, if no automation script is in place, restoring service through pod failover requires manually evicting the faulty node. What is the impact of different storage types on pod failover in this scenario?
Local disks of servers: Because the data resides only on the local node, pods cannot fail over. Take MySQL as an example, usually configured with one master and multiple slaves. If the master fails, then after the master/slave switchover you must manually create a new instance, wait for it to finish a full data rebuild, and add it to the cluster as a slave. The cluster remains in a degraded-protection state during the full rebuild, which can take several hours. Clearly, local disks are not advisable for stateful applications.
External block storage: As mentioned earlier, if a single node becomes faulty, the controller waits out the 6-minute timeout before the volume can be detached and re-attached to the destination node where the pod is rescheduled. For StatefulSets, even with manual node eviction, roughly 7 minutes elapse from the eviction until the pod is fully started on a new node.
External file storage: After a faulty node is evicted, its pods fail over to a new node and the PV is automatically mounted there. The failover takes only about 1 minute, a significant improvement over block storage. For stateful containers, using NAS as the storage foundation enables rapid pod failover and therefore higher availability; a sketch of the wiring follows.
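As an illustration of the wiring, the sketch below assumes the open-source NFS CSI driver (provisioner nfs.csi.k8s.io); the server address, share path, and class name are placeholders, and the parameter names follow that driver.

```yaml
# Sketch, assuming the open-source NFS CSI driver (nfs.csi.k8s.io);
# server, share, and names are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-rwx
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.0.0.20        # NAS / NFS server address
  share: /export/k8s       # exported directory used for dynamic provisioning
reclaimPolicy: Retain
volumeBindingMode: Immediate
```

In the MySQL StatefulSet shown earlier, each claim in volumeClaimTemplates would then set storageClassName: nfs-rwx so that every per-pod PV is provisioned on shared file storage.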
In conclusion, for both stateless and stateful applications, NAS is the best choice for achieving rapid pod failover. In essence, cross-node sharing requires a shared file system. Just as VMware developed the Virtual Machine File System (VMFS) so that multiple virtual machines could share a file system, and then built high availability (HA) and fault tolerance (FT) on top of it, Kubernetes likewise needs a shared file system to provide cross-node access. Enterprise-class NAS storage offers high performance, availability, and resilience, making it the premier choice for containerized persistent storage.