Construction of WeBank's Next-Gen Data DR and Backup System
Panpan Hu, Database Platform Owner at WeBank
I. Abstract
WeBank, China's first digital bank, has served more than 400 million individual customers and over 5 million micro, small, and medium-sized enterprises seeking loans over the past 10 years. As the bank experiences rapid data growth, robust data backup is crucial to its IT systems, serving as the last line of defense against faults and disasters. WeBank's continuous expansion of its banking services necessitates an innovative, iterative data backup and recovery system that reduces storage costs and improves recovery efficiency.
To address this need, WeBank has developed a next-gen, ultra-large-scale database DR and backup system for its core banking systems. Built on WeBank's proprietary backup management and control platform and data archiving platform, Huawei OceanProtect storage products, and advanced Blu-ray storage systems, this new solution incorporates cutting-edge backup technologies. These include mounting of copies in native format, forever incremental backup, and patented deduplication and compression technologies, which together have boosted backup/recovery efficiency by 50%, enabled TB-level data recovery within two minutes, and reduced storage footprint by 75%. Additionally, backup copies in the DR and backup system are reused to support simulated production IT-based labs, allowing WeBank to wring more value out of backup data. In large-scale recovery verification scenarios, such as service version verification, DR drills, year-end settlement drills, and quarterly interest settlements, recovery time has been reduced from 4.5–7 days to just 1 day, improving efficiency by 77–85%. Advanced core database and DR and backup hardware and software ensure the effectiveness of the financial DR and backup system.
II. Background and Requirements of Data DR and Backup
1. Data backup compliance requirements
In China, there are regulations governing information system construction in key sectors, including finance, public services, and telecommunications. These regulations require enterprises to implement effective data protection measures to ensure data confidentiality, integrity, and availability. The regulations cover DR and data backup capabilities and aim to ensure that critical data can be quickly and effectively recovered in case of emergencies. National standards and regulations for DR and backup construction and data resilience are set out in Disaster Recovery Specifications for Information Systems (GB/T 20988-2007), Information Security Technology—Baseline for classified protection of cybersecurity (GB/T 22239-2019), Cybersecurity Law, and Data Security Law. These regulations lay a solid foundation for DR, backup, and data resilience within China's information systems, with the goal of enhancing data protection and ensuring a resilient and stable critical infrastructure.
The financial industry has responded to these regulations by developing key standards to enhance data resilience and integration. These include the Implementation Guidelines for Classified Protection of Cybersecurity of the Financial Industry (JR/T 0071-2020), Financial Application Specification of Distributed Database Technology—Disaster Recovery Requirements (JR/T 0205-2020), and Management Specification of Information System Disaster Recovery for Banks (JR/T 0044-2008). These guidelines aim to elevate the data management and security protection level within the financial sector and ensure resilient, recoverable financial data.
2. Pain points and requirements of WeBank's data DR and backup system
WeBank handled massive volumes of data with over 900 database instances in use. The daily full backup data totaled 600 TB, while daily incremental backups amounted to 50 TB. Additionally, the daily binlog data volume was 30 TB, bringing the total inventory backup data to 12 PB.
WeBank faced significant challenges in ensuring that its backup systems could deliver sufficiently high reliability and rapid recovery if faults impacted critical operations such as service version verification, end-of-day batch processing, quarterly interest settlements, and year-end settlement drills. In its day-to-day operations, WeBank was encountering several key pain points:
A growing gap between backup and recovery efficiency and business growth: Due to a significant increase in service volume, full backups took over 20 hours, single database recovery took more than 2 hours, and DR drills averaged 7 days. As a result, WeBank could not support quick service version verification and data recovery.
High construction costs because of a sharp increase in backup capacity: The three-copy policy of the Ceph cluster resulted in low disk utilization. In fact, WeBank's storage usage rate was a mere 33%. The absence of data deduplication capabilities and effective hot/cold data separation had implications for footprint, energy consumption, and O&M, driving up the TCO.
Frequent ransomware attacks and a lack of systematic protection: Recent years have seen an unprecedented increase in the frequency of ransomware attacks in the industry, resulting in significant risks to data resilience. WeBank needed to introduce systematic protection measures, including ransomware detection, anti-tampering, data encryption, and automated response mechanisms, to ensure data resilience and regulatory compliance.
To tackle these challenges, WeBank has developed a next-gen, ultra-large-scale database backup and recovery system for its core banking systems. This new solution employs WeBank's proprietary backup management and control platform and data archiving platform, Huawei data storage products, and advanced Blu-ray storage systems. It offers a comprehensive suite of data protection capabilities covering DR, backup, archiving, resilience, and data recovery. As a result, the system enhances the efficiency of data backup and recovery, reduces backup storage costs, and ensures service continuity and recoverability.
III. Solution Architecture
1. Legacy solution architecture and its disadvantages
As shown in Figure 1, WeBank's legacy backup architecture used Ceph, an open-source distributed storage system, to store backup data. Standard x86 servers fitted with SATA HDDs formed a Ceph distributed storage cluster, which kept three copies of all data to guarantee high reliability and availability. Under the backup policy, a full backup ran every Sunday, incremental backups ran from Monday to Saturday, and database binlogs were backed up in near real time, every five minutes. Under the retention policy, full backups, incremental backups, and full binlog backups were retained for the most recent three months. For backups generated between three and six months ago, full binlog backups and the full backups of the last week of each month were retained. For backups older than six months, only the full backups of the last week of each month were retained.
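The legacy schedule described above can be written down as a short sketch. The function and constant names here are illustrative assumptions and are not taken from WeBank's platform; only the Sunday full backup, the Monday to Saturday incremental backups, and the five-minute binlog interval come from the text.

```python
from datetime import date

BINLOG_INTERVAL_MINUTES = 5  # binlogs were backed up every five minutes


def legacy_backup_type(day: date) -> str:
    """Return the backup type the legacy policy ran on a given day."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return "full" if day.weekday() == 6 else "incremental"


def binlog_backups_per_day(interval_minutes: int = BINLOG_INTERVAL_MINUTES) -> int:
    """Number of binlog backup runs per instance in one day."""
    return (24 * 60) // interval_minutes
```

At a five-minute interval, each instance produces 288 binlog backup runs per day, which gives a sense of how many small jobs the scheduler has to handle on top of the full and incremental backups.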
At the time, the total volume of existing backup data had reached 12 PB, with incremental backups generating hundreds of TB of data every day, stored on a system utilizing seven Ceph storage clusters and hundreds of servers. However, as previously mentioned, the continuous growth of backup data meant that the legacy architecture's days were numbered. Challenges included inefficiencies that meant backup and recovery struggled to keep pace with service growth, significantly increased construction costs driven by rising backup capacity, and vulnerability to ransomware attacks due to a lack of systematic protection. Consequently, there was an urgent need for reconstruction and optimization.
Figure 1 Legacy data DR and backup solution architecture
2. New solution architecture
In its new data DR and backup solution, WeBank integrates professional storage devices and systems, including Huawei OceanProtect backup clusters, archive storage, and Blu-ray storage, as shown in Figure 2. The bank uses its self-developed database backup management and control platform for unified management and scheduling. The next-gen data DR and backup system significantly enhances data backup and recovery efficiency while reducing storage costs by employing advanced technologies such as data compression, deduplication, live mount, and multi-media tiered storage. Additionally, the solution incorporates a ransomware protection system with a physical isolation mechanism, ensuring golden copies of data backups and enhancing data resilience in extreme scenarios.
Figure 2 Next-gen data DR and backup solution architecture
(1) Core modules
The following outlines the key modules of the next-gen DR and backup system architecture:
Backup management and control platform: This platform integrates essential system functions, including backup scheduling, recovery, archiving, policy updates, data management, monitoring, and access control. It supports detailed backup records for monitoring and reporting, helping enterprises with auditing compliance and planning. The real-time monitoring engine tracks tasks, analyzes performance, predicts risks, and optimizes strategies. The user interface and APIs enhance ease of operation and streamline integration. Overall, this platform enhances data resilience and backup management efficiency, acting as the central hub for the entire DR and backup system by implementing resource monitoring, job scheduling, and security auditing for full-process data protection.
Database DR cluster: The cross-city DR and backup cluster for the production database cluster provides geographical redundancy and DR capabilities. It employs an active node and a passive node. The DR active node offers read-only access during regular operations, supporting query and reporting tasks, while alleviating the load on the primary database. The passive node handles regular data backups, ensuring that data can be quickly recovered to the most recent state when necessary.
OceanProtect cluster: The cluster offers critical functions to ensure high performance, reliability, and resilience for data backup and recovery. It handles high-performance data backup and recovery, live mount of backup copies, deduplication and compression of backup files, encryption, ransomware protection, and recovery data anonymization. Its scale-out architecture allows for the addition of storage nodes to expand capacity and performance, effectively meeting the demands of growing data volumes.
Simulated production verification environment: This simulation environment, based on the TDSQL simulated production cluster, is designed for secure service simulation and verification to protect production data. It supports critical financial operations such as year-end settlement, quarterly interest settlement, and major service version verification to guarantee accurate and compliant accounting. Server pooling and single-node deployment enable efficient and on-demand resource allocation at low costs, meeting service requirements in non-production environments.
Archive storage: The WeBank-developed S3-compatible storage solution features high availability, durability, low costs, and scalability. It is primarily utilized for archiving infrequently accessed warm data, thereby effectively reducing storage expenses.
Blu-ray storage: Blu-ray storage is an ideal and regulation-compliant medium for long-term retention of cold data. Historical data is migrated from object storage to Blu-ray optical discs, providing a cost-effective solution for data backup.
Ransomware protection system: The system employs Air Gap technology for physical isolation, ensuring independent storage of full backup data and full binlog backups going back one month, thus securing the retention of golden copies.
(2) Introduction to tiered storage policies
For copies that require long-term retention, WeBank's system implements a tiering policy, archiving this data to low-cost storage media. This approach conserves the resources of the high-performance backup pool in the backup appliance, while also lowering construction costs and enhancing data resilience.
The OceanProtect backup storage cluster is used to retain continuous backup snapshots and full binlog backups going back three months. OceanProtect offers high-performance data deduplication and compression, enabling frequent, rapid backup and recovery while keeping the costs under control.
For backups generated between three and six months ago, WeBank retains the full backups of the last week of each month and full binlog backups. This data is then stored in WeBank-developed archive storage via the S3 protocol as long-term backup copies of warm data by using the backup appliance.
For copies older than six months, WeBank retains full backups of the last week of each month. This data is transferred to Blu-ray storage using the backup appliance and archive storage, serving as permanent backup copies of cold data.
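The three tiers above amount to a placement rule keyed on backup type and age. The sketch below is one possible reading of that rule, not WeBank's implementation: it assumes that three and six months mean 90 and 180 days, that "the last week of each month" means a backup taken within the final seven days of a calendar month, and that the tier names are free-form labels.

```python
from datetime import date, timedelta
from typing import Optional


def in_last_week_of_month(day: date) -> bool:
    """Assumption: a day is in the last week if day + 7 falls in the next month."""
    return (day + timedelta(days=7)).month != day.month


def placement(kind: str, backup_date: date, today: date) -> Optional[str]:
    """Return the tier holding a copy, or None if the copy is no longer retained.

    kind is "full", "incremental", or "binlog" (hypothetical labels).
    """
    age_days = (today - backup_date).days
    if age_days < 0:
        raise ValueError("backup_date is in the future")
    monthly_full = kind == "full" and in_last_week_of_month(backup_date)
    if age_days <= 90:       # up to three months: everything stays on OceanProtect
        return "oceanprotect"
    if age_days <= 180:      # three to six months: warm data in archive storage
        return "archive" if (kind == "binlog" or monthly_full) else None
    return "bluray" if monthly_full else None  # older than six months: cold data
```

For example, with today set to 1 January 2025, a full backup from 25 August 2024 lands in archive storage, while a full backup from 26 May 2024 lands on Blu-ray and a binlog backup from the same day is no longer retained.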
(3) Service value of backup data in the simulated production verification environment
Conventional backup software writes backup sets in a proprietary format, so recovery typically takes hours and hinders business operations. As a result, the backup system is poorly utilized, and backup data becomes cold or dead data that incurs substantial costs. To improve backup system utilization and unlock the value of backup data, WeBank adopted native format copy mounting and copy anonymization. Together, these technologies enable minute-level recovery of service data while preventing breaches of users' private data. They are currently used in the following application scenarios:
Verification of production service versions: Native format copy mounting enables minute-level service startup in the simulated production service environment and quickly replays and verifies major service versions and batch cutoffs according to production data.
Quarterly/Annual DR drills: The databases' automatic drill orchestration function enables parallel recovery of massive data volumes. Data recovery across the entire network can be completed within one day, which quickly verifies the reliability of backup copies and generates drill reports that meet security and regulatory requirements.
Rapid data recovery: In a simulated production environment, data lost due to accidental deletion or other reasons can be restored within 2 minutes, slashing business downtime, enhancing instant data availability, and ensuring data integrity as well as business continuity.
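The parallel recovery used in DR drills can be pictured as a pool of workers that restore instances concurrently and collect the outcomes into a drill report. This is a minimal sketch under stated assumptions: restore_instance is a hypothetical placeholder for whatever call actually mounts or restores a copy, and the report layout is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def restore_instance(name: str) -> bool:
    """Hypothetical placeholder: pretend instances ending in '-bad' fail to restore."""
    return not name.endswith("-bad")


def run_drill(instances, max_workers: int = 8) -> dict:
    """Restore all instances in parallel and return a simple drill report."""
    report = {"ok": [], "failed": []}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(restore_instance, name): name for name in instances}
        for future in as_completed(futures):
            name = futures[future]
            try:
                succeeded = future.result()
            except Exception:
                succeeded = False  # a crashed restore counts as a failed one
            report["ok" if succeeded else "failed"].append(name)
    report["ok"].sort()
    report["failed"].sort()
    return report
```

Calling run_drill(["db-1", "db-2", "db-3-bad"]) returns a report with db-1 and db-2 under "ok" and db-3-bad under "failed". A real orchestrator would also need to throttle by storage bandwidth and record timings, which this sketch omits.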
IV. POC Rollout and Problem Resolution
1. POC rollout milestones
(1) In October 2023, an in-depth study was conducted on available next-gen backup solutions to evaluate products from different vendors, including their technical performance, cost-effectiveness, and compatibility with existing systems, which helped identify the most suitable backup technology for the bank's current and future needs.
(2) In December 2023, database versions fit for the bank's environment were deployed to complete function verification of nine categories and 19 sub-categories in the production service scenario.
(3) In May 2024, tests on the OceanProtect device's performance (backup/recovery/live mount), functions (data reduction/active-active/data resilience/archive), and reliability (disk/controller) were completed. All test items performed as expected.
(4) In September 2024, a POC test was rolled out for the OceanProtect device in WeBank's DR and backup environment to verify its functionality and performance against established benchmarks, and ensure that it supports high-performance backup and recovery for more than half of the 400+ DR environment instances.
(5) In December 2024, following over two months of intensive work, the adaptation of the OceanProtect Appliance's performance and functionality in the DR environment was completed. Over 400 database instances were successfully backed up and the device ran smoothly. Currently, both new and legacy backup systems are operated in parallel, with the second OceanProtect device undergoing installation and deployment.
2. Core test cases
The core test cases, totaling 19 items across nine categories, have been summarized from three dimensions: functionality, performance, and availability, as shown in the table below:
3. Key POC data
(1) Backup write bandwidth: The backup write bandwidth determines overall backup efficiency. The production environment of WeBank consists of more than 900 database instances, and a full backup of them must be completed within 24 hours. Multiple tests have verified that a single OceanProtect device meets the required average backup write bandwidth.
(2) Data deduplication and compression ratio: In the database backup scenario, where data files are primarily incrementally modified, there is typically mass duplicate data that requires a high deduplication and compression ratio. Multiple tests show that the deduplication and compression ratio of the backup data reaches 20:1, meeting expectations.
(3) Average backup duration: Tests demonstrate that concurrent backups of 40 database instances, totaling 30 TB, achieve the expected average duration.
(4) Average backup and recovery duration: Tests demonstrate that concurrent backup and recovery of 40 database instances, totaling 30 TB, achieve the expected average duration.
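To give a feel for what the bandwidth requirement in item (1) implies, the sketch below computes the average write rate needed to move a given volume within a time window. Pairing the 600 TB full backup volume quoted earlier with the 24-hour window is an illustrative assumption; the source does not state the required or measured bandwidth, and the result is a logical (pre-deduplication) rate in decimal units.

```python
def required_write_bandwidth_gb_s(total_tb: float, window_hours: float) -> float:
    """Average write bandwidth in GB/s needed to write total_tb within window_hours."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    total_gb = total_tb * 1000           # decimal units: 1 TB = 1000 GB
    return total_gb / (window_hours * 3600)


# Illustrative pairing of figures from the text: 600 TB within 24 hours.
example_gb_s = required_write_bandwidth_gb_s(600, 24)
```

Under these assumptions the average comes to a little under 7 GB/s. Peak demand would be higher, because backups are rarely spread perfectly evenly across the window.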
4. Typical problem resolution
The database backup and recovery system itself is complex. In WeBank's large-scale backup scenario comprising over 900 instances, the POC process inevitably encountered problems. Two typical problems are described below.
Problem 1: The CPU of the scheduling module was saturated, causing scheduling tasks to be suspended.
In the initial POC test, only 40 database instances were registered for verification, with an agent being deployed for each instance to facilitate communication with the OceanProtect device. During this phase, scheduling tasks were functioning normally. However, as the number of registered instances was gradually increased to 300, the CPU usage of the scheduling module spiked, leading to suspended scheduling tasks.
WeBank and Huawei's R&D team worked together and traced the issue to frequent heartbeat reporting (once per minute) from the agents to the OceanProtect device. The heartbeats consumed excessive CPU resources in the scheduling module, resulting in an avalanche-like effect. Huawei's R&D team released a patch version that optimized the modules and code with a high CPU impact. Additionally, the agents' heartbeat reporting frequency was reduced from once per minute to once every five minutes. With over 300 registered instances, the CPU usage of the scheduling module was then kept within 5%.
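The effect of the interval change on message volume is simple arithmetic, sketched below. The function name is an illustrative assumption, and the calculation covers only the number of heartbeats, not the per-heartbeat CPU cost that the patch also reduced.

```python
def heartbeats_per_minute(agents: int, interval_seconds: float) -> float:
    """Heartbeats arriving at the scheduler per minute, one agent per instance."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return agents * 60 / interval_seconds


before = heartbeats_per_minute(300, 60)    # once per minute
after = heartbeats_per_minute(300, 300)    # once every five minutes
```

For 300 agents, the heartbeat volume drops to one fifth of its previous level, which leaves headroom as the number of registered instances grows toward the full production fleet.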
Problem 2: Low data deduplication ratio resulted in insufficient backup bandwidth and efficiency.
After integrating over 300 database instances, 40 were randomly selected each time for backup to verify backup bandwidth and efficiency. Results showed that the deduplication ratio remained consistently low (around 2:1), resulting in insufficient overall backup bandwidth and efficiency.
Joint analysis with Huawei's R&D team found that the 40 randomly selected instances varied each time, so most of them were performing a full backup for the first time. Without historical backup snapshots to serve as a reference for duplicate data, a low deduplication ratio was the expected behavior rather than a defect.
To verify backup efficiency in a real-world scenario, 40 instances were later fixed for repeated backups. As a result, the data deduplication ratio improved significantly, and both backup bandwidth and efficiency reached the expected targets.
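The difference between the two test setups can be reproduced with a toy deduplicating store. This sketch uses fixed-size chunks and SHA-256 fingerprints purely for illustration; it is not OceanProtect's algorithm, and the chunk size, instance sizes, and change rate are invented.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size


class DedupStore:
    """Toy store that keeps one copy of each distinct chunk."""

    def __init__(self):
        self.fingerprints = set()
        self.logical_bytes = 0   # bytes sent by backup jobs
        self.stored_bytes = 0    # bytes actually kept

    def backup(self, data: bytes) -> None:
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            self.logical_bytes += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in self.fingerprints:
                self.fingerprints.add(digest)
                self.stored_bytes += len(chunk)

    def ratio(self) -> float:
        return self.logical_bytes / self.stored_bytes


def make_chunk(label: str) -> bytes:
    """Deterministic chunk whose content is unique to its label."""
    return hashlib.sha256(label.encode()).digest() * (CHUNK_SIZE // 32)


def make_instance(name: str, n_chunks: int = 100) -> list:
    return [make_chunk(f"{name}/{i}") for i in range(n_chunks)]


def different_instances_each_round(rounds: int = 5) -> float:
    """Every round backs up an instance the store has never seen."""
    store = DedupStore()
    for r in range(rounds):
        store.backup(b"".join(make_instance(f"db{r}")))
    return store.ratio()


def same_instance_each_round(rounds: int = 5) -> float:
    """The same instance is backed up repeatedly; one chunk changes per round."""
    store = DedupStore()
    chunks = make_instance("db0")
    for r in range(rounds):
        if r > 0:
            chunks[r] = make_chunk(f"db0/changed/{r}")
        store.backup(b"".join(chunks))
    return store.ratio()
```

In this toy model the first scenario never rises above 1:1, because there is no history to deduplicate against, while the second climbs with every repeated backup. The absolute numbers are artifacts of the synthetic data; only the trend mirrors what the POC observed.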
V. Key Technological Innovations
1. Full-stack solution
The next-gen data DR and backup system combines WeBank's proprietary backup management and control platform with software and hardware such as the TDSQL distributed database, OceanProtect Appliance, proprietary S3 object storage, Blu-ray archive storage, Kunpeng servers, and EulerOS, making it widely compatible with both conventional and new backup application ecosystems. In addition, as a standard backup system architecture, it is highly replicable and well suited to wider adoption.
2. Native format backup and live mount
The OceanProtect device supports backup in native format, storing data in a form that applications can read directly, while data reduction and encryption are handled by the underlying storage layer. During incremental backup, incremental and full data are merged into a complete copy that applications can read directly. This slashes the recovery time of TB-level data sets from two hours to only two minutes, significantly improving recovery efficiency.
Figure 3 Native format backup and live mount
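The idea of merging an incremental backup into the previous full copy, so that every backup point is a complete and directly usable copy, can be shown with a toy block map. This is a conceptual sketch only: the block-number keys and the synthesize_full name are assumptions, and a real appliance would do this at the storage layer with snapshots and block references rather than Python dictionaries.

```python
from typing import Dict

Blocks = Dict[int, bytes]  # block number -> block contents (toy model)


def synthesize_full(previous_full: Blocks, incremental: Blocks) -> Blocks:
    """Build a complete copy from the previous full copy plus changed blocks.

    Unchanged blocks are referenced, not copied: the new map points at the
    same bytes objects as the previous copy.
    """
    merged = dict(previous_full)
    merged.update(incremental)  # changed or new blocks override old ones
    return merged


sunday_full = {0: b"A", 1: b"B", 2: b"C"}
monday_incremental = {1: b"B2", 3: b"D"}
monday_full = synthesize_full(sunday_full, monday_incremental)
```

Because monday_full is already a complete copy, recovering to Monday means mounting it as-is instead of replaying a chain of incrementals, which is the property that makes minute-level recovery plausible.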
3. Data deduplication and compression
By utilizing backup data preprocessing (separating backup metadata and data), multilayer inline variable-length deduplication, and feature-based compression algorithms (data cleansing, rearrangement, and deduplication based on data flow features), the system achieves a 75% overall reduction in backup storage footprint compared to conventional data reduction technologies, significantly lowering the TCO of the bank's backup data.
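Variable-length deduplication generally relies on content-defined chunking, where chunk boundaries depend on the data itself rather than on fixed offsets. The sketch below uses a simple gear-style rolling hash to show why this matters; it is a generic illustration with invented parameters, not Huawei's patented algorithm.

```python
import random

_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
CUT_MASK = (1 << 6) - 1  # a boundary roughly every 64 bytes (illustrative)
MIN_CHUNK = 16           # avoid very small chunks


def variable_chunks(data: bytes) -> list:
    """Split data where the rolling hash of recent bytes hits a fixed pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old_chunks: list, new_chunks: list) -> float:
    """Fraction of new chunks that already exist among the old chunks."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known) / len(new_chunks)


_data_rng = random.Random(1)
original = bytes(_data_rng.getrandbits(8) for _ in range(20000))
shifted = b"INSERT!" + original  # the same data with 7 bytes inserted in front

variable_shared = shared_fraction(variable_chunks(original), variable_chunks(shifted))
fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(shifted))
```

After a small insertion, fixed-size chunking shifts every boundary and finds almost nothing to deduplicate, whereas content-defined boundaries realign within a chunk or two and most chunks match again. That resilience to shifted data is the reason variable-length schemes suit backup streams.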
4. Data resilience and ransomware protection
The next-gen data DR and backup system fills the long-standing gap in data-layer protection within the traditional protection framework. If the network and host layers fail to prevent an attack and ransomware encrypts data stored in the system, the storage layer employs Air Gap, anti-tampering, and detection and analysis technologies to ensure that at least one golden copy is protected against tampering in a resilient, attack-free environment. This clean data is available for safe recovery, minimizing the impact on business systems.
VI. Achievements and Benefits
1. Achievements
After nearly a year of research and POC verification, the bank has successfully implemented a parallel operational phase for the next-gen DR and backup system alongside the legacy system. One OceanProtect device has been deployed, integrating over 400 database instances. Full, incremental, and binlog backups have been completed successfully, along with multiple instances of database live mount, batch recovery, and full recovery in various business scenarios. Currently, the bank is preparing for the deployment of another OceanProtect device, with plans to complete backup integration for all database instances and achieve a full transition from the legacy to the new system by the second quarter of 2025.
2. Project benefits
The next-gen DR and backup system resolves the pain points of the conventional system in terms of cost, efficiency, backup data utilization, and data resilience.
(1) Reduced costs: By utilizing new technologies such as data compression, deduplication, and tiered storage, the deduplication and compression ratio of backup data reaches 20:1, contributing to an overall 95% reduction in backup storage footprint. Additionally, backup data stored for 3–6 months is moved to archive storage, and data backed up over six months ago is moved to Blu-ray storage, further reducing storage costs. As a result, the overall end-to-end TCO is reduced by 50%.
(2) Improved efficiency: Overall backup and recovery efficiency is significantly improved thanks to efficient unified backup management and scheduling, backup and recovery architecture for mission-critical data, as well as technologies such as live mount and tiered storage. The full backup duration is reduced from 22 hours to 10 hours, while full recovery time is shortened from 4.5–7 days to just 1 day, achieving a 77% to 85% higher efficiency. Likewise, emergency data recovery time is reduced from 2–4 hours to approximately 2 minutes.
(3) Higher utilization of the simulated production environment, unlocking the value of backup data: Live mount and faster backup data recovery enable quicker snapshot-based recovery of production data for service verification in the simulated production environment. For scenarios such as verification of major service versions, the legacy system required several hours to restore the backup data of multiple database instances to the simulated production environment through data copying, because the decompression and copying involved were time-consuming. With the new architecture, live mount allows instances to be mounted directly to the simulated production environment in two minutes, with no decompression or copying, so services are immediately available for version verification. For scenarios requiring large-scale full data recovery, such as DR drills, year-end settlements, and quarterly interest settlements, the improved recovery performance reduces recovery duration from seven days to about one day. With this higher efficiency, cold backup data is activated and turned into hot service data, unlocking the value of the backup system.
(4) Improved resilience and compliance assurance: A ransomware protection mechanism is established to retain the golden copy of data, preventing data breaches or damage caused by ransomware attacks and avoiding potential financial loss and compensation costs. The system complies with the Chinese Cybersecurity Law and Data Security Law, avoiding penalties and extra costs caused by security risks.
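The percentages quoted in the benefits above follow from two small formulas, sketched here so the figures can be checked. The function names are illustrative; the inputs are the values stated in the text.

```python
def footprint_reduction(dedup_ratio: float) -> float:
    """Fraction of storage saved at a given deduplication and compression ratio."""
    return 1 - 1 / dedup_ratio


def time_reduction(before: float, after: float) -> float:
    """Fraction of time saved; before and after must use the same unit."""
    return (before - after) / before


storage_saved = footprint_reduction(20)      # 20:1 ratio
recovery_low = time_reduction(4.5, 1)        # 4.5 days down to 1 day
recovery_high = time_reduction(7, 1)         # 7 days down to 1 day
full_backup = time_reduction(22, 10)         # 22 hours down to 10 hours
emergency = time_reduction(120, 2)           # 2 hours (120 minutes) down to 2 minutes
```

A 20:1 ratio corresponds to a 95% smaller footprint, and the recovery figures come out at roughly 78% and 86%, consistent with the quoted 77% to 85% when rounded down. The full backup window shrinks by about 55%, and emergency recovery from the two-hour end of the range by about 98%.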