Data storage Technology: Distributed

Data storage Technology: Distributed

When the amount of data that needs to be stored becomes large and cannot be stored on the disk of one machine, the data needs to be stored on multiple machines.
Data distribution

Ways to distribute data to different nodes include:

Hash distribution: For example, consistent hash distribution according to the primary key of the data. Sequential distribution: the data is divided into ordered ranges according to the primary key, and the data in each ordered range is stored on a node. According to the hash/sequential range, the load Balanced distribution: The distributed storage system automatically identifies nodes with high load (machine load value, CPU, memory, disk, network, QPS, thread and other resource usage), and migrates some of the data it serves to other machines to achieve automatic load balancing.
Each cluster of a distributed storage system generally has a master control node. Load balancing is achieved through master control node scheduling: the working node sends node load-related information to the main control node through heartbeat packets (sent regularly); the main control node calculates the load of the working node and the data that needs to be migrated, and generates migration tasks Put it in the migration queue and wait for execution.

Many people may think that the master control node will become the bottleneck of the distributed system, and the task P2P architecture (no central node) architecture has more advantages. However, this is not the case. Most of the mainstream distributed storage systems have master control nodes, because the master control node architecture is easier to maintain and can support thousands of clusters. Regarding the optimization of the bottleneck problem of the master control node, we will discuss it later.

The master control node is mainly used for global scheduling and metadata management (shard location information, etc.).

Of course, there are also many storage systems with no central architecture, such as Redis.
Data node high availability

Data backup/replication

In a distributed storage system, as the cluster size becomes larger and larger, the probability of failure becomes larger and larger. Faults occur every day in large-scale clusters, among which the highest probability of failures are: single machine failure and disk failure.

In order to ensure the high reliability and high availability of the distributed storage system, multiple copies of the data of each node need to be replicated and backed up:

There is generally only one primary copy, which can provide read/write services; there can be multiple backup copies, which do not serve external parties/provide read-only services. If the primary copy fails, a standby copy needs to be elected to become the new primary copy. This operation is called “election”.
The new master copy can be elected through the master control node lease agreement, distributed locks, election protocols such as Paxos protocol (usage example: ZooKeeper), etc.

Data replication between the primary and standby replicas is mainly achieved through synchronization of operation logs (Commit Log): the primary replica first synchronizes the operation log to the standby copy, and the standby copy plays back the operation log, and notifies the primary copy after completion. (For more information about operation logs, you can read “Data Storage Technology: Single Machine”)

Distributed storage systems synchronize data to multiple storage nodes through replication protocols and ensure data consistency among multiple copies.

There are two types of replication protocols: strong synchronous replication and asynchronous replication.

Strong synchronous replication: The user’s write request is required to be synchronized to the standby copy before success can be returned; if there is more than one standby copy, it can also be required to synchronize to at least several standby copies.
It can ensure strong consistency between the primary and backup copies, but when the backup copy fails, it may also block the normal write service of the storage system and affect system availability.

Asynchronous replication: The primary copy does not need to wait for a response from the standby copy. It only needs a successful local modification to notify the client of a successful write operation. At the same time, the primary replica pushes client modification operations to other replicas through asynchronous mechanisms, such as separate replication threads.
The availability is relatively good, but the data consistency between the primary and backup copies is not guaranteed. If the primary copy suffers an unrecoverable failure, the last part of the update operation may be lost.

According to the CAP principle, consistency and availability are contradictory. When designing a storage system, there is a trade-off between consistency and availability. According to actual needs, the following modes are available:

Maximum protection mode: Strong synchronous replication protocol. The strong consistency of primary and secondary copy data is guaranteed, but the availability is low. Maximum performance mode: asynchronous replication protocol. Performance and availability are guaranteed, but data may be lost. Maximum available mode: a compromise between the above two modes. Under normal circumstances, it is equivalent to the maximum protection mode; if the network between the active and backup devices fails, it switches to the maximum performance mode. This pattern makes a good trade-off between consistency and availability

isomorphic replication

Normally, during data backup, storage nodes are divided into several groups. The nodes in each group serve exactly the same data. One node is the primary replica node and the other nodes are backup replica nodes.

Since nodes within the same group serve the same data, such a system becomes a homogeneous system.

Heterogeneous replication

The problem with a homogeneous system is that when a replica node fails, the system needs to automatically add a new replica, and the amount of data that needs to be migrated is too large. Assuming that the data served by each storage node is 1TB and the internal transmission bandwidth is limited to 20MB/s, then adding a replica node will require copying data time of 1TB / 20MB/s = 50000s, which is about ten hours; due to the time required to copy data The probability of storage node failure is very high during the process, so such an architecture is difficult to automate and is not suitable for large-scale distributed storage systems.

Heterogeneous systems divide data into many shards of similar size, and multiple copies of each shard can be distributed to any storage node in the cluster. If a node fails, the original service will be restored by the entire cluster instead of a few fixed storage nodes. Since the entire cluster participates in the recovery process of the failed node, the fault recovery time is very short; and the larger the cluster size, the more obvious the advantages will be.

When the system detects a fault, it automatically uses backup data for fault recovery.

Fault detection and recovery based on master control node:

Heartbeat mechanism: The master control node sends a heartbeat packet to the working node every once in a while, such as 1 second. If the working node does not fail, it can respond normally to the heartbeat packet of the master control node; otherwise, if the master control node does not receive a response from the working node after retrying for a certain number of times, it will consider that the working node has failed. Problem with the heartbeat mechanism: The master control node does not receive the heartbeat response from the working node, and cannot determine that the working node has failed and stopped the service. During the operation of the system, various errors may occur, such as network failures and worker nodes being too busy and unable to respond to heartbeat packets. Since the master control node determines that a working node has failed, it often needs to migrate the services on the working node to other servers in the cluster. In order to ensure strong consistency, it is necessary to ensure that working nodes that do not respond to heartbeat packets no longer provide services, otherwise they will Multiple servers may provide the same data at the same time, resulting in data inconsistency. Lease mechanism: As long as the local time difference between machines is not much different (in practice, clock synchronization is performed between machines), fault detection can be performed through the lease mechanism to avoid problems with the heartbeat mechanism. The lease mechanism is an authorization with a timeout period. The master control node can issue leases to working nodes. The leases held by working nodes are only allowed to provide services within the validity period, otherwise they will actively stop services. When the lease of a working node is about to expire, you need to reapply to the master control node for a lease extension. If a working node fails or the network between it and the master control node fails, its lease will expire, so that the master control node can ensure that the working node no longer provides services (due to clock errors between machines, if the lease is valid for If it is 10 seconds, the master control node needs to add a prerequisite for determining the working node failure time. For example, the lease of the working node can be considered expired only when it is 11 seconds, and its services can be safely migrated to other servers.

Select the master based on distributed lock:

The working nodes of the primary and secondary replicas compete for the same distributed lock. The node that obtains the distributed lock becomes the primary replica. After the primary replica working node fails, the distributed lock is automatically released; other backup replicas can compete for the distributed lock again and become the primary replica. Distributed locks that provide services to the outside world also need to achieve high availability. Commonly used high-availability coordination services include: Apache ZooKeeper, Google Chubby, etc.

Usage example: Deployment across computer rooms. When the main computer room fails, in addition to manual switching to the backup computer room, the distributed lock service can also be used to implement automatic overall switching.

Cluster master election protocol self-recovery:

A cluster is formed between the primary and backup replica nodes to implement the self-election protocol. Reference: Each node in the ZooKeeper cluster implements strong data consistency and cluster high availability based on the Paxos election protocol. Advantages: No maintenance is required between the master control node and the working node. Lease, failure of the master control node will not affect the working nodes Disadvantages: Complex engineering implementation

Master control node backup

In addition to the need for working nodes to achieve high availability through data backup, automatic fault detection and recovery, the master control node itself also needs to achieve high availability.

The master control node also requires cluster deployment, and cluster master selection can also be achieved through heartbeat leases, distributed locks, etc.:

Master control node bottleneck

Memory bottleneck:
In order to respond to requests quickly, the master control node needs to maintain the metainformation of all data nodes in memory. As the number of nodes increases, memory capacity may become a performance bottleneck. Solution: Add a layer of metadata nodes between the master control machine and the worker machine. Each metadata node only maintains a part of the metadata instead of the entire distributed storage system. ; In this way, the master control node only needs to maintain the metadata of the metadata node. QPS bottleneck:
If the client must first request the master control node to obtain routing information every time before accessing data, the QPS of the master control node may become very high. Solution: The client locally caches a copy of the routing information; it can be done every once in a while/when When the routing information in the master control node is updated, the local cache is updated.

Distributed storage system paradigm

Distributed file system

The master control node needs to save file directory metadata information; files are split/merged into fixed-size data fragments, and distributed and heterogeneous are stored in data nodes.

Large files: split into fixed-size data blocks as backup fragments (such as chunks in Google File System). Small files: merged into one large file data block as backup fragments (such as blocks in Taobao File System)

Distributed key-value system

Data distribution generally uses consistent hashing


Central control node: Config Server, which adopts the form of one master and one backup to ensure reliability. Data service node: Data ServerConfig Server is responsible for managing all Data Servers, maintaining their status information and data distribution strategies; Data Server provides various data services to the outside world, and Report its own status to Config Server in the form of heartbeat


Redis adopts a decentralized architecture

Distributed: Redis Cluster
Data partitioning: Redis Cluser uses virtual slot partitioning (an optimization of consistent hashing). Request routing: In cluster mode, when Redis receives any key-related command, it first calculates the slot corresponding to the key, and then finds the corresponding node based on the slot. , if the node is itself, process the key command; otherwise, reply with a MOVED redirection error and notify the client to request the correct node. Node communication: Using the P2P Gossip protocol, nodes continuously communicate with each other to exchange information. After a period of time, all nodes will know the complete information of the cluster. High availability: Redis Sentinel
Ensure fault detection and automatic failover of Redis nodes

Distributed database

Distributed databases (sub-databases and sub-tables) can be implemented at the application layer or at the middleware layer.

The advantage of the middleware layer implementation is that it is transparent to the application. The application queries the middleware layer just like querying a single table in a database. The disadvantage is that in addition to maintaining the middleware, it also needs to consider the HA/load balancing of the middleware, etc., which increases deployment and Difficult to maintain. The application layer implementation can be completed on the JDBC driver layer and DAO framework layer, such as MyBatis/Hibernate. For example, Dangdang’s sharding-jdbc is a JDBC driver layer implementation, and Alibaba’s cobar-client is based on the DAO framework iBatis.

big data storage

The underlying storage of the Hadoop big data family mainly uses the HDFS file system.

There are two key components of HDFS, one is DataNode and the other is NameNode.

DataNode is responsible for the storage and read and write operations of file data. HDFS divides file data into several data blocks (Blocks). Each DataNode stores a part of the data blocks, so that the files are distributed and stored throughout the HDFS server cluster. In order to ensure high availability of data, HDFS will copy a data block into multiple copies (the default is 3 copies), and store multiple copies of the same data block on different servers or even different racks; that is, HDFS The data shards stored in the DataNodes of the file system are heterogeneous. NameNode is responsible for the metadata (MetaData) management of the entire distributed file system, that is, file path names, data block IDs, storage locations and other information, which is equivalent to the role of the file allocation table (FAT) in the operating system. Application clients (Clients) can access these data blocks in parallel, allowing HDFS to achieve parallel access to data on a server cluster scale, greatly improving access speed.

cloud storage

Cloud storage is a cluster composed of a large number of ordinary storage devices that is virtualized into an easily scalable, elastic, transparent, and scalable storage resource pool through the network, and is provided to authorized users on demand with a unified interface; authorized users can use the network to Access and manage storage resource pools at will, and pay per use.

You may also like...