Data storage technology: standalone

Data storage technology: standalone

Data Storage Technology: Single Machine
Public account “EnjoyMoving”

Generally speaking, business application system = data storage + logical processing + network transmission. The data storage is the foundation of the system.

In business application systems, data that needs to be persisted to disk is generally stored in these storage systems:

Relational database: mysql, key-value system (NOSQL): redis, search engine: ElasticSearch, file system: independent files such as pictures and videos, logs, etc.

The functions of the storage system include: add, delete, search, and modify. Among them, reading is divided into random reading and sequential scanning.

Disk data storage structures are:

Random search type:
B-tree, LSM tree hash sequential scan type:

disk mechanism

Memory vs disk: Memory data will be lost after the computer is powered off and restarted, so the data files to be saved need to be persisted to the disk.

Disk reading and writing involves searching for data on the disk. The data address generally consists of cylinder number, disk surface number and block number. Disk reading relies on mechanical movement, which means that the moving arm first moves to the specified cylinder according to the cylinder number, then determines the track on the disk according to the disk number, and finally moves the specified track segment under the head according to the block number. Start reading and writing.

The entire process mainly consumes three parts of time: seek time, latency time, and transmission time, which respectively represent the time consuming of positioning the cylinder and the time consuming of moving the track segment specified by the block number to the magnetic head. , the time it takes to transfer data to memory.

Among them, the most time-consuming part of the entire disk IO is the search time (ie, disk seek). The search time overhead of sequential access is relatively small, while random access requires a longer disk seek time.

The sequential reading bandwidth of a 15,000-rpm SATA disk can reach more than 100MB/s, so the sequential reading of 1MB of data takes 100KB / 100MB/s * 1000 = 1ms; while the disk seek takes 10ms.

It can be seen that the bandwidth of disk sequential reading and writing of data is still good; the performance bottleneck of the storage system mainly lies in random disk reading and writing:
The time taken to read 100KB data from the disk = disk seek time 10ms + data read time 1ms = 11ms

If random disk reading and writing can be avoided as much as possible and the number of disk seeks can be reduced as much as possible, the performance of the storage system can be greatly improved. For example, convert random reading and writing into sequential reading and writing, reduce random reading and writing on disk through caching, etc.

The B-tree storage engine is a persistent implementation of the B-tree. It not only supports the addition, deletion, search, and modification of single records, but also supports sequential scanning.

Usually used to implement relational databases.
Balanced binary tree, B-tree, B+ tree

Let’s first recall what balanced binary trees, B-trees, and B+ trees look like:

Balanced binary tree: A balanced search tree, an ordered binary tree that is balanced by rotating nodes after each insertion

B-tree: multi-way balanced search tree

B+ tree: multi-way balanced search tree

The biggest difference from B-tree is that the internal nodes do not store data and are only used for indexing; all data (or records) are stored in leaf nodes; the leaf nodes themselves are linked in order from small to large according to the size of the key; And each leaf node stores pointers to adjacent leaf nodes.
why B-Tree

We know that ordered binary trees can be used for fast search; while B+ trees and balanced binary trees have little difference in search speed in terms of indexing data: although they store the same amount of data, the height of a 100-order B+ tree is definitely much lower than that of a balanced binary tree. , but the B+ tree may need to compare many times at a node to find the node of the next level, but the balanced binary tree only needs to compare once to go down one level. So taken together, the indexing speeds of the two are almost the same.

So why do storage engines generally use B+ trees instead of balanced binary trees? :

This is because the premise that the index speeds are equivalent is that the data structures of both are located in memory. When we want to index a record on the disk, we have to read the contact data into the memory every time to make a comparison judgment; compared with the disk IO time, the time spent in the process of comparing the data in the memory is basically negligible. Therefore, the number of disk IOs should be reduced as much as possible. To reduce the number of disk IOs, you must compress the height of the tree and make the tall and thin tree as short and fat as possible. Therefore, the B+ tree that supports multi-way search is definitely better than the balanced binary tree with two-way search, because in the B+ tree A node can store many keys. The same number of keys generates far fewer nodes in a B+ tree than in a binary tree. The difference in the number of nodes is equal to the number of disk IOs.

At the same time, organizing data in the form of a B+ tree on the disk has natural advantages: since disk IO is a very expensive operation, the computer operating system has optimized this – pre-reading. Because the principle of local read-ahead shows: when accessing an address data, the adjacent data will also be accessed soon. Therefore, every time disk IO is performed, not only the data at the current disk address is loaded into the memory, but adjacent data is also loaded into the memory buffer. The data read by each disk IO is called a page. The size of a page is related to the operating system, usually 4KB or 8KB. This means that when reading data within a page, a disk IO actually occurs.

Assume that we now choose 4KB as the transmission unit between memory and disk, then when we design the B+ tree, we use 4KB as the node size for both index nodes and leaf nodes. Assuming the size of a record is 1KB, then a leaf node can store 4 records. As for the index node (the size is also 4KB), since only the key value and the corresponding pointer need to be stored, an index node may be able to store 100~150 branches, so the entire tree can store approximately 4 million records:

Consider the B+ tree shown in the figure above. If this B+ tree is stored on disk, how do we find the corresponding record according to the record key?

1) The first disk IO: read the root node of this B+ tree into the memory, then index it in the memory, and find the corresponding branch according to the key;

2) Second disk IO: Read the second-level index node pointed by this branch into the memory, and then index it in the memory. Also find the corresponding branch based on the key, and this branch points to the leaf node. ;

3) The third disk IO: Read this leaf node into the memory and determine whether this record exists.

In this way, we only need to pass three disk IO times to find the corresponding key record from 4 million records, which can be said to be very fast.

The reason for the speed is that no data is stored in the index node, only keys and pointers are stored, so an index node can store a large number of branches, and an index node only requires one IO to read into the memory.
LSM (Log Structure Merge Tree)

LSM tree is actually an idea, an optimization method for the B-tree storage engine, which uses batch dump technology to avoid random disk writes (multiple disk seeks). Compared with B-trees, LSM trees sacrifice part of the read performance to greatly improve write performance.

Storage systems implemented through LSM trees include: levelDB, HBase, etc.

The design idea of the LSM tree is very simple: keep the incremental modifications to the data in the memory, and write these modification operations to the disk in batches after reaching the specified size limit. However, it is a little troublesome to read. It requires merging historical data on disk and recent modification operations in memory.

Therefore, the writing performance is greatly improved. When reading, you may need to check whether the memory is hit first, otherwise you need to access more disk files.

LSM trees can have many optimization strategies in actual projects, such as using Bloom filters to quickly determine whether a key exists, doing some additional indexes to help find records faster, etc.

The LSM tree consists of two or more storage structures. Taking two storage structures as an example, one storage structure is resident in the memory and is called C0 tree. Specifically, it can be any data structure that facilitates key value search, such as red-black tree, map and the like. Another storage structure resides in the hard disk, called the C1 tree, and its specific structure is similar to the B-tree.

1) Insert operation:

When inserting a new record, first insert the operation log into the log file so that it can be restored later. The log is inserted in append form, so it is very fast;

Insert the index of the new record into C0, which is done in memory and does not involve disk IO operations;

When the size of C0 reaches a certain threshold or at regular intervals, the records in C0 are rolled and merged into disk C1; for the case of multiple storage structures, when the size of C1 becomes larger and larger, it is merged into C2, and so on. Merge Ck all the way up.

2) Merge operation:

Read the data of C1 into the memory, merge and sort it with the data of C0 in the memory (similar to “merge sort”), and the merged result is appended to the new position of the C1 tree in the disk (note that it is appended and not change the original node).

3) Update operation:

The update operation can be converted into an insert operation directly in the memory C0 tree:

If the record exists in the memory C0 tree, it can be replaced directly;

If the record is on disk, you can overwrite the old record on disk with the new record in the in-memory C0 tree while performing the merge asynchronously.

Because reading data gives priority to reading data in the memory C0 tree, it can be guaranteed that the data read is the latest.

4) Delete operation:

In order to perform the deletion operation quickly, it is mainly implemented through marking: mark the record to be deleted in the memory C0 tree, and then delete the corresponding record when the merge is performed asynchronously.

If the record to be deleted does not exist in the C0 tree, a node for the record is generated in the C0 tree and marked for deletion. In this way, when searching, you can know in the memory that the record has been deleted, without having to look for it on the disk.
Hash storage

The hash storage engine is a persistent implementation of the hash table and supports addition, deletion, modification, and random read operations, but does not support sequential scanning.

Usually used to implement key-value storage systems, such as nosql.

In the hash table structure of the hash storage engine, the hash index can be stored in the memory hash structure, and the key-value data is stored in the disk.

Each item of the memory hash index contains three pieces of information used to locate disk value data: file number (file_id), value position in the file (value_pos), and value length (value_size).

1) During the read operation, obtain the corresponding value positioning information from the memory hash table through the key, and then read the value_size bytes starting from the value_pos of the file corresponding to the file_id from the disk to obtain the value.

2) During a write operation, first append the key-value record to the end of the active data file (file), and then update the memory hash table. When the data in an active data file reaches the size limit, the file becomes an old data file; new records are appended to the end of the newly created active data file.

3) During the update operation, in order to avoid random reading and writing of the disk, the “update record” operation can be replaced by “directly adding new records + using the timestamp to determine which record of the same key is the latest”.

Therefore, each read operation includes a memory read operation and a disk seek read operation; the write operation includes a sequential disk write (append) and a memory operation.

Regular merges:

After the records in the key-value system are deleted or updated, the original records will become junk data. Therefore, the data in the old data file needs to be scanned regularly, multiple operations on the same key are merged based on the principle of retaining only the latest record, and the merged data is written into the newly generated data file.

Data Recovery:

When the machine is restarted after a power outage, the hash index in the memory will be lost. At this time, the memory hash structure can be reconstructed by scanning the data in the disk file (file). This is a very time-consuming process if the data file is large.
Therefore, the speed of rebuilding the memory hash index table can be improved through the index file (hint file): simply put, the index file is the result file generated by dumping the in-memory hash index table to disk. In this way, when rebuilding the memory hash table, there is no need to scan all data files, but only need to read and rebuild the data in the index file line by line.

Logs can be divided into two major categories:

One type is the business log that we usually print in the code, which is used to record system operation information and locate problems; the other type is the operation log (commit log), which is used to optimize disk read and write performance, system failure recovery, etc.

The storage of logs on disk is generally written to files in an appending manner.

Business logs will not be discussed here; below we will mainly talk about the operation log (commit log) which is particularly important for the storage system.
Operation log

Operation logs are divided into: redo log (redo log), rollback log (undo log) and rollback/redo log (undo/redo log)

Rollback log records the state before the transaction is modified; redo log records the modified state of the transaction. For example: transaction T performs an operation of adding 10 to X in the table. Initially, X=5, and after update, X=15. Then the undo log is recorded as <T, , the undo/redo log is recorded as <T, X, 5, 15>.

The constraint rule for writing operation logs is: “log first”, that is, before modifying element

The reason for “log first” is that the data in the memory is modified first, so the user can read the modified results immediately. Once a failure occurs between completing the memory modification and writing the operation log, the latest modification operation cannot be Recovery; however, previous users may have read the modified results, which may cause data inconsistency.

The process of writing operation log is as follows:

1) Write the operation log to the log file on the disk by append writing 2) Apply the modification operation of the operation log to the memory 3) Return the success or failure of the operation

The function of operation log:

1) Optimize disk write performance by converting random writes into sequential writes

Take the database as

For example, in order to ensure the consistency of the database, database operations need to be persisted to the disk. If each operation randomly updates a certain data block on the disk, system performance will be very poor. Therefore, each database operation can be recorded sequentially through the operation log and executed in memory. The data in the memory is regularly refreshed to the disk through a background thread, thereby converting random write disk data requests into sequential write operation log requests.

2) System failure recovery (redo log)

Based on the “log first” principle, all operations on data must first have operation logs.

Therefore, if the machine is powered off and restarted, the data in the memory that has not been dropped to the disk will be lost; at this time, fault recovery can be performed through the redo log. You only need to read the modification operations in the log file from beginning to end and apply these operations one by one. To the data in the memory/disk, the data can be recovered.

3) Implement transaction operation rollback (rollback log)

Taking the database as an example, a failure may occur during the operation of the database. At this time, some transactions may be executed halfway but not submitted. When the system is restarted, the data needs to be able to be restored to a consistent state, i.e. either the entire transaction is committed or rolled back.

Since the undo log records the state before the data change, the data can be restored to the state before the transaction through the undo log.

Operation log optimization:

1. Group Commit: mainly for redo logs

The storage system requires that the operation log be flushed to the disk before the data in the memory can be updated. If each transaction requires the log to be flushed to the disk immediately, limited by disk IO, this operation will still be the bottleneck of transaction concurrency, and the system throughput will be It will be very bad.

Therefore, the storage system often has an option to flush the disk immediately. For applications with high consistency requirements, it can be set to flush immediately. Correspondingly, for applications with low consistency requirements, it can be set to not require immediate flushing. To flush, first cache the operation log into the memory buffer, and periodically flush it to the disk through a background thread. For example, when any of the conditions of “the amount of data in the log buffer exceeds a certain size/the time since the last flush to the disk exceeds a certain time” is met. , flush the operation log to disk. There is a problem with this approach. If the storage system fails unexpectedly, the last part of the update operation may be lost. Therefore, if the redo log of a certain transaction has not been fully flushed to the disk and the machine is powered off and restarted, the transaction can be restored through the undo log. Roll, discarding the transaction.

The idea of group commit is similar to that of regular disk flushing. By merging the disk flushing actions of multiple transaction operation logs, the number of disk flushes can be reduced. However, the difference between group commit and regular flushing to the disk is that the group commit technology must ensure that the operation log is successfully flushed to the disk before the write operation can be returned successfully. This may sacrifice the latency of write transactions, but greatly improves the throughput of the system.

Taking InnoDB as an example, when InnoDB changes any data, it will write a redo log to the memory log buffer. InnoDB flushes the contents of the buffer to the disk log file when the buffer is full, when the transaction commits, or every second (either condition is met). When InnoDB’s Group Commit is turned on, you can submit multiple transactions in one disk IO operation through group commit:

In the InnoDB log system, each redo log has an lsn (Log Sequence Number), and the lsn is monotonically increasing. Each transaction performing an update operation will contain one or more redo logs. When each transaction copies the log to the memory buffer log_sys_buffer (log_sys_buffer is protected by log_mutex), it will obtain the current largest lsn, so it can be guaranteed that the lsn of different transactions will not be repeated. . Assume that the maximum lsn of the logs of the three transactions Trx1, Trx2 and Trx3 are lsn1, lsn2 and lsn3 respectively (lsn1<lsn2<lsn3). They are submitted at the same time. Then if the Trx3 log first obtains the log_mutex for placement, it can be Also clear the log [lsn1—lsn3], so that Trx1 and Trx2 do not need to request disk IO again, thus enabling multiple transactions to be submitted in one disk IO operation. The basic process for group submission is as follows:

1) Get log_mutex2) If flushed_to_disk_lsn (lsn that has been flushed) >= lsn, it means that the log has been flushed, jump to 53) If current_flush_lsn (lsn that is flushing) >= lsn, it means that the log is being flushed, jump to After turning to 5, enter the waiting state 4) Flush and sync the logs smaller than lsn 5) Exit log_mutex

For more information about InnoDB group submission, please refer to: [Illustration of MySQL] MySQL group.

2. Checkpoint: mainly for redo logs

By recording operation logs, the storage system can save the data in the memory and immediately return success to the user; and then slowly flush the data in the memory to the disk through the background thread. However, if the system fails/restarts after a power outage, the data in the memory will be lost. At this time, you need to rely on the redo log to restore the data in the memory: when recovering from the fault, all redo logs need to be played back, which is less efficient; if the redo log is larger than If there are too many, the fault recovery time will be very long.

Checkpoint technology: The system regularly dumps the data in the memory to the disk in an easy-to-load form, and records the log playback point at the checkpoint moment. In the future, fault recovery only needs to play back the log playback at the checkpoint moment. Redo log after clicking.

You may also like...