LSM-Tree is the underlying data structure of many NoSQL database engines, such as LevelDB and HBase. Following the LSM-Tree design described in Designing Data-Intensive Applications, this article walks through a mini database with about 500 lines of core code, so the principles of the database can be understood by combining theory with practice.
Previously, a database was implemented based on a Hash index. It has two limitations: the entire Hash table must fit in memory, and range queries are inefficient.
In the logs of the Hash index database, keys are stored in the order they are written, and for the same key a later occurrence takes precedence over an earlier one. The order of keys in the log therefore does not matter, which makes writing easy. However, since duplicate keys are not controlled, storage space is wasted, and initializing the data takes longer as a result.
Now, change the requirements for writing the log: the written keys must be ordered, and the same key may appear only once in a log. This type of log is called an SSTable, which has the following advantages over Hash index logs:
1) It is simpler and more efficient to merge multiple log files.
Since the logs are ordered, merge sort can be applied to the files. In other words, you can read multiple input files at the same time, compare the first key of each file, and copy the smallest one to the output file in sequence. If there are duplicate keys, only the value from the latest log is retained, and the old ones are discarded.
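To make the merge concrete, here is a minimal sketch (not part of TinyKvStore) of merging two sorted runs, with the newer run winning on duplicate keys. The two TreeMaps stand in for the already-decoded contents of an older and a newer SSTable; real code would stream from the files instead of holding them in memory.
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class SsTableMergeSketch {

    /** Merge two sorted runs; on duplicate keys, keep the entry from the newer run. */
    public static TreeMap<String, String> merge(TreeMap<String, String> olderRun,
                                                TreeMap<String, String> newerRun) {
        TreeMap<String, String> out = new TreeMap<>();
        Iterator<Map.Entry<String, String>> a = olderRun.entrySet().iterator();
        Iterator<Map.Entry<String, String>> b = newerRun.entrySet().iterator();
        Map.Entry<String, String> ea = a.hasNext() ? a.next() : null;
        Map.Entry<String, String> eb = b.hasNext() ? b.next() : null;
        while (ea != null || eb != null) {
            //Decide which run currently holds the smallest key
            int cmp = ea == null ? 1 : eb == null ? -1 : ea.getKey().compareTo(eb.getKey());
            if (cmp < 0) {
                out.put(ea.getKey(), ea.getValue());
                ea = a.hasNext() ? a.next() : null;
            } else if (cmp > 0) {
                out.put(eb.getKey(), eb.getValue());
                eb = b.hasNext() ? b.next() : null;
            } else {
                //Duplicate key: the newer run wins, the older copy is discarded
                out.put(eb.getKey(), eb.getValue());
                ea = a.hasNext() ? a.next() : null;
                eb = b.hasNext() ? b.next() : null;
            }
        }
        return out;
    }
}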
2) When querying a key, there is no need to keep an index of every key in memory.
For example, imagine we need to find the key handiwork, whose location is not recorded in memory. Because the SSTable is ordered, handiwork, if it exists, must lie between handbag and handsome, so only the portion of the log between the offsets of handbag and handsome needs to be scanned. This is the idea of a sparse index: keep the offsets of only some keys in memory and scan a small interval of the file for the rest.
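The same lookup can be expressed with Java's TreeMap, which is roughly how the sparse index in the code later in this article works (the keys and offsets below are made up for illustration):
import java.util.TreeMap;

public class SparseIndexLookupSketch {
    public static void main(String[] args) {
        //Only some keys are indexed; the value is the byte offset of that key's segment in the SSTable file
        TreeMap<String, Long> sparseIndex = new TreeMap<>();
        sparseIndex.put("handbag", 102134L);
        sparseIndex.put("handsome", 104667L);

        String key = "handiwork";
        //Last indexed key <= "handiwork" and first indexed key > "handiwork"
        Long scanFrom = sparseIndex.floorEntry(key) == null ? null : sparseIndex.floorEntry(key).getValue();
        Long scanTo = sparseIndex.higherEntry(key) == null ? null : sparseIndex.higherEntry(key).getValue();
        //If "handiwork" exists at all, it must lie between these two offsets
        System.out.println("scan from offset " + scanFrom + " to offset " + scanTo);
    }
}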
We know that keys arrive in arbitrary order when they are written, so how can we ensure that the keys in an SSTable are ordered? A simple and convenient way is to first save them in an in-memory red-black tree, which keeps its keys sorted, and then write the tree's contents to the log file in order.
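In Java, TreeMap is backed by a red-black tree, and that is what the memory table in the code below uses. A tiny illustration: keys inserted in arbitrary order come back out sorted, ready to be written as an SSTable.
import java.util.TreeMap;

public class MemTableSketch {
    public static void main(String[] args) {
        //TreeMap is a red-black tree, so iteration order is always key order
        TreeMap<String, String> memTable = new TreeMap<>();
        memTable.put("handsome", "v1");
        memTable.put("handbag", "v2");
        memTable.put("handiwork", "v3");
        //Writing the entries in iteration order produces a sorted SSTable segment
        memTable.forEach((k, v) -> System.out.println(k + " -> " + v));
        //Prints: handbag -> v2, handiwork -> v3, handsome -> v1
    }
}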
The workflow of the storage engine is as follows:
1) When a write comes in, add it to the in-memory ordered tree (the memory table).
2) When the memory table exceeds a threshold, write it out to disk as an SSTable file; since the tree is ordered, the keys in the file are ordered too.
3) When a read comes in, look in the memory table first, then in the newest SSTable file, then in older ones in turn.
4) From time to time, merge and compact the SSTable files in the background, discarding overwritten and deleted values.
The algorithm above is the LSM-Tree (Log-Structured Merge-Tree), which is built on the idea of merging log files. Storage engines based on this principle of merging and compacting sorted files are usually called LSM storage engines. This is also the underlying principle of databases such as HBase and LevelDB.
We already know the implementation algorithm of the LSM-Tree. However, there are still many design issues to be considered in the specific implementation. Next, I will select some key designs for analysis.
What should the memory table store as its value? Should it store the raw data directly, or the write commands (including set and rm)? This is the first design problem we face. Let's hold off on the answer and look at the next problem first.
When the memory table reaches a certain size, it is written to a log file for persistence. This is easy if writes are simply disabled during the process. However, what if the memory table must keep serving read and write requests properly while it is being written to the file?
One solution is to switch the current memory table pointer to a new memory table instance B while memory table A is being persisted. After the switch, A must be read-only and only B may be written; otherwise there is no guarantee that persisting memory table A is an atomic operation.
Therefore, if writes are forbidden while the memory table is being persisted, the value can store the raw data directly. However, if we want to persist the memory table without blocking writes, the value must store commands. In particular, a delete cannot physically remove the key from the read-only old table or from older SSTables; it can only be recorded as an rm command (a tombstone) that overrides older set commands for the same key. Since we pursue high performance and do not want to block writes, the value is saved as a command. HBase is designed this way for the same reason.
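Conceptually, the stored commands look roughly like the sketch below. The exact classes live in TinyKvStore, so the fields and constructors here are an approximation; what matters is that a write is recorded as a set command and a delete as an rm tombstone, rather than mutating raw data in place.
//A rough sketch of the stored commands (the real classes are in TinyKvStore)
public interface Command {
    String getKey();
}

class SetCommand implements Command {
    private final String key;
    private final String value;

    SetCommand(String key, String value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public String getKey() { return key; }

    public String getValue() { return value; }
}

//A delete is recorded as a tombstone that overrides older set commands for the same key
class RmCommand implements Command {
    private final String key;

    RmCommand(String key) { this.key = key; }

    @Override
    public String getKey() { return key; }
}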
Moreover, if the memory table exceeds the threshold and needs to be persisted while the previous persistence has not finished yet, we must wait for the previous persistence to complete before starting the current one. In other words, memory table persistence can only run serially.
To read files efficiently, we need to design the file format carefully.
The following is the SSTable log format I designed:
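Reconstructed from the read and write code shown later, an SSTable file is laid out roughly as follows:
[data part]        consecutive JSON segments, each holding at most partSize commands keyed by their keys
[sparse index]     a JSON-serialized TreeMap<String, Position> mapping the first key of each segment to its (start, len) in the file
[table meta info]  six longs appended at the end: partSize, dataStart, dataLen, indexStart, indexLen, version
A data segment written in JSON looks roughly like {"handbag":{"key":"handbag","value":"8786"},"handiwork":{"key":"handiwork","value":"13"}}; the exact field names depend on how the command objects are serialized.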
The log format above is a mini implementation and is simpler than HBase's log format, which makes the principle easier to understand. I also write the files in JSON format so that they are easy to read. In production, however, efficiency comes first, and the data would be compressed to save storage space.
The code I wrote is available in TinyKvStore. Below, I analyze the key parts. There is a lot of detailed code; if you only care about the principle, you can skip this part. If you want to understand the implementation, keep reading.
Persisting a memory table to an SSTable means writing the memory table's data to a file in the log format above. What the SSTable stores are the data commands, including set and rm. As long as we know the latest command for a key, we know that key's state in the database.
/**
 * Convert from memory table to SSTable
 * @param index
 */
private void initFromIndex(TreeMap<String, Command> index) {
    try {
        JSONObject partData = new JSONObject(true);
        tableMetaInfo.setDataStart(tableFile.getFilePointer());
        for (Command command : index.values()) {
            //Process set command
            if (command instanceof SetCommand) {
                SetCommand set = (SetCommand) command;
                partData.put(set.getKey(), set);
            }
            //Process rm command
            if (command instanceof RmCommand) {
                RmCommand rm = (RmCommand) command;
                partData.put(rm.getKey(), rm);
            }
            //When the segment size is reached, write out a data segment
            if (partData.size() >= tableMetaInfo.getPartSize()) {
                writeDataPart(partData);
            }
        }
        //If any data is left after the traversal (the tail does not necessarily reach the segment size), write it to the file
        if (partData.size() > 0) {
            writeDataPart(partData);
        }
        long dataPartLen = tableFile.getFilePointer() - tableMetaInfo.getDataStart();
        tableMetaInfo.setDataLen(dataPartLen);
        //Save the sparse index
        byte[] indexBytes = JSONObject.toJSONString(sparseIndex).getBytes(StandardCharsets.UTF_8);
        tableMetaInfo.setIndexStart(tableFile.getFilePointer());
        tableFile.write(indexBytes);
        tableMetaInfo.setIndexLen(indexBytes.length);
        LoggerUtil.debug(LOGGER, "[SsTable][initFromIndex][sparseIndex]: {}", sparseIndex);
        //Save the file meta information
        tableMetaInfo.writeToFile(tableFile);
        LoggerUtil.info(LOGGER, "[SsTable][initFromIndex]: {},{}", filePath, tableMetaInfo);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
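writeDataPart is not shown above. In essence it writes one JSON segment, records a sparse index entry pointing at that segment, and clears the buffer so the next segment starts empty. A plausible sketch, as a method of the same SsTable class (the Position constructor shown here is an assumption):
private void writeDataPart(JSONObject partData) throws IOException {
    byte[] partDataBytes = partData.toJSONString().getBytes(StandardCharsets.UTF_8);
    long start = tableFile.getFilePointer();
    tableFile.write(partDataBytes);
    //Only the first key of the segment goes into the sparse index, which is what keeps the index small
    String firstKey = partData.keySet().iterator().next();
    sparseIndex.put(firstKey, new Position(start, partDataBytes.length));
    //Reset the buffer so the next segment starts empty
    partData.clear();
}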
The write format is derived backward from how it will be read, to make reading easier. For example, tableMetaInfo is written sequentially at the end of the file but read back to front from the file end; that is why version is written last. Since it is the first field to be read, it becomes easier to upgrade the log format later. Without trying these tricks yourself, it is hard to see why they are done this way.
/**
 * Write the meta information to a file
 * @param file
 */
public void writeToFile(RandomAccessFile file) {
    try {
        file.writeLong(partSize);
        file.writeLong(dataStart);
        file.writeLong(dataLen);
        file.writeLong(indexStart);
        file.writeLong(indexLen);
        file.writeLong(version);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
/**
 * Read the meta information from the file, reading backwards in the order it was written
 * @param file
 * @return
 */
public static TableMetaInfo readFromFile(RandomAccessFile file) {
    try {
        TableMetaInfo tableMetaInfo = new TableMetaInfo();
        long fileLen = file.length();
        file.seek(fileLen - 8);
        tableMetaInfo.setVersion(file.readLong());
        file.seek(fileLen - 8 * 2);
        tableMetaInfo.setIndexLen(file.readLong());
        file.seek(fileLen - 8 * 3);
        tableMetaInfo.setIndexStart(file.readLong());
        file.seek(fileLen - 8 * 4);
        tableMetaInfo.setDataLen(file.readLong());
        file.seek(fileLen - 8 * 5);
        tableMetaInfo.setDataStart(file.readLong());
        file.seek(fileLen - 8 * 6);
        tableMetaInfo.setPartSize(file.readLong());
        return tableMetaInfo;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
When loading an SSTable from a file, only the sparse index needs to be loaded into memory, which saves memory. The other parts, such as the data segments, can be read on demand when querying.
/**
 * Restore the SSTable from a file into memory
 */
private void restoreFromFile() {
    try {
        //Read the meta information first
        TableMetaInfo tableMetaInfo = TableMetaInfo.readFromFile(tableFile);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][tableMetaInfo]: {}", tableMetaInfo);
        //Read the sparse index
        byte[] indexBytes = new byte[(int) tableMetaInfo.getIndexLen()];
        tableFile.seek(tableMetaInfo.getIndexStart());
        tableFile.read(indexBytes);
        String indexStr = new String(indexBytes, StandardCharsets.UTF_8);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][indexStr]: {}", indexStr);
        sparseIndex = JSONObject.parseObject(indexStr,
                new TypeReference<TreeMap<String, Position>>() {
                });
        this.tableMetaInfo = tableMetaInfo;
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][sparseIndex]: {}", sparseIndex);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
When querying data from an SSTable, the first step is to find, from the sparse index, the interval where the key may be located. Then read the data in that interval according to the positions recorded in the index and search it. If the key is found, its data is returned; otherwise, null is returned.
/**
 * Query data from the SSTable
 * @param key
 * @return
 */
public Command query(String key) {
    try {
        LinkedList<Position> sparseKeyPositionList = new LinkedList<>();
        Position lastSmallPosition = null;
        Position firstBigPosition = null;
        //From the sparse index, find the last position whose key is <= the target key and the first position whose key is > the target key
        for (String k : sparseIndex.keySet()) {
            if (k.compareTo(key) <= 0) {
                lastSmallPosition = sparseIndex.get(k);
            } else {
                firstBigPosition = sparseIndex.get(k);
                break;
            }
        }
        if (lastSmallPosition != null) {
            sparseKeyPositionList.add(lastSmallPosition);
        }
        if (firstBigPosition != null) {
            sparseKeyPositionList.add(firstBigPosition);
        }
        if (sparseKeyPositionList.size() == 0) {
            return null;
        }
        LoggerUtil.debug(LOGGER, "[SsTable][query][sparseKeyPositionList]: {}", sparseKeyPositionList);
        Position firstKeyPosition = sparseKeyPositionList.getFirst();
        Position lastKeyPosition = sparseKeyPositionList.getLast();
        long start = 0;
        long len = 0;
        start = firstKeyPosition.getStart();
        if (firstKeyPosition.equals(lastKeyPosition)) {
            len = firstKeyPosition.getLen();
        } else {
            len = lastKeyPosition.getStart() + lastKeyPosition.getLen() - start;
        }
        //If the key exists, it must be within this interval, so only data within the interval is read to reduce I/O
        byte[] dataPart = new byte[(int) len];
        tableFile.seek(start);
        tableFile.read(dataPart);
        int pStart = 0;
        //Read the partition data segment by segment
        for (Position position : sparseKeyPositionList) {
            JSONObject dataPartJson = JSONObject.parseObject(new String(dataPart, pStart, (int) position.getLen()));
            LoggerUtil.debug(LOGGER, "[SsTable][query][dataPartJson]: {}", dataPartJson);
            if (dataPartJson.containsKey(key)) {
                JSONObject value = dataPartJson.getJSONObject(key);
                return ConvertUtil.jsonToCommand(value);
            }
            pStart += (int) position.getLen();
        }
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
The startup process is simple: load the configuration, initialize the state, and, if data recovery is needed, restore the data from the WAL to the memory table.
/**
 * Initialization
 * @param dataDir Data directory
 * @param storeThreshold Persistence threshold
 * @param partSize Data partition size
 */
public LsmKvStore(String dataDir, int storeThreshold, int partSize) {
    try {
        this.dataDir = dataDir;
        this.storeThreshold = storeThreshold;
        this.partSize = partSize;
        this.indexLock = new ReentrantReadWriteLock();
        File dir = new File(dataDir);
        File[] files = dir.listFiles();
        ssTables = new LinkedList<>();
        index = new TreeMap<>();
        //If the directory is empty, there is no SSTable to load
        if (files == null || files.length == 0) {
            walFile = new File(dataDir + WAL);
            wal = new RandomAccessFile(walFile, RW_MODE);
            return;
        }
        //Load the SSTables from newest to oldest
        TreeMap<Long, SsTable> ssTableTreeMap = new TreeMap<>(Comparator.reverseOrder());
        for (File file : files) {
            String fileName = file.getName();
            //Recover data from the staged WAL; walTmp is usually left behind by an exception during SSTable persistence
            if (file.isFile() && fileName.equals(WAL_TMP)) {
                restoreFromWal(new RandomAccessFile(file, RW_MODE));
            }
            //Load the SSTable
            if (file.isFile() && fileName.endsWith(TABLE)) {
                int dotIndex = fileName.indexOf(".");
                Long time = Long.parseLong(fileName.substring(0, dotIndex));
                ssTableTreeMap.put(time, SsTable.createFromFile(file.getAbsolutePath()));
            } else if (file.isFile() && fileName.equals(WAL)) {
                //Load the WAL
                walFile = file;
                wal = new RandomAccessFile(file, RW_MODE);
                restoreFromWal(wal);
            }
        }
        ssTables.addAll(ssTableTreeMap.values());
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
For write operations, first acquire the write lock, then save the data to the WAL and the memory table. In addition, a check is made: if the threshold is exceeded, the memory table is persisted. For simplicity, I run the persistence directly on the calling thread instead of using a thread pool, but this does not affect the overall logic. The code for set and rm is similar, so rm is not repeated here.
@Override
public void set(String key, String value) {
    try {
        SetCommand command = new SetCommand(key, value);
        byte[] commandBytes = JSONObject.toJSONBytes(command);
        indexLock.writeLock().lock();
        //Save data to WAL first
        wal.writeInt(commandBytes.length);
        wal.write(commandBytes);
        index.put(key, command);
        //When the size exceeds the threshold, the memory table must be persisted
        if (index.size() > storeThreshold) {
            switchIndex();
            storeToSsTable();
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}
To switch the memory table and its associated WAL: first acquire the lock, then create a new memory table and a new WAL while staging the old memory table and old WAL, and finally release the lock. From then on, the new memory table accepts writes while the old memory table is effectively read-only.
During persistence, the old memory table is written sequentially into a new SSTable, after which the staged memory table and WAL are deleted.
/**
 * Switch the memory table: create a new memory table and stage the old one
 */
private void switchIndex() {
    try {
        indexLock.writeLock().lock();
        //Switch the memory table
        immutableIndex = index;
        index = new TreeMap<>();
        wal.close();
        //Switch the WAL after switching the memory table
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
        if (!walFile.renameTo(tmpWal)) {
            throw new RuntimeException("Failed to rename file: walTmp");
        }
        walFile = new File(dataDir + WAL);
        wal = new RandomAccessFile(walFile, RW_MODE);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}
/**
 * Save data to an SSTable
 */
private void storeToSsTable() {
    try {
        //SSTable files are named by timestamp so that the names are incremental
        SsTable ssTable = SsTable.createFromIndex(dataDir + System.currentTimeMillis() + TABLE, partSize, immutableIndex);
        ssTables.addFirst(ssTable);
        //Delete the staged memory table and WAL_TMP after persistence is complete
        immutableIndex = null;
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
The query operation follows the algorithm described earlier:
@Override
public String get(String key) {
    try {
        indexLock.readLock().lock();
        //Look it up in the memory table first
        Command command = index.get(key);
        //If not found, try the immutable memory table, which may currently be in the process of being persisted to an SSTable
        if (command == null && immutableIndex != null) {
            command = immutableIndex.get(key);
        }
        if (command == null) {
            //If it is not in either memory table, search the SSTables from newest to oldest
            for (SsTable ssTable : ssTables) {
                command = ssTable.query(key);
                if (command != null) {
                    break;
                }
            }
        }
        if (command instanceof SetCommand) {
            return ((SetCommand) command).getValue();
        }
        if (command instanceof RmCommand) {
            return null;
        }
        //Null means the key does not exist
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.readLock().unlock();
    }
}
Without building a database ourselves, it is difficult to understand why it is designed this way: for example, why the log format is designed as it is, and why the database stores data operations instead of the data itself.
The database described in this article is still quite simple and can be optimized in many ways, for example, making persistence asynchronous, compressing and compacting the log files, and filtering queries with Bloom filters.
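As an illustration of the last point, a Bloom filter answers "definitely not present" or "possibly present" for a key, so a negative answer lets a query skip an SSTable entirely without touching the disk. A minimal hand-rolled sketch (in practice a library implementation such as Guava's BloomFilter would be used):
import java.util.BitSet;

/** A minimal Bloom filter sketch: "might contain" or "definitely does not contain". */
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    private int hash(String key, int seed) {
        //Derive several hash positions from the key; floorMod keeps the index non-negative
        int h = key.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(hash(key, i));
        }
    }

    /** false means the key is certainly absent, so the SSTable scan can be skipped */
    public boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }
}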
Designing Data-Intensive Applications