LSM-Tree is the underlying data structure of many NoSQL database engines, such as LevelDB and HBase. Following the LSM-Tree design described in Designing Data-Intensive Applications, this article walks through a mini database with about 500 lines of core code, so the principles of the database can be understood by combining theory with practice.
Previously, a database was implemented based on a Hash index. It has two limitations: the entire Hash table must fit in memory, and range queries are inefficient.
In the logs of the Hash index database, keys are stored in the order they are written, and for the same key a later occurrence takes precedence over an earlier one. The order of keys in the log therefore does not matter, which makes writing easy. However, since duplicate keys are not controlled, storage space is wasted, and initializing the data takes longer as a result.
Now, change the requirements for writing the log: the written keys must be ordered, and the same key may appear only once in a log. This type of log is called an SSTable, which has the following advantages over Hash index logs:
1) It is simpler and more efficient to merge multiple log files.
Since the logs are ordered, merge sort can be applied to the files. In other words, you can read multiple input files at the same time, compare the first key of each file, and copy the smallest one to the output file in sequence. If there are duplicate keys, only the value from the latest log is retained, and the old ones are discarded.
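To make the merge concrete, here is a minimal sketch (not part of TinyKvStore) of merging two sorted runs, with the newer run winning on duplicate keys. The two TreeMaps stand in for the already-decoded contents of an older and a newer SSTable; real code would stream from the files instead of holding them in memory.
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class SsTableMergeSketch {

    /** Merge two sorted runs; on duplicate keys, keep the entry from the newer run. */
    public static TreeMap<String, String> merge(TreeMap<String, String> olderRun,
                                                TreeMap<String, String> newerRun) {
        TreeMap<String, String> out = new TreeMap<>();
        Iterator<Map.Entry<String, String>> a = olderRun.entrySet().iterator();
        Iterator<Map.Entry<String, String>> b = newerRun.entrySet().iterator();
        Map.Entry<String, String> ea = a.hasNext() ? a.next() : null;
        Map.Entry<String, String> eb = b.hasNext() ? b.next() : null;
        while (ea != null || eb != null) {
            //Decide which run currently holds the smallest key
            int cmp = ea == null ? 1 : eb == null ? -1 : ea.getKey().compareTo(eb.getKey());
            if (cmp < 0) {
                out.put(ea.getKey(), ea.getValue());
                ea = a.hasNext() ? a.next() : null;
            } else if (cmp > 0) {
                out.put(eb.getKey(), eb.getValue());
                eb = b.hasNext() ? b.next() : null;
            } else {
                //Duplicate key: the newer run wins, the older copy is discarded
                out.put(eb.getKey(), eb.getValue());
                ea = a.hasNext() ? a.next() : null;
                eb = b.hasNext() ? b.next() : null;
            }
        }
        return out;
    }
}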
2) When querying a key, there is no need to keep an index of every key in memory.
For example, imagine we need to find the key handiwork, whose location is not recorded in memory. Because the SSTable is ordered, handiwork, if it exists, must lie between handbag and handsome, so only the portion of the log between the offsets of handbag and handsome needs to be scanned. This is the idea of a sparse index: keep the offsets of only some keys in memory and scan a small interval of the file for the rest.
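The same lookup can be expressed with Java's TreeMap, which is roughly how the sparse index in the code later in this article works (the keys and offsets below are made up for illustration):
import java.util.TreeMap;

public class SparseIndexLookupSketch {
    public static void main(String[] args) {
        //Only some keys are indexed; the value is the byte offset of that key's segment in the SSTable file
        TreeMap<String, Long> sparseIndex = new TreeMap<>();
        sparseIndex.put("handbag", 102134L);
        sparseIndex.put("handsome", 104667L);

        String key = "handiwork";
        //Last indexed key <= "handiwork" and first indexed key > "handiwork"
        Long scanFrom = sparseIndex.floorEntry(key) == null ? null : sparseIndex.floorEntry(key).getValue();
        Long scanTo = sparseIndex.higherEntry(key) == null ? null : sparseIndex.higherEntry(key).getValue();
        //If "handiwork" exists at all, it must lie between these two offsets
        System.out.println("scan from offset " + scanFrom + " to offset " + scanTo);
    }
}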
We know that keys arrive in arbitrary order when they are written, so how can we ensure that the keys in an SSTable are ordered? A simple and convenient way is to first save them in an in-memory red-black tree, which keeps its keys sorted, and then write the tree's contents to the log file in order.
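In Java, TreeMap is backed by a red-black tree, and that is what the memory table in the code below uses. A tiny illustration: keys inserted in arbitrary order come back out sorted, ready to be written as an SSTable.
import java.util.TreeMap;

public class MemTableSketch {
    public static void main(String[] args) {
        //TreeMap is a red-black tree, so iteration order is always key order
        TreeMap<String, String> memTable = new TreeMap<>();
        memTable.put("handsome", "v1");
        memTable.put("handbag", "v2");
        memTable.put("handiwork", "v3");
        //Writing the entries in iteration order produces a sorted SSTable segment
        memTable.forEach((k, v) -> System.out.println(k + " -> " + v));
        //Prints: handbag -> v2, handiwork -> v3, handsome -> v1
    }
}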
The workflow of the storage engine is as follows:
1) When a write comes in, add it to the in-memory ordered tree (the memory table).
2) When the memory table exceeds a threshold, write it out to disk as an SSTable file; since the tree is ordered, the keys in the file are ordered too.
3) When a read comes in, look in the memory table first, then in the newest SSTable file, then in older ones in turn.
4) From time to time, merge and compact the SSTable files in the background, discarding overwritten and deleted values.
The algorithm above is the LSM-Tree (Log-Structured Merge-Tree), which is built on the idea of merging log files. Storage engines based on this principle of merging and compacting sorted files are usually called LSM storage engines. This is also the underlying principle of databases such as HBase and LevelDB.
We already know the implementation algorithm of the LSM-Tree. However, there are still many design issues to be considered in the specific implementation. Next, I will select some key designs for analysis.
What should the memory table store as its value? Should it store the raw data directly, or the write commands (including set and rm)? This is the first design problem we face. Let's hold off on the answer and look at the next problem first.
When the memory table reaches a certain size, it is written to a log file for persistence. This is easy if writes are simply disabled during the process. However, what if the memory table must keep serving read and write requests properly while it is being written to the file?
One solution is to switch the current memory table pointer to a new memory table instance B while memory table A is being persisted. After the switch, A must be read-only and only B may be written; otherwise there is no guarantee that persisting memory table A is an atomic operation.
Therefore, if writes are forbidden while the memory table is being persisted, the value can store the raw data directly. However, if we want to persist the memory table without blocking writes, the value must store commands. In particular, a delete cannot physically remove the key from the read-only old table or from older SSTables; it can only be recorded as an rm command (a tombstone) that overrides older set commands for the same key. Since we pursue high performance and do not want to block writes, the value is saved as a command. HBase is designed this way for the same reason.
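Conceptually, the stored commands look roughly like the sketch below. The exact classes live in TinyKvStore, so the fields and constructors here are an approximation; what matters is that a write is recorded as a set command and a delete as an rm tombstone, rather than mutating raw data in place.
//A rough sketch of the stored commands (the real classes are in TinyKvStore)
public interface Command {
    String getKey();
}

class SetCommand implements Command {
    private final String key;
    private final String value;

    SetCommand(String key, String value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public String getKey() { return key; }

    public String getValue() { return value; }
}

//A delete is recorded as a tombstone that overrides older set commands for the same key
class RmCommand implements Command {
    private final String key;

    RmCommand(String key) { this.key = key; }

    @Override
    public String getKey() { return key; }
}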
Moreover, if the memory table exceeds the threshold and needs to be persisted while the previous persistence has not finished yet, we must wait for the previous persistence to complete before starting the current one. In other words, memory table persistence can only run serially.
To read files efficiently, we need to design the file format carefully.
The following is the SSTable log format I designed:
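Reconstructed from the read and write code shown later, an SSTable file is laid out roughly as follows:
[data part]        consecutive JSON segments, each holding at most partSize commands keyed by their keys
[sparse index]     a JSON-serialized TreeMap<String, Position> mapping the first key of each segment to its (start, len) in the file
[table meta info]  six longs appended at the end: partSize, dataStart, dataLen, indexStart, indexLen, version
A data segment written in JSON looks roughly like {"handbag":{"key":"handbag","value":"8786"},"handiwork":{"key":"handiwork","value":"13"}}; the exact field names depend on how the command objects are serialized.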
The log format above is a mini implementation and is simpler than HBase's log format, which makes the principle easier to understand. I also write the files in JSON format so that they are easy to read. In production, however, efficiency comes first, and the data would be compressed to save storage space.
The code I wrote is available in TinyKvStore. Below, I analyze the key parts. There is a lot of detailed code; if you only care about the principle, you can skip this part. If you want to understand the implementation, keep reading.
Persisting a memory table to an SSTable means writing the memory table's data to a file in the log format above. What the SSTable stores are the data commands, including set and rm. As long as we know the latest command for a key, we know that key's state in the database.
/**
 * Convert from memory table to SSTable
 * @param index
 */
private void initFromIndex(TreeMap<String, Command> index) {
    try {
        JSONObject partData = new JSONObject(true);
        tableMetaInfo.setDataStart(tableFile.getFilePointer());
        for (Command command : index.values()) {
            //Process set command
            if (command instanceof SetCommand) {
                SetCommand set = (SetCommand) command;
                partData.put(set.getKey(), set);
            }
            //Process rm command
            if (command instanceof RmCommand) {
                RmCommand rm = (RmCommand) command;
                partData.put(rm.getKey(), rm);
            }
            //When the segment size is reached, write out a data segment
            if (partData.size() >= tableMetaInfo.getPartSize()) {
                writeDataPart(partData);
            }
        }
        //If any data is left after the traversal (the tail does not necessarily reach the segment size), write it to the file
        if (partData.size() > 0) {
            writeDataPart(partData);
        }
        long dataPartLen = tableFile.getFilePointer() - tableMetaInfo.getDataStart();
        tableMetaInfo.setDataLen(dataPartLen);
        //Save the sparse index
        byte[] indexBytes = JSONObject.toJSONString(sparseIndex).getBytes(StandardCharsets.UTF_8);
        tableMetaInfo.setIndexStart(tableFile.getFilePointer());
        tableFile.write(indexBytes);
        tableMetaInfo.setIndexLen(indexBytes.length);
        LoggerUtil.debug(LOGGER, "[SsTable][initFromIndex][sparseIndex]: {}", sparseIndex);
        //Save the file meta information
        tableMetaInfo.writeToFile(tableFile);
        LoggerUtil.info(LOGGER, "[SsTable][initFromIndex]: {},{}", filePath, tableMetaInfo);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
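writeDataPart is not shown above. In essence it writes one JSON segment, records a sparse index entry pointing at that segment, and clears the buffer so the next segment starts empty. A plausible sketch, as a method of the same SsTable class (the Position constructor shown here is an assumption):
private void writeDataPart(JSONObject partData) throws IOException {
    byte[] partDataBytes = partData.toJSONString().getBytes(StandardCharsets.UTF_8);
    long start = tableFile.getFilePointer();
    tableFile.write(partDataBytes);
    //Only the first key of the segment goes into the sparse index, which is what keeps the index small
    String firstKey = partData.keySet().iterator().next();
    sparseIndex.put(firstKey, new Position(start, partDataBytes.length));
    //Reset the buffer so the next segment starts empty
    partData.clear();
}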
The write format is derived backward from how it will be read, to make reading easier. For example, tableMetaInfo is written sequentially at the end of the file but read back to front from the file end; that is why version is written last. Since it is the first field to be read, it becomes easier to upgrade the log format later. Without trying these tricks yourself, it is hard to see why they are done this way.
/**
 * Write the meta information to a file
 * @param file
 */
public void writeToFile(RandomAccessFile file) {
    try {
        file.writeLong(partSize);
        file.writeLong(dataStart);
        file.writeLong(dataLen);
        file.writeLong(indexStart);
        file.writeLong(indexLen);
        file.writeLong(version);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
/**
 * Read the meta information from the file, reading backwards in the order it was written
 * @param file
 * @return
 */
public static TableMetaInfo readFromFile(RandomAccessFile file) {
    try {
        TableMetaInfo tableMetaInfo = new TableMetaInfo();
        long fileLen = file.length();
        file.seek(fileLen - 8);
        tableMetaInfo.setVersion(file.readLong());
        file.seek(fileLen - 8 * 2);
        tableMetaInfo.setIndexLen(file.readLong());
        file.seek(fileLen - 8 * 3);
        tableMetaInfo.setIndexStart(file.readLong());
        file.seek(fileLen - 8 * 4);
        tableMetaInfo.setDataLen(file.readLong());
        file.seek(fileLen - 8 * 5);
        tableMetaInfo.setDataStart(file.readLong());
        file.seek(fileLen - 8 * 6);
        tableMetaInfo.setPartSize(file.readLong());
        return tableMetaInfo;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
When loading an SSTable from a file, only the sparse index needs to be loaded into memory, which saves memory. The other parts, such as the data segments, can be read on demand when querying.
/**
 * Restore the SSTable from a file into memory
 */
private void restoreFromFile() {
    try {
        //Read the meta information first
        TableMetaInfo tableMetaInfo = TableMetaInfo.readFromFile(tableFile);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][tableMetaInfo]: {}", tableMetaInfo);
        //Read the sparse index
        byte[] indexBytes = new byte[(int) tableMetaInfo.getIndexLen()];
        tableFile.seek(tableMetaInfo.getIndexStart());
        tableFile.read(indexBytes);
        String indexStr = new String(indexBytes, StandardCharsets.UTF_8);
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][indexStr]: {}", indexStr);
        sparseIndex = JSONObject.parseObject(indexStr,
                new TypeReference<TreeMap<String, Position>>() {
                });
        this.tableMetaInfo = tableMetaInfo;
        LoggerUtil.debug(LOGGER, "[SsTable][restoreFromFile][sparseIndex]: {}", sparseIndex);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
When querying data from an SSTable, the first step is to find, from the sparse index, the interval where the key may be located. Then read the data in that interval according to the positions recorded in the index and search it. If the key is found, its data is returned; otherwise, null is returned.
/**
 * Query data from the SSTable
 * @param key
 * @return
 */
public Command query(String key) {
    try {
        LinkedList<Position> sparseKeyPositionList = new LinkedList<>();
        Position lastSmallPosition = null;
        Position firstBigPosition = null;
        //From the sparse index, find the last position whose key is <= the target key and the first position whose key is > the target key
        for (String k : sparseIndex.keySet()) {
            if (k.compareTo(key) <= 0) {
                lastSmallPosition = sparseIndex.get(k);
            } else {
                firstBigPosition = sparseIndex.get(k);
                break;
            }
        }
        if (lastSmallPosition != null) {
            sparseKeyPositionList.add(lastSmallPosition);
        }
        if (firstBigPosition != null) {
            sparseKeyPositionList.add(firstBigPosition);
        }
        if (sparseKeyPositionList.size() == 0) {
            return null;
        }
        LoggerUtil.debug(LOGGER, "[SsTable][query][sparseKeyPositionList]: {}", sparseKeyPositionList);
        Position firstKeyPosition = sparseKeyPositionList.getFirst();
        Position lastKeyPosition = sparseKeyPositionList.getLast();
        long start = 0;
        long len = 0;
        start = firstKeyPosition.getStart();
        if (firstKeyPosition.equals(lastKeyPosition)) {
            len = firstKeyPosition.getLen();
        } else {
            len = lastKeyPosition.getStart() + lastKeyPosition.getLen() - start;
        }
        //If the key exists, it must be within this interval, so only data within the interval is read to reduce I/O
        byte[] dataPart = new byte[(int) len];
        tableFile.seek(start);
        tableFile.read(dataPart);
        int pStart = 0;
        //Read the partition data segment by segment
        for (Position position : sparseKeyPositionList) {
            JSONObject dataPartJson = JSONObject.parseObject(new String(dataPart, pStart, (int) position.getLen()));
            LoggerUtil.debug(LOGGER, "[SsTable][query][dataPartJson]: {}", dataPartJson);
            if (dataPartJson.containsKey(key)) {
                JSONObject value = dataPartJson.getJSONObject(key);
                return ConvertUtil.jsonToCommand(value);
            }
            pStart += (int) position.getLen();
        }
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
The startup process is simple: load the configuration, initialize the state, and, if data recovery is needed, restore the data from the WAL to the memory table.
/**
 * Initialization
 * @param dataDir Data directory
 * @param storeThreshold Persistence threshold
 * @param partSize Data partition size
 */
public LsmKvStore(String dataDir, int storeThreshold, int partSize) {
    try {
        this.dataDir = dataDir;
        this.storeThreshold = storeThreshold;
        this.partSize = partSize;
        this.indexLock = new ReentrantReadWriteLock();
        File dir = new File(dataDir);
        File[] files = dir.listFiles();
        ssTables = new LinkedList<>();
        index = new TreeMap<>();
        //If the directory is empty, there is no SSTable to load
        if (files == null || files.length == 0) {
            walFile = new File(dataDir + WAL);
            wal = new RandomAccessFile(walFile, RW_MODE);
            return;
        }
        //Load the SSTables from newest to oldest
        TreeMap<Long, SsTable> ssTableTreeMap = new TreeMap<>(Comparator.reverseOrder());
        for (File file : files) {
            String fileName = file.getName();
            //Recover data from the staged WAL; walTmp is usually left behind by an exception during SSTable persistence
            if (file.isFile() && fileName.equals(WAL_TMP)) {
                restoreFromWal(new RandomAccessFile(file, RW_MODE));
            }
            //Load the SSTable
            if (file.isFile() && fileName.endsWith(TABLE)) {
                int dotIndex = fileName.indexOf(".");
                Long time = Long.parseLong(fileName.substring(0, dotIndex));
                ssTableTreeMap.put(time, SsTable.createFromFile(file.getAbsolutePath()));
            } else if (file.isFile() && fileName.equals(WAL)) {
                //Load the WAL
                walFile = file;
                wal = new RandomAccessFile(file, RW_MODE);
                restoreFromWal(wal);
            }
        }
        ssTables.addAll(ssTableTreeMap.values());
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
For write operations, first acquire the write lock, then save the data to the WAL and the memory table. In addition, a check is made: if the threshold is exceeded, the memory table is persisted. For simplicity, I run the persistence directly on the calling thread instead of using a thread pool, but this does not affect the overall logic. The code for set and rm is similar, so rm is not repeated here.
@Override
public void set(String key, String value) {
    try {
        SetCommand command = new SetCommand(key, value);
        byte[] commandBytes = JSONObject.toJSONBytes(command);
        indexLock.writeLock().lock();
        //Save data to WAL first
        wal.writeInt(commandBytes.length);
        wal.write(commandBytes);
        index.put(key, command);
        //When the size exceeds the threshold, the memory table must be persisted
        if (index.size() > storeThreshold) {
            switchIndex();
            storeToSsTable();
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}
To switch the memory table and its associated WAL: first acquire the lock, then create a new memory table and a new WAL while staging the old memory table and old WAL, and finally release the lock. From then on, the new memory table accepts writes while the old memory table is effectively read-only.
During persistence, the old memory table is written sequentially into a new SSTable, after which the staged memory table and WAL are deleted.
/**
 * Switch the memory table: create a new memory table and stage the old one
 */
private void switchIndex() {
    try {
        indexLock.writeLock().lock();
        //Switch the memory table
        immutableIndex = index;
        index = new TreeMap<>();
        wal.close();
        //Switch the WAL after switching the memory table
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
        if (!walFile.renameTo(tmpWal)) {
            throw new RuntimeException("Failed to rename file: walTmp");
        }
        walFile = new File(dataDir + WAL);
        wal = new RandomAccessFile(walFile, RW_MODE);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.writeLock().unlock();
    }
}
/**
 * Save data to an SSTable
 */
private void storeToSsTable() {
    try {
        //SSTable files are named by timestamp so that the names are incremental
        SsTable ssTable = SsTable.createFromIndex(dataDir + System.currentTimeMillis() + TABLE, partSize, immutableIndex);
        ssTables.addFirst(ssTable);
        //Delete the staged memory table and WAL_TMP after persistence is complete
        immutableIndex = null;
        File tmpWal = new File(dataDir + WAL_TMP);
        if (tmpWal.exists()) {
            if (!tmpWal.delete()) {
                throw new RuntimeException("Failed to delete file: walTmp");
            }
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}
The query operation follows the algorithm described earlier:
@Override
public String get(String key) {
    try {
        indexLock.readLock().lock();
        //Look it up in the memory table first
        Command command = index.get(key);
        //If not found, try the immutable memory table, which may currently be in the process of being persisted to an SSTable
        if (command == null && immutableIndex != null) {
            command = immutableIndex.get(key);
        }
        if (command == null) {
            //If it is not in either memory table, search the SSTables from newest to oldest
            for (SsTable ssTable : ssTables) {
                command = ssTable.query(key);
                if (command != null) {
                    break;
                }
            }
        }
        if (command instanceof SetCommand) {
            return ((SetCommand) command).getValue();
        }
        if (command instanceof RmCommand) {
            return null;
        }
        //Null means the key does not exist
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    } finally {
        indexLock.readLock().unlock();
    }
}
Without building a database ourselves, it is difficult to understand why it is designed this way: for example, why the log format is designed as it is, and why the database stores data operations instead of the data itself.
The database described in this article is still quite simple and can be optimized in many ways, for example, making persistence asynchronous, compressing and compacting the log files, and filtering queries with Bloom filters.
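As an illustration of the last point, a Bloom filter answers "definitely not present" or "possibly present" for a key, so a negative answer lets a query skip an SSTable entirely without touching the disk. A minimal hand-rolled sketch (in practice a library implementation such as Guava's BloomFilter would be used):
import java.util.BitSet;

/** A minimal Bloom filter sketch: "might contain" or "definitely does not contain". */
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    private int hash(String key, int seed) {
        //Derive several hash positions from the key; floorMod keeps the index non-negative
        int h = key.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(hash(key, i));
        }
    }

    /** false means the key is certainly absent, so the SSTable scan can be skipped */
    public boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }
}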
Designing Data-Intensive Applications