By Wenqu and Haiqian
The sixth article of this series (OceanBase Source Code Interpretation (6): Detailed Explanation of Storage Engines) explained the OceanBase storage engine in detail and answered questions about the OceanBase database.
The seventh article of this series briefly introduces the index build process of OceanBase from the code's perspective and explains the relevant index build code.
First of all, what are the semantics of an index table in a typical database?
Independent from the primary table (also called the data table), the index table is a redundant, ordered copy of part of the table's data, created to speed up certain queries. The reason an index table can accelerate queries is that it is sorted by index keys: if a query's condition matches a prefix of the index key, the matching rows can be located quickly through binary search. Each row of the index table also carries the primary key, so the complete row can then be fetched quickly from the primary table through the primary key. This step is called going back to the table.
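To make this concrete, here is a minimal illustrative sketch in C++ (not OceanBase code): the index is a sorted array of (index key, primary key) pairs, a lookup binary-searches the index by key, and the primary key stored in the index row is used to fetch the complete row from the primary table.

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One row of the index table: the index key plus the primary key of the
// corresponding primary-table row.
struct IndexRow {
  std::string index_key;
  int64_t primary_key;
};

// The index rows are sorted by index_key, so an equality lookup is a binary
// search; the stored primary key is then used to "go back to the table".
int64_t index_lookup(const std::vector<IndexRow> &index,
                     const std::map<int64_t, std::string> &primary_table,
                     const std::string &key, std::string &full_row) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const IndexRow &row, const std::string &k) { return row.index_key < k; });
  if (it == index.end() || it->index_key != key) {
    return -1;  // no matching index row
  }
  full_row = primary_table.at(it->primary_key);  // back to the table
  return it->primary_key;
}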
Now that we know the semantics of the index table, how do we create an index?
Like the primary table, the index table has its own schema, memory structures, and persistent data (usually stored on the local disk); in distributed scenarios, it also has location information. Creating an index therefore means creating the index table's schema, then creating the index table's memory structures at the chosen locations and building its persistent data.
During index creation, we do not want to affect normal reads and writes on the primary table; the business stays online while the index is built. Different vendors implement online index build differently. This article introduces how OceanBase implements it; readers interested in other solutions can refer to the relevant documents.
First, let's look at the index build process from the user's perspective. For example, a user issues the index build statement "create index i1 on t1(c2)" in a session and waits on that session until the index build succeeds or fails.
What does the process look like from the observer's perspective? First, the statement text is sent to a random observer (obs); the obs that receives it is called the central obs. As with other statements, the index build SQL is first identified as a DDL statement by the parser and resolver and parsed into a structure such as ObCreateIndexArg. OceanBase sends DDL statements to RootService (RS) for processing, so the central obs sends an index build RPC request, carrying the ObCreateIndexArg, to RS.
After receiving the request, RS processes it in the ObRootService::create_index function. Once RS completes the necessary synchronous work, it replies to the RPC from the central obs, but the index build is not yet complete: RS advances the build through asynchronous tasks. After receiving the reply from RS, the central obs keeps polling the schema status of the index table to learn the result of the build. If the build completes, it returns success to the client; if the build fails, it returns the failure error code to the client.
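Conceptually, the waiting logic on the central obs is a polling loop like the sketch below (query_index_status is a hypothetical stand-in; the real code refreshes and inspects the index table schema):

#include <chrono>
#include <cstdint>
#include <thread>

enum ObIndexStatus {
  INDEX_STATUS_UNAVAILABLE,  // set when the index schema is first created
  INDEX_STATUS_AVAILABLE,    // build finished successfully
  INDEX_STATUS_INDEX_ERROR,  // build failed
};

// Hypothetical stand-in for querying the index table's schema status.
ObIndexStatus query_index_status(int64_t /*index_table_id*/) {
  return INDEX_STATUS_AVAILABLE;  // stub
}

// The central obs conceptually loops like this after RS acknowledges the DDL RPC.
int wait_index_build(int64_t index_table_id) {
  for (;;) {
    const ObIndexStatus status = query_index_status(index_table_id);
    if (INDEX_STATUS_AVAILABLE == status) {
      return 0;   // report build success to the client session
    } else if (INDEX_STATUS_INDEX_ERROR == status) {
      return -1;  // report the failure error code to the client session
    }
    // still INDEX_STATUS_UNAVAILABLE: the asynchronous RS tasks are not done yet
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}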
As mentioned above, RS completes some synchronous processing before replying to the central obs. In this section, we take a look at the specifics of that process.
Call path:
ObRootService::create_index -> ObIndexBuilder::create_index -> ObIndexBuilder::do_create_index -> ObIndexBuilder::do_create_local_index
-> ObIndexBuilder::do_create_global_index
The call chain above performs some defensive checks: for example, OceanBase does not support building indexes on system tables or on tables in the recycle bin, and index creation is rejected if the number of indexes on a table already exceeds the upper limit. After the checks, the local or global index build path is chosen based on the index type.
What is the difference between a global index and a local index? The main difference is that a local index is partition-level: the partitions of the index table correspond one-to-one to the partitions of the primary table. A global index is table-level: the partitions of the global index table have no correspondence with the partitions of the primary table.
In a word, local is at the partition level, and global is at the table level. For example, suppose table t1 has two hash partitions. A local index i1 on t1 must then have two partitions: the first partition of i1 indexes the first partition of t1, and the second partition of i1 indexes the second partition of t1. A global index i2 on t1, by contrast, can have one or more partitions, and its partitions do not correspond to those of the primary table.
Because the partitions of a local index correspond one-to-one to the partitions of the primary table, OceanBase binds the partitions of a local index tightly to the partitions of the primary table. This keeps the location information of the primary table partitions and the index table partitions consistent (on the same machine), thus avoiding cross-machine distributed transactions. Consequently, when the index build path is selected above, there is an optimization for global indexes: if both the primary table and the index table of a global index are non-partitioned, the global index can follow the local index build process. A sketch of the partition routing difference follows.
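This is illustrative C++ only; the hash-based global partitioning below is an assumed example scheme, not a statement about any particular index.

#include <cstdint>

// Local index: index partition i indexes exactly primary partition i, so the
// mapping from a primary partition to its index partition is the identity.
int64_t local_index_partition(int64_t primary_partition_id) {
  return primary_partition_id;  // bound one-to-one and co-located
}

// Global index: the index table is partitioned independently of the primary
// table, e.g. by hashing the index key, so there is no correspondence.
int64_t global_index_partition(uint64_t index_key_hash, int64_t index_partition_cnt) {
  return static_cast<int64_t>(index_key_hash % static_cast<uint64_t>(index_partition_cnt));
}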
Call path of key functions:
ObIndexBuilder::do_create_global_index -> ObIndexBuilder::generate_schema
-> ObDDLService::create_global_index -> ObDDLService::generate_global_index_locality_and_primary_zone
-> ObDDLService::create_user_table -> ObDDLService::create_table_in_trans -> ObDDLOperator::create_table
-> ObDDLService::create_table_partitions
-> ObDDLService::publish_schema
-> ObIndexBuilder::submit_build_global_index_task -> ObGlobalIndexBuilder::submit_build_global_index_task
ObIndexBuilder::generate_schema is responsible for generating the basic information of the index table schema; most other information is inherited from the primary table. For an index table, the column information is the main concern. An ordinary index table contains only the index columns and the primary key columns (duplicate primary key columns are omitted). A unique index table must contain the index columns, hidden key columns, and primary key columns. What is the hidden key for? It solves the comparison problem when an index column's value is NULL: in SQL semantics, NULL is not equal to NULL, but in code comparison two NULL values are equal. So if an index column contains NULL, the actual primary key value is stored in the hidden key column; if the index column is not NULL, NULL is stored in the hidden key column. Comparing index keys together with the hidden key columns then yields the SQL semantics (NULL and NULL are not equal). generate_schema only produces an in-memory schema object for the index table; at this point the index table is not yet usable, so its status is set to INDEX_STATUS_UNAVAILABLE in the schema.
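The hidden-key filling rule can be sketched as follows (hypothetical types, not OceanBase code):

#include <optional>
#include <string>

using Value = std::optional<std::string>;  // std::nullopt models SQL NULL

// Key of a unique index row: the user index column plus the hidden
// (shadow) primary key column.
struct UniqueIndexKey {
  Value index_col;
  Value hidden_key;
};

UniqueIndexKey make_unique_key(const Value &index_col, const std::string &pk) {
  UniqueIndexKey key;
  key.index_col = index_col;
  // NULL index value -> store the primary key in the hidden column, so two
  // rows with NULL keys stay distinct; non-NULL value -> hidden column is NULL.
  key.hidden_key = index_col.has_value() ? Value{} : Value{pk};
  return key;
}

With this rule, two rows whose index columns are both NULL carry different hidden-key values (their distinct primary keys) and therefore never conflict in the unique index, while rows with equal non-NULL index values still do.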
After the index table schema is generated, it needs to be written to the internal tables. This step is done by ObDDLOperator::create_table.
Next, the memory structures of the index table must be created on the relevant machines. The location information of the index table is generated through ObDDLService::generate_global_index_locality_and_primary_zone, and ObDDLService::create_table_partitions sends RPCs to the target machines to notify them to create the memory structures of each index table partition, including the memtable, the table_store, and the mapping from partition_key to table_store. Then, other machines are notified to refresh the schema through ObDDLService::publish_schema.
After the schema and memory structures of the index table are created, the data completion control task of the global index is submitted to a queue through ObGlobalIndexBuilder::submit_build_global_index_task. This control task later drives the data completion process of the global index.
When the control task is submitted, submit_build_global_index_task creates a task record in the internal table __all_index_build_stat, and subsequent status changes of the control task are also updated in __all_index_build_stat.
The global index control task is executed by the ObGlobalIndexBuilder thread pool. This thread pool has only one thread, and the queue length is limited only by memory (no memory upper limit is set). The entry point for task execution is ObGlobalIndexBuilder::run3 -> ObGlobalIndexBuilder::try_drive.
Call path of key functions:
ObIndexBuilder::do_create_local_index -> ObIndexBuilder::generate_schema
-> ObDDLService::create_user_table -> ObDDLService::create_table_in_trans -> ObDDLOperator::create_table
-> ObDDLService::create_table_partitions
-> ObDDLService::publish_schema
-> ObIndexBuilder::submit_build_local_index_task -> ObRSBuildIndexScheduler::push_task
The procedures for generating the schema and creating memory objects for a local index are almost the same as for a global index. The only difference is that a local index does not need its own location information. The other steps are not repeated here.
After the schema and memory structures of the index table are created, the local index control task ObRSBuildIndexTask is put into a queue through ObRSBuildIndexScheduler::push_task, and the internal table __all_index_build_stat is updated at the same time.
The ObDDLTaskExecutor is responsible for executing the control task of the local index. This executor has only one thread, and the queue length is limited by memory (with an upper limit of 1 GB). The entry point for task execution is ObDDLTaskExecutor::run1 -> ObRSBuildIndexTask::process.
The control task of the global index, ObGlobalIndexTask, implements a simple state machine that executes a corresponding function for each task state (see the sketch after the code path below). The overall idea is to first build the baseline data on one replica of the index table, copy the baseline data to the other replicas, perform the necessary consistency and uniqueness checks, and finally let the index take effect.
Code path:
process_function task_status
----------------------------------------------------------------------------
ObGlobalIndexBuilder::try_drive -> try_build_single_replica GIBS_BUILD_SINGLE_REPLICA
-> try_copy_multi_replica GIBS_MULTI_REPLICA_COPY
-> try_unique_index_calc_checksum GIBS_UNIQUE_INDEX_CALC_CHECKSUM
-> try_unique_index_check GIBS_UNIQUE_INDEX_CHECK
-> try_handle_index_build_take_effect GIBS_INDEX_BUILD_TAKE_EFFECT
-> try_handle_index_build_failed GIBS_INDEX_BUILD_FAILED
-> try_handle_index_build_finish GIBS_INDEX_BUILD_FINISH
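The state machine can be condensed into the sketch below. The status names match the table above; the dispatch code is a simplified illustration, not the real try_drive implementation.

enum ObGlobalIndexBuildStatus {
  GIBS_BUILD_SINGLE_REPLICA,
  GIBS_MULTI_REPLICA_COPY,
  GIBS_UNIQUE_INDEX_CALC_CHECKSUM,
  GIBS_UNIQUE_INDEX_CHECK,
  GIBS_INDEX_BUILD_TAKE_EFFECT,
  GIBS_INDEX_BUILD_FAILED,
  GIBS_INDEX_BUILD_FINISH,
};

struct ObGlobalIndexTask {
  ObGlobalIndexBuildStatus status;
};

// Declarations only; each handler advances task.status on success.
int try_build_single_replica(ObGlobalIndexTask &);
int try_copy_multi_replica(ObGlobalIndexTask &);
int try_unique_index_calc_checksum(ObGlobalIndexTask &);
int try_unique_index_check(ObGlobalIndexTask &);
int try_handle_index_build_take_effect(ObGlobalIndexTask &);
int try_handle_index_build_failed(ObGlobalIndexTask &);
int try_handle_index_build_finish(ObGlobalIndexTask &);

int try_drive(ObGlobalIndexTask &task) {
  switch (task.status) {
    case GIBS_BUILD_SINGLE_REPLICA:       return try_build_single_replica(task);
    case GIBS_MULTI_REPLICA_COPY:         return try_copy_multi_replica(task);
    case GIBS_UNIQUE_INDEX_CALC_CHECKSUM: return try_unique_index_calc_checksum(task);
    case GIBS_UNIQUE_INDEX_CHECK:         return try_unique_index_check(task);
    case GIBS_INDEX_BUILD_TAKE_EFFECT:    return try_handle_index_build_take_effect(task);
    case GIBS_INDEX_BUILD_FAILED:         return try_handle_index_build_failed(task);
    case GIBS_INDEX_BUILD_FINISH:         return try_handle_index_build_finish(task);
  }
  return -1;  // unknown status
}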
Single-replica build refers to building the index table's baseline data on one replica. In OceanBase's LSM-Tree structure, the baseline data is the major SSTable of the index table.
Code path:
ObGlobalIndexBuilder::try_build_single_replica -> launch_new_build_single_replica -> get_global_index_build_snapshot -> do_get_associated_snapshot
-> hold_snapshot
-> update_task_global_index_build_snapshot
-> do_build_single_replica -> ObRootService::submit_index_sstable_build_task
-> drive_this_build_single_replica -> ObIndexChecksumOperator::check_column_checksum
To build the single replica, a snapshot point must be selected such that all DML operations on the primary table after the snapshot point (the incremental data) can see the index table; that is, DML after the snapshot point modifies the index table at the same time. The index table, however, is not yet available to query operations. This write-only behavior is the key to how OceanBase implements online index build.
Once the baseline data (the stock data) is built at this snapshot point, an LSM-Tree query merges the data of multiple layers, so the completeness of the index table data is guaranteed. Suppose the schema_version is v1 when the index table is created. To obtain such a snapshot point, you must wait until all transactions that depend on schema_version <= v1 have finished. In do_get_associated_snapshot, RPCs are sent to the primary table's partitions asking whether these transactions have finished; the obs receiving the request handles it in ObService::check_schema_version_elapsed, and do_get_associated_snapshot waits for all RPCs to return through wait_all. Note: these RPCs are batched and synchronous, so a very large number of partitions may block the index task thread for a long time.
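The batch-and-wait structure of this step looks roughly like the following (hypothetical proxy type and signatures; the real code uses OceanBase's batch RPC facilities):

#include <cstdint>
#include <vector>

struct ObPartitionKey {
  int64_t table_id;
  int64_t partition_id;
};

// Hypothetical batch RPC proxy: one call per partition, then a synchronous
// wait for every response.
class ObCheckSchemaElapsedProxy {
 public:
  int call(const ObPartitionKey &pkey, int64_t schema_version);
  int wait_all(std::vector<int> &return_codes);  // blocks until all RPCs return
};

int get_build_snapshot(ObCheckSchemaElapsedProxy &proxy,
                       const std::vector<ObPartitionKey> &primary_partitions,
                       int64_t index_schema_version) {
  for (const auto &pkey : primary_partitions) {
    // ask each partition whether transactions with schema_version <= v1 ended;
    // with many partitions, this loop plus wait_all can block the index task
    // thread for a long time (the note above)
    if (proxy.call(pkey, index_schema_version) != 0) {
      return -1;
    }
  }
  std::vector<int> rcs;
  return proxy.wait_all(rcs);  // all partitions must confirm before the snapshot is chosen
}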
The snapshot point must be held to ensure it is not released while the single replica is being built. (If a snapshot is held for too long, the number of table_stores may explode.) The selected snapshot point is then updated in the internal table __all_index_build_stat. Finally, a build task ObIndexSSTableBuildTask for the index table's baseline data is submitted.
After the baseline data build task is submitted, drive_this_build_single_replica keeps checking the task status. Once the baseline data build is complete, the data consistency of the primary table and the index table is verified through checksums.
The task, ObIndexSSTableBuildTask, is executed by the IdxBuild thread pool. The task queue length is 4096, and the number of threads is 16.
Here is the ObIndexSSTableBuildTask execution process and code path:
ObIndexSSTableBuildTask::process -> ObIndexSSTableBuilder::init
-> ObIndexSSTableBuilder::build -> ObCommonSqlProxy::execute -> ObInnerSQLConnection::execute -> ObInnerSQLConnection::query -> ObInnerSQLConnection::do_query -> ObIndexSSTableBuilder::ObBuildExecutor::execute -> ObIndexSSTableBuilder::build
-> ObIndexSSTableBuilder::ObBuildExecutor::process_result -> ObResultSet::get_next_row
-> ObGlobalIndexBuilder::on_build_single_replica_reply
The function ObIndexSSTableBuilder::build is executed synchronously, so at most 16 baseline build tasks run in the system at the same time. After execution completes, on_build_single_replica_reply changes the status of the baseline build task.
The code path above looks complicated, but in the end a physical execution plan is constructed through ObIndexSSTableBuilder::build and executed through ObResultSet::get_next_row. The following code path shows how the physical execution plan is generated; the constants starting with PHY_ are the types of physical operators.
ObIndexSSTableBuilder::build -> generate_build_param -> split_ranges
-> store_build_param
-> gen_data_scan PHY_TABLE_SCAN_WITH_CHECKSUM
PHY_UK_ROW_TRANSFORM
-> gen_data_exchange PHY_DETERMINATE_TASK_TRANSMIT
PHY_TASK_ORDER_RECEIVE
-> gen_build_macro PHY_SORT
PHY_APPEND_LOCAL_SORT_DATA
-> gen_macro_exchange PHY_DETERMINATE_TASK_TRANSMIT
PHY_TASK_ORDER_RECEIVE
-> gen_build_sstable PHY_APPEND_SSTABLE
-> gen_sstable_exchange PHY_DETERMINATE_TASK_TRANSMIT
PHY_TASK_ORDER_RECEIVE
The final physical execution plan is shown below:
coordinator | ObTaskOrderReceive
  transmit | ObDeterminateTaskTransmit
    append_sstable | ObTableAppendSSTable
      receive | ObTaskOrderReceive
        transmit_macro_block | ObDeterminateTaskTransmit
          append_local_sort_data | ObTableAppendLocalSortData
            sort | ObSort
              receive | ObTaskOrderReceive
                transmit_by_range | ObDeterminateTaskTransmit
                  table_scan_with_checksum | ObTableScanWithChecksum
Code path:
ObGlobalIndexBuilder::try_copy_multi_replica -> launch_new_copy_multi_replica -> build_task_partition_sstable_stat -> generate_task_partition_sstable_array
-> drive_this_copy_multi_replica -> check_partition_copy_replica_stat
-> build_replica_sstable_copy_task -> ObCopySSTableTask::build
-> ObRebalanceTaskMgr::add_task
Multi-replica copy is the process of copying the baseline data built during single-replica construction to the other replicas. The actual data copy is completed by ObCopySSTableTask, which is executed by the ObRebalanceTaskMgr of RS. The execution entry is ObCopySSTableTask::execute, which sends the copy_sstable_batch RPC; the obs receiving the RPC handles it in ObService::copy_sstable_batch. After the baseline data copy task finishes, the obs reports the result to RS, and RS executes the callback ObGlobalIndexBuilder::on_copy_multi_replica_reply to update the status of the multi-replica copy task.
For a unique index, you need to check the uniqueness of the data in index columns. You do not need to perform this check for a non-unique index.
Code path:
ObGlobalIndexBuilder::try_unique_index_calc_checksum -> launch_new_unique_index_calc_checksum -> get_checksum_calculation_snapshot -> do_get_associated_snapshot
-> do_checksum_calculation -> build_task_partition_col_checksum_stat
-> send_checksum_calculation_request -> send_col_checksum_calc_rpc
-> drive_this_unique_index_calc_checksum
A snapshot point must also be selected for the uniqueness check. After this snapshot point, all DML operations on the primary table (the incremental data) can see the baseline data of the index table, so the uniqueness of the incremental data is checked during the DML process itself. For the data before the snapshot point (the stock data), the column checksums of the primary table and the index table are calculated at this snapshot point, and uniqueness is checked by comparing the checksums. At this snapshot point, all new transactions on every replica must be able to see the baseline data. Let the maximum timestamp at which each replica sees the baseline data be sstable_ts; you must wait until the context creation timestamps of all transactions have passed sstable_ts. The function get_checksum_calculation_snapshot completes these operations, checking whether the transaction context creation timestamps have passed sstable_ts through the entry ObPartitionService::check_ctx_create_timestamp_elapsed.
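Conceptually, the check behind check_ctx_create_timestamp_elapsed is the following predicate (simplified types, not the real interface):

#include <cstdint>
#include <vector>

struct ObTransCtx {
  int64_t ctx_create_ts;  // when the transaction context was created
};

// The snapshot for checksum calculation is safe once every active transaction
// context on the partition was created after sstable_ts.
bool check_ctx_create_timestamp_elapsed(const std::vector<ObTransCtx> &active_ctxs,
                                        int64_t sstable_ts) {
  for (const auto &ctx : active_ctxs) {
    if (ctx.ctx_create_ts <= sstable_ts) {
      return false;  // an old transaction may not see the baseline yet; keep waiting
    }
  }
  return true;  // all new transactions can see the index baseline data
}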
After the snapshot point is available, RPCs are sent asking the leaders of the primary table and index table to calculate the column checksums at the snapshot point. The obs receiving the RPC handles it in ObService::calc_column_checksum_request. After the calculation completes, the column checksums are recorded in the internal table __all_index_checksum and RS is notified through an RPC. RS executes the callback ObGlobalIndexBuilder::on_col_checksum_calculation_reply to update the status of the checksum calculation task. drive_this_unique_index_calc_checksum continuously checks the status of the checksum calculation task; once all checksum calculations are complete, the checksum comparison is executed by ObGlobalIndexBuilder::try_unique_index_check -> ObIndexChecksumOperator::check_column_checksum.
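The idea of the checksum comparison can be sketched as follows (simplified; the real check_column_checksum reads the per-column checksums recorded in __all_index_checksum):

#include <cstdint>
#include <map>

// column_id -> checksum, conceptually what __all_index_checksum stores per table.
using ColumnChecksums = std::map<int64_t, uint64_t>;

// For every column of the index table, the checksum aggregated over the
// primary table must equal the checksum aggregated over the index table at
// the same snapshot; a mismatch means the data is inconsistent, which for a
// unique index build fails the uniqueness check.
bool check_column_checksum(const ColumnChecksums &primary_side,
                           const ColumnChecksums &index_side) {
  for (const auto &entry : index_side) {
    const auto it = primary_side.find(entry.first);
    if (it == primary_side.end() || it->second != entry.second) {
      return false;
    }
  }
  return true;
}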
If all the preceding steps succeed, ObGlobalIndexBuilder::try_handle_index_build_take_effect makes the index take effect by changing the schema status of the index table to INDEX_STATUS_AVAILABLE. Once the central obs sees this status, it returns build success to the client session.
If any of the preceding steps fails, try_handle_index_build_failed changes the index table status to INDEX_STATUS_INDEX_ERROR; once the central obs sees that status, it returns an index build failure to the client session.
After the index build process ends, whether it succeeded or failed, the intermediate state must be cleaned up, including clearing the intermediate results of SQL execution, releasing snapshots, and cleaning up internal tables.
Code path:
ObGlobalIndexBuilder::try_handle_index_build_finish -> clear_intermediate_result -> ObIndexSSTableBuilder::clear_interm_result
-> release_snapshot
The RS control process of local indexes is relatively simple because the RS side is not the main battlefield.
Code path:
ObRSBuildIndexTask::process -> wait_trans_end -> ObIndexWaitTransStatus::get_wait_trans_status
-> calc_snapshot_version
-> acquire_snapshot
-> wait_build_index_end -> report_index_status
-> report_index_status
-> release_snapshot
Since the partitions of a local index are bound one-to-one to the partitions of the primary table, the main battlefield of the local index build is the obs where the primary table partitions are located. An obs triggers local index build tasks by monitoring the DDL changes of each tenant: after the index table schema is published, the obs hosting the primary table refreshes the schema and initiates the local index build task.
Code path:
ObTenantDDLCheckSchemaTask::process -> process_schedule_build_index_task -> get_candidate_tables
-> find_build_index_partitions
-> generate_schedule_index_task -> ObBuildIndexScheduler::push_task(ObBuildIndexScheduleTask)
ObTenantDDLCheckSchemaTask finds the partition_keys for which an index must be built, generates an ObBuildIndexScheduleTask for each, and puts it into the ObDDLTaskExecutor of ObBuildIndexScheduler for execution. This executor has four threads, and the queue length is limited by memory, with a task queue upper limit of 1 GB.
Where does this monitoring task come from? When the core service partition_service of an obs starts, the sub-service ObBuildIndexScheduler is started with it. ObBuildIndexScheduler has a scheduled task, ObCheckTenantSchemaTask, which continuously generates an ObTenantDDLCheckSchemaTask for each tenant; these tasks are also executed in the ObDDLTaskExecutor of ObBuildIndexScheduler. See ObCheckTenantSchemaTask::runTimerTask for details.
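The fan-out relationship can be sketched as follows (stand-in types and method shapes, not the real interfaces):

#include <cstdint>
#include <vector>

struct ObTenantDDLCheckSchemaTask {
  uint64_t tenant_id;
};

// Stand-in for the shared executor (4 threads, ~1 GB task queue per the text).
class ObDDLTaskExecutor {
 public:
  int push_task(const ObTenantDDLCheckSchemaTask &task);
};

// Conceptual body of ObCheckTenantSchemaTask::runTimerTask: one schema-check
// task per tenant is pushed each round; each task later looks for unavailable
// index schemas and schedules ObBuildIndexScheduleTask for the partitions.
void run_timer_task(ObDDLTaskExecutor &executor,
                    const std::vector<uint64_t> &tenant_ids) {
  for (const uint64_t tenant_id : tenant_ids) {
    executor.push_task(ObTenantDDLCheckSchemaTask{tenant_id});
  }
}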
Code path:
ObBuildIndexScheduleTask::process -> check_partition_need_build_index
-> wait_trans_end -> check_trans_end -> ObPartitionService::check_schema_version_elapsed
-> report_trans_status
-> wait_snapshot_ready -> get_snapshot_version
-> check_rs_snapshot_elapsed -> ObTsMgr::wait_gts_elapse
-> ObPartitionService::check_ctx_create_timestamp_elapsed
-> choose_build_index_replica -> get_candidate_source_replica
-> check_need_choose_replica
-> ObIndexTaskTableOperator::generate_new_build_index_record
-> wait_choose_or_build_index_end -> get_candidate_source_replica
-> check_need_schedule_dag
-> schedule_dag -> ObPartitionStorage::get_build_index_param
-> ObPartitionStorage::get_build_index_context
-> ObBuildIndexDag::init
-> alloc_index_prepare_task -> ObIndexPrepareTask::init
-> ObIDag::add_task
-> ObDagScheduler::add_dag
-> copy_build_index_data -> send_copy_replica_rpc
-> ObPartitionService::check_single_replica_major_sstable_exist
-> unique_index_checking -> ObUniqueCheckingDag::init
-> ObUniqueCheckingDag::alloc_local_index_task_callback
-> ObUniqueCheckingDag::alloc_unique_checking_prepare_task -> ObUniqueCheckingPrepareTask::init
-> ObIDag::add_task
-> ObDagScheduler::add_dag
-> wait_report_status -> check_all_replica_report_build_index_end
The overall process of building a local index is similar to that of a global index: after the relevant transactions end and the snapshot point is obtained, one replica is selected to build the baseline data; after the single replica is built, the baseline data is copied to the other replicas; a uniqueness check is performed; and finally the index takes effect. The baseline data construction is completed through ObBuildIndexDag, and the uniqueness check through ObUniqueCheckingDag.
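The dag-scheduling pattern that appears repeatedly in the code path above can be sketched like this (simplified stand-ins for the real ObIDag/ObDagScheduler interfaces):

#include <memory>
#include <vector>

// Simplified stand-ins: a dag owns tasks; the scheduler runs dags on its
// worker threads in dependency order.
struct ObITask {
  virtual ~ObITask() = default;
  virtual int process() = 0;
};

struct ObIDag {
  std::vector<std::unique_ptr<ObITask>> tasks;
  int add_task(std::unique_ptr<ObITask> task) {
    tasks.push_back(std::move(task));
    return 0;
  }
};

struct ObDagScheduler {
  int add_dag(std::unique_ptr<ObIDag> dag);  // declaration only in this sketch
};

// schedule_dag, conceptually: init the dag, hang the prepare task on it,
// then hand the dag to the scheduler, whose workers drive the remaining tasks.
int schedule_build_index_dag(ObDagScheduler &scheduler,
                             std::unique_ptr<ObITask> prepare_task) {
  auto dag = std::make_unique<ObIDag>();  // ObBuildIndexDag in the source
  if (dag->add_task(std::move(prepare_task)) != 0) {
    return -1;
  }
  return scheduler.add_dag(std::move(dag));
}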