An Interpretation of PolarDB-X Source Codes (9): Life of DDL

Overview

After an SQL statement enters the CN of PolarDB-X, it will go through the complete processing of the protocol layer, optimizer, and executor. First, after parsing, authentication, and verification, it is parsed into a relational algebra tree. Then, an execution plan is generated through RBO and CBO in the optimizer, and finally, the execution is completed on the DN. It is different from DML; logical DDL statements involve the read and write of metadata and physical DDL statements, which directly affect the consistency of system states.

The key goals of PolarDB-X DDL implementation are the online and crash safe of DDL, which means the concurrency between DDL and DML and the atomicity and durability of DDL itself. The idea of PolarDB-X following online schema change introduces dual-version metadata in CN for complex logical DDL (such as adding and subtracting global secondary indexes and migrating partition data). DDL only occupies MDL locks when emptying low-version transactions, which reduces the frequency of blocking business SQL. The DDL engine unifies the definition method of logical DDL and the scheduling execution process in the executor module. Developers only need to encapsulate physical interfaces as Tasks with timing dependencies and crash recovery methods and pass them to the DDL engine, which guarantees the corresponding execution and rollback logic.

This article mainly explains the implementation of PolarDB-X DDL in the compute node (CN). For simplicity, we will skip the pre-processing of DDL in the protocol layer and optimizer and the dispatch of the Handler to the executor. For this part, please refer to the following article on source code interpretation:

We will focus on the execution process of DDL in the executor. Before reading this article, let’s review some knowledge related to PolarDB-X online schema change, metadata locking, and DDL engine principles:

Overall Process

1. A logical DDL statement is parsed and then entered into the optimizer. After a simple type conversion, a logicalPlan and an executionContext are generated. The executor assigns the logicalPlan to the corresponding DDLHandler based on the type of the logicalPlan. The public base class of DDLHandler is LogicalCommonDdlHandler. Please see the com.alibaba.polardbx.executor.handler.ddl.LogicalCommonDdlHandler#handle for its public execution entry.

    public Cursor handle(RelNode logicalPlan, ExecutionContext executionContext) {
        BaseDdlOperation logicalDdlPlan = (BaseDdlOperation) logicalPlan;

        initDdlContext(logicalDdlPlan, executionContext);

        // Validate the plan first and then return immediately if needed.
        boolean returnImmediately = validatePlan(logicalDdlPlan, executionContext);

        .....
        setPartitionDbIndexAndPhyTable(logicalDdlPlan);
        // Build a specific DDL job by subclass that override buildDdlJob
        DdlJob ddlJob = returnImmediately?
            new TransientDdlJob():
            // @override
            buildDdlJob(logicalDdlPlan, executionContext);

        // Validate the DDL job before request.
        // @override
        validateJob(logicalDdlPlan, ddlJob, executionContext);

        // Handle the client DDL request on the worker side.
        handleDdlRequest(ddlJob, executionContext);
        .....
        return buildResultCursor(logicalDdlPlan, executionContext);
    }

2. Common context information (such as traceId, transaction ID, ddl type, and global configuration) is first stripped from the executionContext in the handle method. Then, the validateJob method is used to verify the correctness of DDL. The buildDdlJob method constructs DDL tasks. For example, AlterTableHandler, a DDL Handler overloads these two methods. Developers define and verify the DDL job based on the semantics of the DDL.

3. At this point, DDL is converted from logicalPlan to a DDL job in the Handler. The com.alibaba.polardbx.executor.handler.ddl.LogicalCommonDdlHandler#handle DdlRequest method forwards the DDL request corresponding to the DDL job to the DDL engine on the Leader CN node for scheduling execution.

4. As a daemon process, the DDL engine only runs on the Leader CN node and polls DDLJobQueue. If the queue is not empty, the scheduling and execution process of the DDL Job is triggered. The DDL engine maintains two-level queues, which are instance-level and database-level DDL job queues. This ensures that DDL statements are concurrently executed between logical databases without interference.

It is worth noting that during the process of the DDL Requester pushing the DDL request to the DDL engine, there is an inter-node communication from the Worker Node to the Leader Node. This process depends on the method provided by com.alibaba.polardbx.gms.sync.GmsSyncManagerHelper#sync. The node that initiates communication encapsulates syncAction as information in sync, and the receiver executes corresponding actions according to syncAction after listening to the occurrence of sync. This inter-node communication mechanism exists in the process of DDL engine loading DDL jobs and in the process of metadata synchronization between points when DDL jobs are executed (please see the upcoming article). In both scenarios, the encapsulated syncAction triggers the receiver to pull the communication content (DDL job or metadata) from MetaDB. This communication mechanism uses database services as communication intermediaries to ensure high availability and consistency of communication content delivery.

DDL job is the core concept of the DDL execution engine. It is a DDL transaction simulated at the DDL engine level. It attempts to implement atomicity and persistence in logical DDL composed of a series of heterogeneous interfaces (such as metadata modification, inter-node communication, physical DDL, and DML) to ensure the stability of the system state. Based on the common transaction implementation, transaction elements (such as undolog, lock, WAL, and version snapshot) are implemented in parallel in DDL jobs and DDL engines.

The DDL engine entry (the preceding handleDdlRequest method) is used as the demarcation point of DDL data traffic, dividing the lifecycle of DDL into definition and execution. Next, we will use adding a global secondary index as an example to explain the key logic of building DDL atomicity and the collaboration process between DDL and DML when defining DDL jobs. The scheduling execution and error handling mechanism of DDL in the DDL engine will be explained in the next section.

The SQL Statements Involved in This Article

create database db1 mode = "auto";

use db1;

create table t1(x int, y int);

The DDL Statements Explained in Detail in This Article

alter table t1 add global index `g_i_y` (`y`) COVERING (`x`) partition by hash(`y`);

Metadata Management

As mentioned earlier, except for the physical DDL and DML executed on the DN, the key interfaces included in DDL Job are modification and synchronization of metadata.

The metadata in the PolarDB-X is managed by GMS uniformly and can be divided into two parts by physical location:

Persistent Metadata in MetaDB: For example, the metadata table that describes partition table t1 includes table_partitions, tables, columns, and indexes. The read and write interfaces of metadata in MetaDB are distributed to auxiliary classes in the polardbx-gms/src/main/java/com/alibaba/polardbx/gms path.
The Memory Metadata of the CN Cache: Due to performance considerations, the CN node caches metadata partition table t1 in memory. This information is metadata used by DML transactions on the CN. The com.alibaba.polardbx.executor.gms implements cache management of different types of metadata and its interface is defined as com.alibaba.polardbx.optimizer.config.table.SchemaManager, including init, getTable, reload, and other basic methods.

    private ResultCursor executeQuery(ByteString sql, ExecutionContext executionContext,
                                      AtomicBoolean trxPolicyModified) {

        // Get all meta version before optimization
        final long[] metaVersions = MdlContext.snapshotMetaVersions();

        // Planner
        ExecutionPlan plan = Planner.getInstance().plan(sql, executionContext);

        ...
        //for requireMDL transaction
        if (requireMdl && enableMdl) {
            if (!isClosed()) {
                // Acquire meta data lock for each statement modifies table data
                acquireTransactionalMdl(sql.toString(), plan, executionContext);
            }
            ...
            //update Plan if metaVersion changed, which indicate meta updated
        }
        ...

        ResultCursor resultCursor = executor.execute(plan, executionContext);
        ...
        return resultCursor;
    }

It is worth noting that the memory meta-information builds a lock based on a logical table. This lock is a memory MDL lock. When a DML transaction starts, the acquireTransactionalMdl attempts to obtain the MDL lock of the current latest version of the relevant logical table and releases it after the execution is completed.

When a DDL Job is executed, the metadata of the preceding two parts is modified according to a certain time series. Synchronization between DDL and DML is a key step to implement the online feature of DDL in this process.

The Definition of DDL Job

The DDL Job Handler for adding the global secondary index is AlterTableHandler, which overloads the buildDdlJob method (com.alibaba.polardbx.executor.handler.ddl.LogicalAlterTableHandler#buildDdlJob) of LogicalCommonDdlHandler. This method is eventually dispatched to the DDL Job corresponding to the construction global secondary index (com.alibaba.polardbx.executor.ddl.job.factory.gsi.CreatePartitionGsiJobFactory).

public class CreatePartitionGsiJobFactory extends CreateGsiJobFactory{
    @Override
    protected void excludeResources(Set<String> resources) {
        super.excludeResources(resources);
        //meta data lock in MetaDB
        resources.add(concatWithDot(schemaName, primaryTableName)); //db1.t1
        resources.add(concatWithDot(schemaName, indexTableName)); //db1.g_i_y
        ...
    }    

    @Override
    protected ExecutableDdlJob doCreate() {
      ...
        if (needOnlineSchemaChange) {
            bringUpGsi = GsiTaskFactory.addGlobalIndexTasks(
                schemaName,
                primaryTableName,
                indexTableName,
                stayAtDeleteOnly,
                stayAtWriteOnly,
                stayAtBackFill
            );
        }
      ...
        List<DdlTask> taskList = new ArrayList<>();
        //1. validate
        taskList.add(validateTask);
        //2. create gsi table
        //2.1 insert tablePartition meta for gsi table
        taskList.add(createTableAddTablesPartitionInfoMetaTask);
        //2.2 create gsi physical table
        CreateGsiPhyDdlTask createGsiPhyDdlTask =
            new CreateGsiPhyDdlTask(schemaName, primaryTableName, indexTableName, physicalPlanData);
        taskList.add(createGsiPhyDdlTask);
        //2.3 insert tables meta for gsi table
        taskList.add(addTablesMetaTask);
        taskList.add(showTableMetaTask);
        //3. 
        //3.1 insert indexes meta for primary table
        taskList.add(addIndexMetaTask);
        //3.2 gsi status: CREATING -> DELETE_ONLY -> WRITE_ONLY -> WRITE_REORG -> PUBLIC
        taskList.addAll(bringUpGsi);
        //last tableSyncTask
        DdlTask tableSyncTask = new TableSyncTask(schemaName, indexTableName);
        taskList.add(tableSyncTask);

        final ExecutableDdlJob4CreatePartitionGsi result = new ExecutableDdlJob4CreatePartitionGsi();
        result.addSequentialTasks(taskList);
        ....

        return result;
    }
    ...
}

The excludeResources method declares the occupation of persistent metadata locks for related objects by ddlJob. In this example, the metadata of the primary table and the GSI table is modified. Therefore, the locked objects include db1.t1 and db1.g_i_y. Note: This lock is different from the aforementioned memory MDL lock. The former is persisted to MetaDB's read_write_lock table during the initiation when the DDL engine executing DDL Jobs to control the concurrency between DDL.

The doCreate method declares a series of Tasks that are executed in time series. Its semantics are listed below:

Verification
- validateTask
- DDL Job validation
  Checks the validity of the GSI table name
Create a GSI Table
- createTableAddTablesPartitionInfoMetaTask
  refers to GSI partition metadata writing
- createGsiPhyDdlTask
  refers to the creation of the physical DDL of the physical table corresponding to the GSI
- addTablesMetaTask
  refers to the metadata modification of the primary table
- showTableMetaTask
  refers to the metadata of the primary table broadcasting between nodes
Sync Metadata of GSI Tables
- addIndexMetaTask
  refers to the modification of tableMeta metadata of a GSI table
- bringUpGsi
  refers to the online schema change process after adding an index
- tableSyncTask
  refers to the metadata of the GSI table broadcasting between nodes

In the Tasks above, the write operation of persistent meta-information is applied to the corresponding meta-information table in MetaDB, and the broadcast operation of meta-information is applied to the meta-information cached by CN.

Next, we briefly introduce three typical Tasks, of which methods included in CreateTableAddTablesMetaTask are listed below:

public class CreateTableAddTablesMetaTask extends BaseGmsTask {
    @Override
    public void executeImpl(Connection metaDbConnection, ExecutionContext executionContext) {
        PhyInfoSchemaContext phyInfoSchemaContext = TableMetaChanger.buildPhyInfoSchemaContext(schemaName,
            logicalTableName, dbIndex, phyTableName, sequenceBean, tablesExtRecord, partitioned, ifNotExists, sqlKind,
            executionContext);
        FailPoint.injectRandomExceptionFromHint(executionContext);
        FailPoint.injectRandomSuspendFromHint(executionContext);
        TableMetaChanger.addTableMeta(metaDbConnection, phyInfoSchemaContext);
    }

    @Override
    public void rollbackImpl(Connection metaDbConnection, ExecutionContext executionContext) {
        TableMetaChanger.removeTableMeta(metaDbConnection, schemaName, logicalTableName, false, executionContext);
    }

    @Override
    protected void onRollbackSuccess(ExecutionContext executionContext) {
        TableMetaChanger.afterRemovingTableMeta(schemaName, logicalTableName);
    }
}

1. addTableMetaTask

Refers to adding the metadata related to the GSI table to the tableMeta of the primary table. The executeImpl calls the metadata read/write interface (com.alibaba.polardbx.executor.ddl.job.meta.TableMetaChanger#addTableMeta) provided by MetaDB to write metadata to a GSI table. In addition to the core logic corresponding to executeImpl, addTableMetaTask provides the rollbackImpl to clear the metadata of the GSI table. This is the **undolog88 corresponding to the Task. We can call this method to restore the original tableMeta information when a DDL job is rolled back.

2. TableSyncTask

As a special case of the preceding inter-node communication mechanism, it synchronizes tableMeta metadata among CN nodes and notifies all CN nodes in the cluster to load the metadata of tableMeta from MetaDB.

3. bringUpGsi

It is a task list composed of a series of tasks. It defines the evolution of metadata and the data backfilling process concerning online schema change.

In addition to the preceding tasks, the meta information read/write and physical DDL and DML Tasks commonly used in logical DDL have been predefined.

polardbx-executor/src/main/java/com/alibaba/polardbx/executor/ddl/job/task/basic

Readers can find them in this directory.

At the end of the doCreate method, addSequentialTasks add tasks in batches to the created DDLJob. As the simplest task combination operation, addSequentialTask constructs dependencies between tasks based on the subscript order of the elements in the parameter taskList as the topological order. This method indicates DDL Tasks are combined by DAG in DDLJob. The DDL engine schedules DDL Tasks based on the topological order of the DAG. When declaring complex dependencies, we can also use methods (such as addTask and addTaskRelationShip) to declare tasks and their dependency order separately.

The Synchronization of DDL and DML

The bringUpGsi in the previous section is a complete implementation of the online-schema-change process, defined in:

com.alibaba.polardbx.executor.ddl.job.task.factory.GsiTaskFactory#addGlobalIndexTasks

We use this method as an example to describe several key components of DDL and DML synchronization.

•    public static List<DdlTask> addGlobalIndexTasks(String schemaName,
                                                    String primaryTableName,
                                                    String indexName,
                                                    boolean stayAtDeleteOnly,
                                                    boolean stayAtWriteOnly,
                                                    boolean stayAtBackFill) {
        ....
        DdlTask writeOnlyTask = new GsiUpdateIndexStatusTask(
            schemaName,
            primaryTableName,
            indexName,
            IndexStatus.DELETE_ONLY,
            IndexStatus.WRITE_ONLY
        ).onExceptionTryRecoveryThenRollback();
        ....
        taskList.add(deleteOnlyTask);
        taskList.add(new TableSyncTask(schemaName, primaryTableName));
        ....
        taskList.add(writeOnlyTask);
        taskList.add(new TableSyncTask(schemaName, primaryTableName));
        ...
        taskList.add(new LogicalTableBackFillTask(schemaName, primaryTableName, indexName));
        ...
        taskList.add(writeReOrgTask);
        taskList.add(new TableSyncTask(schemaName, primaryTableName));
        taskList.add(publicTask);
        taskList.add(new TableSyncTask(schemaName, primaryTableName));
        return taskList;
    }

The Multi-Version Evolution Process of Metadata: The metadata of an index goes through a complete evolution of version from CREATING → DELETE_ONLY → WRITE_ONLY → WRITE_REORG → PUBLIC. The design process ensures that metadata will be consistent when there are two versions in the system.
The Rewriting of DML in the Metadata Corresponds to the CN Cache: The DML statement that enters the executor will assign the corresponding rewriter to generate DML based on the index state during metadata version evolution. This meets the metadata state constraints (such as write_only and update_only) and implements DML double write during data backfilling.
Synchronize the Timing between the Metadata of MetaDB and CN Cache: The sync operation modifies the metadata of the preceding two parts according to a certain time sequence. On the one hand, it ensures the availability of the metadata of the corresponding version during DML transactions. Meanwhile, it preempts the MDL lock and invalidates the original information after the corresponding DML transaction ends. On the other hand, it ensures that other DML transactions can always obtain the available metadata at the same time by loading the new version of the metadata in advance.

The Rewriting of DML

Please see the section below for the state definition in the index metadata:

com.alibaba.polardbx.gms.metadb.table.IndexStatus.

While DDL updates this field, the optimizer assigns the corresponding gsiWriter in RBO for the DML statement that modifies the primary table, taking a simple Insert statement:

insert into t1 values(1,2)

For example:

1. In the corresponding RBO phase, insert statements polardbx-optimizer/src/main/java/com/alibaba/polardbx/optimizer/core/planner/rule/OptimizeLogicalInsertRule.java is associated with the writer of the corresponding GSI in the execution plan according to gsiMeta in the metadata of the main table.

//OptimizeLogicalInsertRule.java
private LogicalInsert handlePushdown(LogicalInsert origin, boolean deterministicPushdown, ExecutionContext ec){
      ...//other writers
        final List<InsertWriter> gsiInsertWriters = new ArrayList<>();
    IntStream.range(0, gsiMetas.size()).forEach(i -> {
        final TableMeta gsiMeta = gsiMetas.get(i);
        final RelOptTable gsiTable = catalog.getTableForMember(ImmutableList.of(schema, gsiMeta.getTableName()));
        final List<Integer> gsiValuePermute = gsiColumnMappings.get(i);
        final boolean isGsiBroadcast = TableTopologyUtil.isBroadcast(gsiMeta);
        final boolean isGsiSingle = TableTopologyUtil.isSingle(gsiMeta);
        //different write Strategy for corresponding table type.
        gsiInsertWriters.add(WriterFactory
                             .createInsertOrReplaceWriter(newInsert, gsiTable, sourceRowType, gsiValuePermute, gsiMeta, gsiKeywords,
                                                          null, isReplace, isGsiBroadcast, isGsiSingle, isValueSource, ec));
        });
   ...
}

2. The writer applies to the physical execution plan in the Handler of the executor. The getInput method determines whether to generate physical writes to a GSI by DML based on the state information of indexStatus contained in gsiWriter. Then, DML enables double writes to a GSI only after the index is updated to WRITE_ONLY.

//LogicalInsertWriter.java    
protected int executeInsert(LogicalInsert logicalInsert, ExecutionContext executionContext,
                                HandlerParams handlerParams) {
        ...
       final List<InsertWriter> gsiWriters = logicalInsert.getGsiInsertWriters();
       gsiWriters.stream()
                .map(gsiWriter -> gsiWriter.getInput(executionContext))
                .filter(w -> !w.isEmpty())
                .forEach(w -> {
                    writableGsiCount.incrementAndGet();
                    allPhyPlan.addAll(w);
                });

//IndexStatus.java
...
public static final EnumSet<IndexStatus> WRITABLE = EnumSet.of(WRITE_ONLY, WRITE_REORG, PUBLIC, DROP_WRITE_ONLY);

public boolean isWritable() {
   return WRITABLE.contains(this);
}
...

Synchronizing Time Series of Metadata

When synchronizing metadata of MetaDB and memory through sync, follow these steps:

loadSchema: preloads the new metadata in MetaDB to the memory. After loading, the new version metadata and its MDL lock are added to the memory. Then, DML (node1.thread2) entering CN will obtain the MDL of the new version metadata.
mdl(v0).writeLock: attempts to obtain the MDL lock (node1.mdl.v0) of the old version metadata. When the lock is obtained, the transaction of the old version is emptied.
expireSchemaManager(t1, g_i1, v0): removes old version metadata and marks the new version as an old one.

The responders of the sync calls above are all nodes in the CN cluster (including the caller itself). When all nodes complete the metadata version switching, the sync call is successful. As such, the following points are guaranteed:

At most two versions of metadata exist in the entire cluster at any time, and at least one version of metadata can be added with MDL read locks. Therefore, MDL statements that enter the CN will never be blocked.
The old version metadata is no longer associated with MDL transactions after the loadSchema is used. This ensures a limited time for state switching.

Among them, steps 1, 2, and 3 are implemented below:

com.alibaba.polardbx.executor.gms.GmsTableMetaManager#tonewversion

   public void tonewversion(String tableName, boolean preemptive, Long initWait, Long interval, TimeUnit timeUnit) {
        synchronized (OptimizerContext.getContext(schemaName)) {
            GmsTableMetaManager oldSchemaManager =
                (GmsTableMetaManager) OptimizerContext.getContext(schemaName).getLatestSchemaManager();
            TableMeta currentMeta = oldSchemaManager.getTableWithNull(tableName);

            long version = -1;


            ... // Query the metadata version in the current MetaDB and assign it to vesion.

            //1. loadSchema
            SchemaManager newSchemaManager =
                new GmsTableMetaManager(oldSchemaManager, tableName, rule);
            newSchemaManager.init();

            OptimizerContext.getContext(schemaName).setSchemaManager(newSchemaManager);

            //2. mdl(v0).writeLock
            final MdlContext context;
            if (preemptive) {
                context = MdlManager.addContext(schemaName, initWait, interval, timeUnit);
            } else {
                context = MdlManager.addContext(schemaName, false);
            }

            MdlTicket ticket = context.acquireLock(new MdlRequest(1L,
                    MdlKey
                        .getTableKeyWithLowerTableName(schemaName, currentMeta.getDigest()),
                    MdlType.MDL_EXCLUSIVE,
                    MdlDuration.MDL_TRANSACTION));

            //3. expireSchemaManager(t1, g_i1, v0)
            oldSchemaManager.expire();


            .... // The PlanCache that uses the old version of the metadata is invalid.

            context.releaseLock(1L, ticket);
        } 
    }

DDL Jobs are synchronized between DDL and DML when they are defined using the preceding elements. This ensures the online execution of DML.

Summary

This article mainly interprets the code related to the DDL Job definition on the CN side in PolarDB-X, takes adding a global secondary index as an example, sorts the overall process of DDL Job definition and execution, and focuses on the key logic related to online and crash safe features in DDL job definition. Please stay tuned for more information about the DDL Job execution process in the DDL engine.

Community

An Interpretation of PolarDB-X Source Codes (9): Life of DDL

Overview

Overall Process

The SQL Statements Involved in This Article

The DDL Statements Explained in Detail in This Article

Metadata Management

The Definition of DDL Job

The Synchronization of DDL and DML

The Rewriting of DML

Synchronizing Time Series of Metadata

Summary

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

Managed Service for Prometheus

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL