Implement environment isolation

Updated at: 2024-11-15 02:57

When you use DataWorks to perform big data development operations, you can isolate environments such as development, testing, and production. If you use DataWorks together with other Alibaba Cloud services, you can also configure the required environment settings and isolate environments based on your business requirements. This topic describes how to implement environment isolation when you use DataWorks together with Data Lake Formation (DLF), Object Storage Service (OSS), and E-MapReduce (EMR).

Background information

Enterprise users may have requirements for creating and isolating different environments, such as development, testing, and production environments, during big data development. After environment isolation is implemented, the physical storage paths of data, the compute engines on which nodes are run, and big data development scripts in different environments are isolated from each other. In addition, strict permission management is imposed on personnel who perform operations in different environments. For example, O&M personnel can use the production environment, and developers can use only the development environment.

In this example, DataWorks is used together with DLF, OSS, and EMR for big data development, and the development environment and production environment are isolated.

  • DataWorks is used to manage the development, O&M, and scheduling of big data jobs.

  • Two EMR clusters are separately used for the development environment and production environment.

  • OSS is used to store actual data.

  • DLF is used to store and manage metadata.

Environment isolation for DLF

  1. Log on to the DLF console. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, click the Catalog List tab. On the Catalog List tab, click New Catalog to create two catalogs named dev and prod. The dev catalog is used to store the metadata in the development environment, and the prod catalog is used to store the metadata in the production environment. When you create the catalogs, specify different values for the Location parameter.

    For more information, see Data Catalog.

  2. Find the dev catalog and prod catalog and separately create a database in the catalogs. We recommend that you specify the same name and different OSS paths for the databases. This can facilitate subsequent data migration.

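The layout described in the two steps above can be checked with a short script before any data is written. The following Python snippet is a minimal sketch, not part of DLF: the `CatalogLayout` class, the `isolation_problems` function, and the `oss://example-bucket/...` paths are hypothetical names introduced for illustration. It assumes you record each catalog's Location value and database paths by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CatalogLayout:
    catalog_id: str            # DLF catalog name, "dev" or "prod" in this topic
    location: str              # value of the catalog's Location parameter
    databases: Dict[str, str]  # database name -> OSS path of the database


def _overlaps(path_a: str, path_b: str) -> bool:
    """Return True if the paths are equal or one is nested inside the other."""
    a = path_a.rstrip("/") + "/"
    b = path_b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)


def isolation_problems(dev: CatalogLayout, prod: CatalogLayout) -> List[str]:
    """List every way the two layouts break the recommendations above."""
    problems = []
    if _overlaps(dev.location, prod.location):
        problems.append("catalog locations overlap")
    # Same database names in both catalogs are recommended, not required.
    if set(dev.databases) != set(prod.databases):
        problems.append("database names differ between catalogs")
    for name in sorted(set(dev.databases) & set(prod.databases)):
        if _overlaps(dev.databases[name], prod.databases[name]):
            problems.append(f"database {name}: OSS paths overlap")
    return problems


# Hypothetical bucket and paths, for illustration only.
dev = CatalogLayout("dev", "oss://example-bucket/dev",
                    {"db1": "oss://example-bucket/dev/db1"})
prod = CatalogLayout("prod", "oss://example-bucket/prod",
                     {"db1": "oss://example-bucket/prod/db1"})
print(isolation_problems(dev, prod))  # prints []
```

An empty list means that the two catalogs use the same database names and that no storage path in one environment is shared with or nested inside a path in the other.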

Environment isolation for EMR

Log on to the EMR console and create two EMR clusters. Separately configure catalog information for engines in the two EMR clusters. Make sure that the engines in the EMR cluster for the development environment use the dev catalog and the engines in the EMR cluster for the production environment use the prod catalog.

In this example, the Hive engine is used. For the Hive engine in the EMR cluster for the development environment, dlf.catalog.id is set to dev, as shown in the following figure. For more information, see Manage configuration items.

Important
  • This section describes only how to configure catalog information for the Hive engine in the EMR cluster for the development environment. You must configure catalog information for all types of engines in the two EMR clusters.

  • After you configure catalog information for the two EMR clusters, you must issue the configurations and restart all engines to make the configurations take effect.
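Because every engine in a cluster must point to the same catalog, a quick consistency check can catch an engine that was missed. The following Python snippet is a minimal sketch under stated assumptions: it does not call any EMR API, the configuration dictionaries are entered by hand, the engine names `spark` and `presto` are hypothetical, and it assumes that each engine exposes the same `dlf.catalog.id` key that this topic shows for Hive. Check the actual key name for each engine in the EMR console.

```python
from typing import Dict, List

CATALOG_KEY = "dlf.catalog.id"  # key shown for the Hive engine in this topic


def engines_with_wrong_catalog(
    engine_configs: Dict[str, Dict[str, str]], expected_catalog: str
) -> List[str]:
    """Return the engines whose catalog setting is missing or not the expected one."""
    return sorted(
        engine
        for engine, config in engine_configs.items()
        if config.get(CATALOG_KEY) != expected_catalog
    )


# Hand-entered settings of the development cluster (illustrative values).
dev_cluster = {
    "hive": {"dlf.catalog.id": "dev"},
    "spark": {"dlf.catalog.id": "prod"},  # misconfigured on purpose
    "presto": {},                         # catalog not configured yet
}
print(engines_with_wrong_catalog(dev_cluster, "dev"))  # prints ['presto', 'spark']
```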

Environment isolation for DataWorks

Environment isolation for DataWorks workspaces in basic mode

  1. Log on to the DataWorks console and create two workspaces in basic mode. Use one workspace as the development environment and associate the EMR cluster for the development environment with the workspace. Use the other as the production environment and associate the EMR cluster for the production environment with the workspace. For more information about how to create a workspace, see Create a workspace.

  2. In the workspace that is used as the development environment, create a task, configure scheduling properties for the task, and create a table by executing an SQL statement on the Data Studio page.

    The following code provides an example of the statement that you can use to create a table:

    CREATE TABLE IF NOT EXISTS db1.table1 (id INT, name STRING);

    Note

    The databases created in the DLF catalogs already declare their OSS paths, so you do not need to specify a path in the CREATE TABLE statement.

  3. Create a workflow in the workspace that is used as the development environment and use the cross-project cloning feature to deploy the workflow to the workspace that is used as the production environment.

    To deploy the workflow, you must select the workflow and configure workflow settings, such as compute engine mappings and resource groups. For more information about the cross-project cloning feature, see Overview. After the workflow is deployed, you can view the created task in the workspace that is used as the production environment. You can modify, test, and deploy the task based on your business requirements in the production environment.
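The steps above rely on one property: the same CREATE TABLE statement writes to different storage depending on the workspace in which it runs. The following Python snippet is a minimal sketch of that resolution. The workspace names `ws_dev` and `ws_prod`, the `oss://example-bucket/...` paths, and the `table_location` function are hypothetical. The snippet also assumes the usual Hive behavior of placing a table that has no LOCATION clause in a subdirectory of its database path.

```python
# Database paths registered in each DLF catalog (illustrative values).
CATALOGS = {
    "dev": {"db1": "oss://example-bucket/dev/db1"},
    "prod": {"db1": "oss://example-bucket/prod/db1"},
}

# Each basic-mode workspace is associated with one EMR cluster,
# and each cluster is configured with one catalog.
WORKSPACE_CATALOG = {"ws_dev": "dev", "ws_prod": "prod"}


def table_location(workspace: str, qualified_name: str) -> str:
    """Resolve where a table named database.table is stored for a workspace."""
    database, table = qualified_name.split(".", 1)
    db_path = CATALOGS[WORKSPACE_CATALOG[workspace]][database]
    return f"{db_path.rstrip('/')}/{table.lower()}"


print(table_location("ws_dev", "db1.table1"))   # prints oss://example-bucket/dev/db1/table1
print(table_location("ws_prod", "db1.table1"))  # prints oss://example-bucket/prod/db1/table1
```

Because the statement itself contains no path, the cloned task needs no code change in the production workspace.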

Environment isolation for DataWorks workspaces in standard mode

For a workspace in standard mode, you can create a workspace-level parameter on the Workspace-level Parameters tab to enable access to different EMR databases in the development and production environments.

  1. Separately specify the names of EMR databases in the development and production environments as values of the workspace-level parameter, and then assign the workspace-level parameter to a variable in the task code. This allows the task to separately access the EMR databases in the development and production environments when the code is run.

    1. Define a workspace-level parameter.

      1. Go to the Operation Center page. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

      2. In the left-side navigation pane of the Operation Center page, click Tenant Schedule Setting. On the page that appears, click the Workspace-level Parameters tab. In the upper-right corner of the Workspace-level Parameters tab, click Create Workspace-level Parameter. In the Create Workspace-level Parameter dialog box, configure the parameters described in the following table.

        | Parameter | Sample value |
        | --- | --- |
        | Parameter Name | Specify a name based on your business requirements. In this example, set the value to emr_env. |
        | Owner | Select a user from the drop-down list. |
        | Workspace | Select a workspace from the drop-down list. |
        | Parameter Type | Select Constant (Plaintext). |
        | Parameter Value (Development Environment) | Specify the name of the EMR database in the development environment. In this example, set the value to emr_dev. |
        | Parameter Value (Production Environment) | Specify the name of the EMR database in the production environment. In this example, set the value to emr_prod. |
        | Description | Specify the description of the workspace-level parameter. |

    2. Assign the workspace-level parameter to a variable in the task code so that the task accesses the correct EMR database in each environment. When you develop a task and configure scheduling properties for the task in Data Studio, click Properties in the right-side navigation pane of the configuration tab of the task. In the Scheduling Parameter section, set the emr_db parameter to the name of the workspace-level parameter. The following figure shows how to perform the operations.

      image

    3. Separately access the EMR databases in the development and production environments. If you deploy a task to the development environment, emr_dev is automatically assigned to the variable as a value. If you deploy a task to the production environment, emr_prod is automatically assigned to the variable as a value.
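The resolution chain in the preceding steps (variable in the task code, scheduling parameter emr_db, workspace-level parameter emr_env, per-environment value) can be illustrated with a small simulation. The following Python snippet is a sketch only: DataWorks performs the substitution itself when the task runs, and the `${emr_db}` placeholder style, the sample query, and the `render_task_sql` function are assumptions made for illustration.

```python
from string import Template

# Workspace-level parameter emr_env with one value per environment (from the table above).
WORKSPACE_PARAMETERS = {"emr_env": {"dev": "emr_dev", "prod": "emr_prod"}}

# Scheduling parameter of the task: the variable emr_db refers to emr_env.
SCHEDULING_PARAMETERS = {"emr_db": "emr_env"}

# Task code that references the variable (hypothetical query).
TASK_SQL = "SELECT id, name FROM ${emr_db}.table1;"


def render_task_sql(sql: str, environment: str) -> str:
    """Replace each variable with the value its workspace-level parameter has in the environment."""
    values = {
        variable: WORKSPACE_PARAMETERS[parameter][environment]
        for variable, parameter in SCHEDULING_PARAMETERS.items()
    }
    return Template(sql).substitute(values)


print(render_task_sql(TASK_SQL, "dev"))   # prints SELECT id, name FROM emr_dev.table1;
print(render_task_sql(TASK_SQL, "prod"))  # prints SELECT id, name FROM emr_prod.table1;
```

The task code stays identical in both environments. Only the value bound to the workspace-level parameter changes.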
