
Custom Spark images

Last Updated: Nov 07, 2024

If the default image of AnalyticDB for MySQL Spark cannot meet your business requirements, you can add the software packages and dependencies required for Spark jobs to the default image to create a custom image and then publish the custom image to Container Registry. When you develop AnalyticDB for MySQL Spark jobs, you can specify the custom image as the execution environment.

Background information

The default image of AnalyticDB for MySQL Spark may fail to meet your business requirements in the following scenarios:

  • Before you execute Spark jobs, a complex initialization operation is required. For example, you must download large model files to a specific local directory, or install or upgrade specific Linux development packages.

  • You use Python to perform machine learning or data mining, but the C and C++ library versions that your code depends on are incompatible with the default image, and virtualenv cannot be used to upload the Python environment.

  • You must configure a custom Spark kernel or use features from a preview version of the Spark kernel.

To resolve the preceding issues, you can use the custom image feature of AnalyticDB for MySQL Spark to replace the default Spark image with a custom image.

Prerequisites

An Elastic Compute Service (ECS) instance is created. In this example, the ECS instance uses the Alibaba Cloud Linux 3.2104 LTS 64-bit OS. For more information, see Create a subscription ECS instance on the Quick Launch tab.
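
If Docker is not yet installed on the ECS instance, install and start it first. The following commands are a sketch for Alibaba Cloud Linux 3; the package name may differ (for example, docker-ce) depending on the repositories configured on your instance:

    # Install and start Docker (package name is an assumption for Alibaba Cloud Linux 3).
    sudo yum install -y docker
    sudo systemctl start docker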

Procedure

  1. Create a Container Registry instance. Create a namespace and an image repository for the instance. For more information, see Create a Container Registry Enterprise Edition instance.

    Container Registry instances are available in Enterprise Edition and Personal Edition. We recommend that you create a Container Registry Enterprise Edition instance to achieve highly efficient distribution and higher security in data access and storage. In this example, an Enterprise Edition instance is created. For more information, see Differences between Personal Edition instances and Enterprise Edition instances.

    Note
    • After you create a custom image, the image is pushed to the Container Registry instance. Then, when you submit Spark jobs, AnalyticDB for MySQL Spark pulls the image from the instance.

    • You are charged when you create a Container Registry Enterprise Edition instance. For more information, see Billing of Container Registry Enterprise Edition instances.

  2. Add the virtual private cloud (VPC) and vSwitch of the ECS instance to the access control list (ACL) of the Container Registry instance to establish a connection between the two instances. For more information, see Configure a VPC ACL.

  3. Run the following command on the ECS instance to download the default image of AnalyticDB for MySQL Spark:

     docker pull registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:3.2.0.32

    You can run the docker images command to check whether the default image is downloaded to the ECS instance.
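
    For example (a sketch; the exact output columns depend on your Docker version):

    # List local images and filter for the default AnalyticDB Spark image.
    docker images | grep adb-spark-public-image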

    The default image is built on Anolis OS and uses the RPM Package Manager (RPM). The following table describes the preset directories of the default image of AnalyticDB for MySQL Spark. To ensure that the features of AnalyticDB for MySQL Spark work as expected, we recommend that you do not modify the content of these directories.

    | Directory | Content | Description |
    | --- | --- | --- |
    | /opt/spark | The installation directory of AnalyticDB for MySQL Spark. | /opt/spark/adb_jars provides the built-in Spark connectors. We recommend that you do not overwrite or delete this directory. |
    | /opt/intel | The installation directory of the native execution engine. | If you modify this directory, vectorization may fail. |
    | /opt/prometheus | The installation directory of Managed Service for Prometheus. | If you modify this directory, performance monitoring and diagnostics may fail. |
    | /opt/tools | The installation directory of debugging tools. | We recommend that you install performance tracing and troubleshooting tools in this directory. |
    | /opt/entrypoint.sh | The boot script file. | If you modify the boot script file, Java virtual machines (JVMs) may fail to start and resources may not be used efficiently. |
    | /usr/java/jdk1.8.0_321-amd64 | The installation directory of the Java Development Kit (JDK) that is specified by the JAVA_HOME environment variable. | To change the JDK, we recommend that you modify the JAVA_HOME environment variable to specify a new installation path. |
    | /usr/local/lib/python3.6 | The installation directory of third-party packages of the default Python environment. | To change the Python environment, we recommend that you modify the spark.pyspark.python parameter to specify a new installation path. Note: If you overwrite the default Python version of the Linux system, RPM may fail to respond. |
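
    For example, to switch the JDK as recommended above, repoint JAVA_HOME in your Dockerfile instead of overwriting the preset JDK directory. The following is a minimal sketch; the /usr/java/jdk-11 path is an illustrative assumption, and you must install that JDK into the image yourself:

    FROM registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:3.2.0.32
    # Hypothetical path: install or copy an alternative JDK to /usr/java/jdk-11 first.
    ENV JAVA_HOME=/usr/java/jdk-11
    ENV PATH="${JAVA_HOME}/bin:${PATH}"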

  4. Run the vim Dockerfile command to create a Dockerfile. Add the following content to the Dockerfile.

    In this example, Python 3.8 and the latest version of TensorFlow are installed on the custom image.

    FROM registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:3.2.0.32
    # Download and build Python 3.8. docker build runs as root, so sudo is not required.
    RUN wget https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz
    RUN tar xzf Python-3.8.12.tgz
    RUN cd Python-3.8.12 && ./configure --enable-optimizations && make altinstall
    # Install TensorFlow into the new Python environment.
    RUN /usr/local/bin/python3.8 -m pip install tensorflow
    ENV PYSPARK_PYTHON="/usr/local/bin/python3.8"
    # Use printf instead of echo so that the \n escape is interpreted consistently.
    RUN printf 'import sys\nprint(sys.version)\n' > /tmp/test.py

    For information about Dockerfiles, see Dockerfile reference.
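
    Before you push the image, you can build it locally and run the test script to verify the new Python environment. This is a sketch: the adb-spark-custom:test tag is a placeholder, and --entrypoint bypasses the image's boot script so that the command runs directly.

    # Build with a throwaway tag and print the Python version from the test script.
    docker build -t adb-spark-custom:test .
    docker run --rm --entrypoint /usr/local/bin/python3.8 adb-spark-custom:test /tmp/test.py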

  5. Build a custom image and push the image to the image repository that is created in Step 1.

    Syntax for building a custom image: docker build -t <Image repository endpoint>:<Custom image version> . The trailing dot specifies the current directory as the build context.

    Syntax for pushing a custom image: docker push <Image repository endpoint>:<Custom image version>.

    Image repository endpoint: You can obtain the endpoint on the Details page of the image repository.

    Sample code:

    docker build -t my-spark-****repo.cn-hangzhou.cr.aliyuncs.com/test/spark:3.2.0.32 .
    docker push my-spark-****repo.cn-hangzhou.cr.aliyuncs.com/test/spark:3.2.0.32
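
    If the push is rejected with an authentication error, log on to the registry first. The username and endpoint below are placeholders; use the credentials of your Container Registry instance:

    # Log on to the image repository before pushing.
    docker login --username=<Alibaba Cloud account> my-spark-****repo.cn-hangzhou.cr.aliyuncs.com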

    After you push the image to the image repository, you can view the image on the Tags page of the image repository.

  6. Configure the following parameters in the AnalyticDB for MySQL console to develop Spark jobs. Sample code:

    {
        "file": "local:///tmp/test.py",
        "name": "MyImage",
        "conf": {
            "spark.driver.resourceSpec": "small",
            "spark.executor.instances": 1,
            "spark.executor.resourceSpec": "small",
            "spark.adb.customImage.enabled": "true",
            "spark.kubernetes.container.image": "my-spark-****repo.cn-hangzhou.cr.aliyuncs.com/test/spark:3.2.0.32",
            "spark.adb.acr.instanceId": "cri-jim1hhv0vxqy****",
            "spark.adb.eni.enabled": "true",
            "spark.adb.eni.vswitchId": "vsw-bp1o49nlrhc5q8wxt****",
            "spark.adb.eni.securityGroupId": "sg-bp10fw6e31xyrryu****"
        }
    }

    The following table describes the parameters.

    | Parameter | Example | Description |
    | --- | --- | --- |
    | spark.adb.customImage.enabled | true | Specifies whether to enable the custom image feature. Set the parameter to true. Valid values: true and false (default). |
    | spark.kubernetes.container.image | my-spark-****repo.cn-hangzhou.cr.aliyuncs.com/test/spark:3.2.0.32 | The endpoint of the image repository in the Container Registry instance. |
    | spark.adb.acr.instanceId | cri-jim1hhv0vxqy**** | The ID of the Container Registry Enterprise Edition instance. You can obtain the ID in the Container Registry console. This parameter is required only for Container Registry Enterprise Edition instances. |
    | spark.adb.eni.enabled | true | Specifies whether to enable Elastic Network Interface (ENI). Set the parameter to true. Valid values: true and false (default). |
    | spark.adb.eni.vswitchId | vsw-bp1o49nlrhc5q8wxt**** | The vSwitch ID of the ENI. We recommend that you specify the vSwitch that is added to the ACL of the Container Registry instance in Step 2. If you use a Container Registry Personal Edition instance, you can specify any vSwitch that has access permissions. |
    | spark.adb.eni.securityGroupId | sg-bp10fw6e31xyrryu**** | The security group ID of the ENI. The security group must reside in the same VPC as the preceding vSwitch. |

    Sample code for a Container Registry Personal Edition instance:

    {
        "file": "local:///tmp/test.py",
        "name": "MyImage",
        "conf": {
            "spark.driver.resourceSpec": "small",
            "spark.executor.instances": 1,
            "spark.executor.resourceSpec": "small",
            "spark.adb.customImage.enabled": "true",
            "spark.kubernetes.container.image": "regi****-vpc.cn-hangzhou.aliyuncs.com/ptshalom/adb_spark:2.0",
            "spark.adb.customImage.username": "db****@test.ali.com",
            "spark.adb.customImage.password": "Data****",
            "spark.adb.eni.enabled": "true",
            "spark.adb.eni.vswitchId": "vsw-bp1o49nlrhc5q8wxt****",
            "spark.adb.eni.securityGroupId": "sg-bp10fw6e31xyrryu****"
        }
    }

    Parameters:

    • spark.adb.customImage.username: the username of the image repository in the Container Registry Personal Edition instance. This parameter is required. By default, the username is the name of your Alibaba Cloud account.

    • spark.adb.customImage.password: the password of the image repository in the Container Registry Personal Edition instance. This parameter is required. The value is the password that you specified when you activated Container Registry.

    After you run the preceding code, you can go to the Applications tab and click Log in the Actions column to view log information. For more information about Spark applications, see Overview.
