E-MapReduce: Manage Spark Thrift Servers

Last Updated: Sep 11, 2024

Spark Thrift Server is a service provided by Apache Spark that allows you to connect to Spark and execute SQL queries over Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). This makes it easy to integrate Spark with existing business intelligence (BI) tools, data visualization tools, and other data analysis tools. This topic describes how to create and connect to a Spark Thrift Server.

Prerequisites

A workspace is created. For more information, see Manage workspaces.

Create a Spark Thrift Server

After a Spark Thrift Server is created, you can select it when you create a Spark SQL task.

  1. Go to the Compute page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the desired workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, choose Admin > Compute.

  2. On the Compute page, click the Spark Thrift Server tab.

  3. Click Create Spark Thrift Server.

  4. On the Create Spark Thrift Server page, configure the parameters and click Create. The following list describes the parameters.

    • Name: The name of the Spark Thrift Server. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), underscores (_), and spaces.

    • Resource Queue: The resource queue that is used to deploy the Spark Thrift Server. Select a resource queue from the drop-down list. Only resource queues that are available in the development environment, or in both the development and production environments, are displayed. For more information about resource queues, see Manage resource queues.

    • Engine Version: The version of the engine that is used by the Spark Thrift Server. For more information about engine versions, see Engine versions.

    • Use Fusion Acceleration: Specifies whether to enable Fusion acceleration. The Fusion engine helps accelerate the processing of Spark workloads and lower the overall cost of tasks. For more information about billing, see Billing. For more information about the Fusion engine, see Fusion engine.

    • Automatic Stop: Specifies whether to automatically stop an idle Spark Thrift Server. By default, this switch is turned on, and the system automatically stops the Spark Thrift Server if it has not run jobs in the previous 45 minutes.

    • Spark Thrift Server Port: The port of the Spark Thrift Server. By default, port 443 is used.

    • Authentication Method: The authentication method. Only Token is supported.

    • spark.driver.cores: The number of CPU cores that are used by the driver of the Spark application. Default value: 1.

    • spark.driver.memory: The size of memory that is available to the driver of the Spark application. Default value: 3.5 GB.

    • spark.executor.cores: The number of CPU cores that can be used by each executor. Default value: 1.

    • spark.executor.memory: The size of memory that is available to each executor. Default value: 3.5 GB.

    • spark.executor.instances: The number of executors that are allocated to the Spark application. Default value: 2.

    • Dynamic Resource Allocation: Specifies whether to enable dynamic resource allocation. By default, this switch is turned off. If you turn on the switch, you must also configure the following parameters:

      • Minimum Number of Executors: Default value: 2.

      • Maximum Number of Executors: If you do not configure the spark.executor.instances parameter, the default value 10 is used.

    • More Memory Configurations:

      • spark.driver.memoryOverhead: the size of non-heap memory that is available to each driver. Default value: 1 GB.

      • spark.executor.memoryOverhead: the size of non-heap memory that is available to each executor. Default value: 1 GB.

      • spark.memory.offHeap.size: the size of off-heap memory that is available to the Spark application. Default value: 1 GB. This parameter takes effect only if the spark.memory.offHeap.enabled parameter is set to true. If the Fusion engine is used, spark.memory.offHeap.enabled is set to true and spark.memory.offHeap.size is set to 1 GB by default.

    • Spark Configuration: The Spark configurations. Separate the key and value of each configuration with a space, as shown in the example after this list. Example: spark.sql.catalog.paimon.metastore dlf.
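
    The following sketch shows what this field might contain when more than one configuration is set. The second line is an illustrative, commonly used Spark setting that is not taken from this topic, and the sketch assumes that each configuration is entered on its own line:

      spark.sql.catalog.paimon.metastore dlf
      spark.sql.shuffle.partitions 200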

  5. Obtain the endpoint of the Spark Thrift Server.

    1. On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.

    2. On the Overview tab of the page that appears, copy the endpoint.

Create a token

Note

To use a token, add the header `x-acs-spark-livy-token: <token>` to your requests, for example, by using the --header option in curl.

  1. On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.

  2. On the page that appears, click the Token Management tab.

  3. On the Token Management tab, click Create Token.

  4. In the Create Token dialog box, configure the parameters and click OK. The following list describes the parameters.

    • Name: The name of the token.

    • Expired At: The validity period of the token. The validity period must be at least 1 day. By default, this switch is turned on and the validity period is set to 365 days.

  5. Copy the token.

    Important

    After the token is created, you must immediately copy the token. You can no longer view the token after you leave the page. If your token expires or is lost, reset the token or create another token.
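
The note at the beginning of this section describes the x-acs-spark-livy-token header. The following is a minimal sketch of how to attach that header to a direct HTTPS request by using the Python requests library; the request path is a placeholder because it depends on the API that you call:

import requests

# Hypothetical request: replace <endpoint>, <path>, and <token> with real values.
response = requests.post(
    "https://<endpoint>/<path>",
    headers={"x-acs-spark-livy-token": "<token>"},  # token from the Token Management tab
)
print(response.status_code)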

Connect to the Spark Thrift Server

When you connect to the Spark Thrift Server, replace the following information based on your business requirements:

  • <endpoint>: the endpoint that you obtained on the Overview tab of the Spark Thrift Server.

  • <username>: the name of the token that you created on the Token Management tab of the Spark Thrift Server.

  • <token>: the token that you copied on the Token Management tab of the Spark Thrift Server.

Use Python to connect to the Spark Thrift Server

  1. Run the following command to install PyHive and Thrift:

    pip install pyhive thrift
  2. Write a Python script to connect to the Spark Thrift Server.

    The following Python sample code provides an example on how to connect to the Spark Thrift Server and query databases.

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> based on your business requirements.
        conn = hive.connect('<endpoint>', port=443, scheme='https',
                            username='<username>', password='<token>')
        cursor = conn.cursor()
        # List the databases that are visible to this connection.
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()
        conn.close()
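
    If you want to process rows one at a time together with their column names instead of fetching all results at once, you can iterate over the cursor. The following is a minimal sketch; <db_name> and <table_name> are placeholders for your own objects:

    from pyhive import hive

    conn = hive.connect('<endpoint>', port=443, scheme='https',
                        username='<username>', password='<token>')
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM <db_name>.<table_name> LIMIT 10')
    # Read the column names from the cursor metadata (DB-API description attribute).
    columns = [desc[0] for desc in cursor.description]
    for row in cursor:  # PyHive cursors support iteration
        print(dict(zip(columns, row)))
    cursor.close()
    conn.close()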
    

Use Java to connect to the Spark Thrift Server

  1. Add the following Maven dependencies to the pom.xml file.

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
    
  2. Write Java code to connect to the Spark Thrift Server.

    The following sample Java code provides an example on how to connect to the Spark Thrift Server and query databases.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            // Replace <endpoint>, <username>, and <token> based on your business requirements.
            String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            Statement stmt = conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            // Read the column names from the result set metadata.
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println("]");
            }
    
            res.close();
            stmt.close();
            conn.close();
        }
    }

Use the Beeline client to connect to the Spark Thrift Server

beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>"

Configure Apache Superset to connect to the Spark Thrift Server

Apache Superset is a modern data exploration and visualization platform that supports various types of charts, from simple line charts to highly detailed geospatial charts. For more information about Superset, see Superset.

  1. Install the Thrift dependency.

    Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:

    pip install thrift==0.20.0
  2. Start Superset.

    For more information, see Superset.

  3. In the upper-right corner of the page that appears, click DATABASE.

  4. In the Connect a database dialog box, select Apache Spark SQL from the SUPPORTED DATABASES drop-down list.


  5. Enter the connection string and configure the data source parameters. A sketch for verifying this connection string outside Superset follows these steps.

    hive+https://<username>:<token>@<endpoint>:443/<db_name>
  6. Click FINISH and confirm that the database is connected.
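
To verify the connection string outside Superset, you can run it through SQLAlchemy, which Superset uses under the hood. The following is a minimal sketch; it assumes that SQLAlchemy and a PyHive version whose dialect supports the hive+https scheme are installed, for example by running pip install "pyhive[hive]" sqlalchemy:

from sqlalchemy import create_engine, text

# The URL format matches the connection string above; replace the placeholders.
engine = create_engine('hive+https://<username>:<token>@<endpoint>:443/<db_name>')
with engine.connect() as conn:
    for row in conn.execute(text('show databases')):
        print(row)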

Configure Hue to connect to the Spark Thrift Server

Hue provides a web UI that allows you to interact with the Hadoop ecosystem. For more information about Hue, see Hue.

  1. Install the Thrift dependency.

    Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:

    pip install thrift==0.20.0
  2. Add the Spark SQL connection string to the configuration file of Hue.

    Find the configuration file of Hue and add the following code to the file. In most cases, the configuration file of Hue is located at /etc/hue/hue.conf.

       [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'
  3. Restart Hue.

    After you modify the configurations, you must run the following command to restart the Hue service for the modification to take effect:

    sudo service hue restart
  4. Verify the connection.

    After Hue is restarted, access the web UI of Hue and go to the Spark Sql page. If the configurations are correct, you can connect to the Spark Thrift Server and perform SQL queries.


FAQ

What do I do if an error is reported when Beeline is used to connect an EMR on ECS cluster to a Serverless Spark Thrift Server?

  • Description

    When Beeline is used to connect an EMR on ECS cluster to a Serverless Spark Thrift Server, the following error is reported:

    24/08/22 15:09:11 [main]: ERROR jdbc.HiveConnection: Error opening session
    org.apache.thrift.transport.TTransportException: HTTP Response code: 404
  • Cause

    In most cases, this error is caused by a version compatibility conflict. The latest EMR versions ship with a later version of the Beeline client, which may be incompatible with the Serverless Spark Thrift Server. As a result, the EMR on ECS cluster fails to connect to the Serverless Spark Thrift Server and the preceding error is reported.

  • Solution

    To ensure that your EMR on ECS cluster can connect to the Serverless Spark Thrift Server as expected, we recommend that you replace the beeline command with the spark-beeline command. The spark-beeline client is specifically designed to work with Spark Thrift Server and ensures version compatibility, which prevents connection errors caused by incompatible versions.

    spark-beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>"

References

For information about the Fusion engine, see Fusion engine.