E-MapReduce: Manage Spark Thrift Servers

Last Updated: Sep 11, 2024

Spark Thrift Server is a service provided by Apache Spark that allows you to connect to Spark and execute SQL queries over Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). This makes it easy to integrate Spark with existing business intelligence (BI) tools, data visualization tools, and other data analysis tools. This topic describes how to create and connect to a Spark Thrift Server.

Prerequisites

A workspace is created. For more information, see Manage workspaces.

Create a Spark Thrift Server

After a Spark Thrift Server is created, you can select it when you create a Spark SQL task.

  1. Go to the Compute page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the desired workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, choose Admin > Compute.

  2. On the Compute page, click the Spark Thrift Server tab.

  3. Click Create Spark Thrift Server.

  4. On the Create Spark Thrift Server page, configure the parameters and click Create. The following list describes the parameters.

    • Name: The name of the Spark Thrift Server. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), underscores (_), and spaces.

    • Resource Queue: The resource queue that is used to deploy the Spark Thrift Server. Select a resource queue from the drop-down list. Only resource queues that are available in the development environment, or in both the development and production environments, are displayed. For more information about resource queues, see Manage resource queues.

    • Engine Version: The version of the engine that is used by the Spark Thrift Server. For more information about engine versions, see Engine versions.

    • Use Fusion Acceleration: Specifies whether to enable Fusion acceleration. The Fusion engine helps accelerate the processing of Spark workloads and lower the overall cost of tasks. For more information about billing, see Billing. For more information about the Fusion engine, see Fusion engine.

    • Automatic Stop: Specifies whether to automatically stop an idle Spark Thrift Server. By default, this switch is turned on, and the system automatically stops the Spark Thrift Server if it has not run jobs in the previous 45 minutes.

    • Spark Thrift Server Port: The port of the Spark Thrift Server. By default, port 443 is used.

    • Authentication Method: The authentication method. Only Token is supported.

    • spark.driver.cores: The number of CPU cores that are used by the driver of the Spark application. Default value: 1.

    • spark.driver.memory: The size of memory that is available to the driver of the Spark application. Default value: 3.5 GB.

    • spark.executor.cores: The number of CPU cores that can be used by each executor. Default value: 1.

    • spark.executor.memory: The size of memory that is available to each executor. Default value: 3.5 GB.

    • spark.executor.instances: The number of executors that are allocated to the Spark application. Default value: 2.

    • Dynamic Resource Allocation: Specifies whether to enable dynamic resource allocation. By default, this switch is turned off. If you turn on the switch, you must also configure the following parameters:

      • Minimum Number of Executors: Default value: 2.

      • Maximum Number of Executors: If you do not configure the spark.executor.instances parameter, the default value 10 is used.

    • More Memory Configurations:

      • spark.driver.memoryOverhead: the size of non-heap memory that is available to each driver. Default value: 1 GB.

      • spark.executor.memoryOverhead: the size of non-heap memory that is available to each executor. Default value: 1 GB.

      • spark.memory.offHeap.size: the size of off-heap memory that is available to the Spark application. Default value: 1 GB. This parameter takes effect only if the spark.memory.offHeap.enabled parameter is set to true. If the Fusion engine is used, spark.memory.offHeap.enabled is set to true and spark.memory.offHeap.size is set to 1 GB by default.

    • Spark Configuration: The Spark configurations. Separate the key and value of each configuration with a space, as shown in the example after this list. Example: spark.sql.catalog.paimon.metastore dlf.
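
    The following sketch shows what this field might contain when more than one configuration is set. The second line is an illustrative, commonly used Spark setting that is not taken from this topic, and the sketch assumes that each configuration is entered on its own line:

      spark.sql.catalog.paimon.metastore dlf
      spark.sql.shuffle.partitions 200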

  5. Obtain the endpoint of the Spark Thrift Server.

    1. On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.

    2. On the Overview tab of the page that appears, copy the endpoint.

Create a token

Note

To use a token, add the header `x-acs-spark-livy-token: <token>` to your requests, for example, by using the --header option in curl.

  1. On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.

  2. On the page that appears, click the Token Management tab.

  3. On the Token Management tab, click Create Token.

  4. In the Create Token dialog box, configure the parameters and click OK. The following list describes the parameters.

    • Name: The name of the token.

    • Expired At: The validity period of the token. The validity period must be at least 1 day. By default, this switch is turned on and the validity period is set to 365 days.

  5. Copy the token.

    Important

    After the token is created, you must immediately copy the token. You can no longer view the token after you leave the page. If your token expires or is lost, reset the token or create another token.
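
The note at the beginning of this section describes the x-acs-spark-livy-token header. The following is a minimal sketch of how to attach that header to a direct HTTPS request by using the Python requests library; the request path is a placeholder because it depends on the API that you call:

import requests

# Hypothetical request: replace <endpoint>, <path>, and <token> with real values.
response = requests.post(
    "https://<endpoint>/<path>",
    headers={"x-acs-spark-livy-token": "<token>"},  # token from the Token Management tab
)
print(response.status_code)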

Connect to the Spark Thrift Server

When you connect to the Spark Thrift Server, replace the following information based on your business requirements:

  • <endpoint>: the endpoint that you obtained on the Overview tab of the Spark Thrift Server.

  • <username>: the name of the token that you created on the Token Management tab of the Spark Thrift Server.

  • <token>: the token that you copied on the Token Management tab of the Spark Thrift Server.

Use Python to connect to the Spark Thrift Server

  1. Run the following command to install PyHive and Thrift:

    pip install pyhive thrift
  2. Write a Python script to connect to the Spark Thrift Server.

    The following Python sample code provides an example on how to connect to the Spark Thrift Server and query databases.

    from pyhive import hive
    
    if __name__ == '__main__':
        # Replace <endpoint>, <username>, and <token> based on your business requirements.
        conn = hive.connect('<endpoint>', port=443, scheme='https',
                            username='<username>', password='<token>')
        cursor = conn.cursor()
        # List the databases that are visible to this connection.
        cursor.execute('show databases')
        print(cursor.fetchall())
        cursor.close()
        conn.close()
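
    If you want to process rows one at a time together with their column names instead of fetching all results at once, you can iterate over the cursor. The following is a minimal sketch; <db_name> and <table_name> are placeholders for your own objects:

    from pyhive import hive

    conn = hive.connect('<endpoint>', port=443, scheme='https',
                        username='<username>', password='<token>')
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM <db_name>.<table_name> LIMIT 10')
    # Read the column names from the cursor metadata (DB-API description attribute).
    columns = [desc[0] for desc in cursor.description]
    for row in cursor:  # PyHive cursors support iteration
        print(dict(zip(columns, row)))
    cursor.close()
    conn.close()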
    

Use Java to connect to the Spark Thrift Server

  1. Add the following Maven dependencies to the pom.xml file.

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
    
  2. Write Java code to connect to the Spark Thrift Server.

    The following sample Java code provides an example on how to connect to the Spark Thrift Server and query databases.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    
    public class Main {
        public static void main(String[] args) throws Exception {
            // Replace <endpoint>, <username>, and <token> based on your business requirements.
            String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(url);
            Statement stmt = conn.createStatement();
    
            String sql = "show databases";
            System.out.println("Running " + sql);
            ResultSet res = stmt.executeQuery(sql);
    
            // Read the column names from the result set metadata.
            ResultSetMetaData md = res.getMetaData();
            String[] columns = new String[md.getColumnCount()];
            for (int i = 0; i < columns.length; i++) {
                columns[i] = md.getColumnName(i + 1);
            }
            while (res.next()) {
                System.out.print("Row " + res.getRow() + "=[");
                for (int i = 0; i < columns.length; i++) {
                    if (i != 0) {
                        System.out.print(", ");
                    }
                    System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
                }
                System.out.println("]");
            }
    
            res.close();
            stmt.close();
            conn.close();
        }
    }

Use the Beeline client to connect to the Spark Thrift Server

beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>"

Configure Apache Superset to connect to the Spark Thrift Server

Apache Superset is a modern data exploration and visualization platform that supports various types of charts, from simple line charts to highly detailed geospatial charts. For more information about Superset, see Superset.

  1. Install the Thrift dependency.

    Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:

    pip install thrift==0.20.0
  2. Start Superset.

    For more information, see Superset.

  3. In the upper-right corner of the page that appears, click DATABASE.

  4. In the Connect a database dialog box, select Apache Spark SQL from the SUPPORTED DATABASES drop-down list.


  5. Enter the connection string and configure the data source parameters. A sketch for verifying this connection string outside Superset follows these steps.

    hive+https://<username>:<token>@<endpoint>:443/<db_name>
  6. Click FINISH and confirm that the database is connected.
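
To verify the connection string outside Superset, you can run it through SQLAlchemy, which Superset uses under the hood. The following is a minimal sketch; it assumes that SQLAlchemy and a PyHive version whose dialect supports the hive+https scheme are installed, for example by running pip install "pyhive[hive]" sqlalchemy:

from sqlalchemy import create_engine, text

# The URL format matches the connection string above; replace the placeholders.
engine = create_engine('hive+https://<username>:<token>@<endpoint>:443/<db_name>')
with engine.connect() as conn:
    for row in conn.execute(text('show databases')):
        print(row)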

Configure Hue to connect to the Spark Thrift Server

Hue provides a web UI that allows you to interact with the Hadoop ecosystem. For more information about Hue, see Hue.

  1. Install the Thrift dependency.

    Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:

    pip install thrift==0.20.0
  2. Add the Spark SQL connection string to the configuration file of Hue.

    Find the configuration file of Hue and add the following code to the file. In most cases, the configuration file of Hue is located at /etc/hue/hue.conf.

       [[[sparksql]]]
         name = Spark Sql
         interface=sqlalchemy
         options='{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'
  3. Restart Hue.

    After you modify the configurations, you must run the following command to restart the Hue service for the modification to take effect:

    sudo service hue restart
  4. Verify the connection.

    After Hue is restarted, access the web UI of Hue and go to the Spark Sql page. If the configurations are correct, you can connect to the Spark Thrift Server and perform SQL queries.


FAQ

What do I do if an error is reported when Beeline is used to connect an EMR on ECS cluster to a Serverless Spark Thrift Server?

  • Description

    When Beeline is used to connect an EMR on ECS cluster to a Serverless Spark Thrift Server, the following error is reported:

    24/08/22 15:09:11 [main]: ERROR jdbc.HiveConnection: Error opening session
    org.apache.thrift.transport.TTransportException: HTTP Response code: 404
  • Cause

    In most cases, this error is caused by a version compatibility conflict. The latest EMR versions ship with a later version of the Beeline client, which may be incompatible with the Serverless Spark Thrift Server. As a result, the EMR on ECS cluster fails to connect to the Serverless Spark Thrift Server and the preceding error is reported.

  • Solution

    To ensure that your EMR on ECS cluster can connect to the Serverless Spark Thrift Server as expected, we recommend that you replace the beeline command with the spark-beeline command. The spark-beeline client is specifically designed to work with Spark Thrift Server and ensures version compatibility, which prevents connection errors caused by incompatible versions.

    spark-beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>"

References

For information about the Fusion engine, see Fusion engine.