Spark Thrift Server is a service provided by Apache Spark that allows you to connect to Spark and execute SQL queries over Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). This makes it easy to integrate Spark with existing business intelligence (BI) tools, data visualization tools, and other data analysis tools. This topic describes how to create and connect to a Spark Thrift Server.
Prerequisites
A workspace is created. For more information, see Manage workspaces.
Create a Spark Thrift Server
After a Spark Thrift Server is created, you can select the Spark Thrift Server when you create a Spark SQL task.
Go to the Compute page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of the desired workspace.
In the left-side navigation pane of the EMR Serverless Spark page, choose Admin > Compute.
On the Compute page, click the Spark Thrift Server tab.
Click Create Spark Thrift Server.
On the Create Spark Thrift Server page, configure parameters and click Create. The following table describes the parameters.
Parameter
Description
Name
The name of the Spark Thrift Server.
The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), underscores (_), and spaces.
Resource Queue
The resource queue that is used to deploy the Spark Thrift Server. Select a resource queue from the drop-down list. Only resource queues that are available in the development environment and resource queues that are available in both the development and production environments are displayed in the drop-down list.
For more information about resource queues, see Manage resource queues.
Engine Version
The version of the engine that is used by the Spark Thrift Server. For more information about engine versions, see Engine versions.
Use Fusion Acceleration
Specifies whether to enable Fusion acceleration. The Fusion engine helps accelerate the processing of Spark workloads and lower the overall cost of tasks. For more information about billing, see Billing. For more information about the Fusion engine, see Fusion engine.
Automatic Stop
Specifies whether to automatically stop the Spark Thrift Server. By default, this switch is turned on, and the system automatically stops the Spark Thrift Server if it has not run any jobs in the previous 45 minutes.
Spark Thrift Server Port
The port of the Spark Thrift Server. By default, port 443 is used.
Authentication Method
The authentication mode. You can select only Token.
spark.driver.cores
The number of CPU cores that are used by the driver of the Spark application. Default value: 1.
spark.driver.memory
The size of memory that is available to the driver of the Spark application. Default value: 3.5 GB.
spark.executor.cores
The number of CPU cores that can be used by each executor. Default value: 1.
spark.executor.memory
The size of memory that is available to each executor. Default value: 3.5 GB.
spark.executor.instances
The number of executors that are allocated to the Spark application. Default value: 2.
Dynamic Resource Allocation
Specifies whether to enable dynamic resource allocation. By default, this switch is turned off. After you turn on the switch, you must configure the following parameters:
Minimum Number of Executors: Default value: 2.
Maximum Number of Executors: If you do not configure the spark.executor.instances parameter, the default value 10 is used.
More Memory Configurations
spark.driver.memoryOverhead: the size of non-heap memory that is available to each driver. Default value: 1 GB.
spark.executor.memoryOverhead: the size of non-heap memory that is available to each executor. Default value: 1 GB.
spark.memory.offHeap.size: the size of off-heap memory that is available to the Spark application. Default value: 1 GB.
This parameter takes effect only if you set the spark.memory.offHeap.enabled parameter to true. By default, if the Fusion engine is used, the spark.memory.offHeap.enabled parameter is set to true and the spark.memory.offHeap.size parameter is set to 1 GB.
Spark Configuration
The configurations of Spark. Separate the configurations with spaces. Example:
spark.sql.catalog.paimon.metastore dlf
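For reference, a filled-in Spark Configuration field that enables the optional settings described above might look like the following sketch. The property values are illustrative only; adjust them to your workload.

```properties
# Enable off-heap memory (spark.memory.offHeap.size takes effect
# only when spark.memory.offHeap.enabled is true).
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 1g
# Dynamic resource allocation bounds.
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 10
# Example catalog setting from the table above.
spark.sql.catalog.paimon.metastore dlf
```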
Obtain the endpoint of the Spark Thrift Server.
On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.
On the Overview tab of the page that appears, copy the endpoint.
Create a token
To use a token, add the header `x-acs-spark-livy-token: <token>` to your requests, for example with the `--header` option of a command-line client.
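As a minimal sketch, the header can also be built programmatically when you call the server's HTTPS endpoint directly. The helper name `build_headers` below is ours, not part of any SDK, and `<token>` is the placeholder used throughout this topic.

```python
# Sketch: constructing the token header for direct HTTPS requests.
# No network call is made here; only the header construction is shown.

def build_headers(token: str) -> dict:
    # The Spark Thrift Server gateway authenticates requests by this header.
    return {"x-acs-spark-livy-token": token}

if __name__ == "__main__":
    headers = build_headers("<token>")
    print(headers["x-acs-spark-livy-token"])
```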
On the Spark Thrift Server tab, click the name of the created Spark Thrift Server.
On the page that appears, click the Token Management tab.
On the Token Management tab, click Create Token.
In the Create Token dialog box, configure parameters and click OK. The following table describes the parameters.
Parameter
Description
Name
The name of the token.
Expired At
The validity period of the token. The validity period must be greater than or equal to 1 day. By default, this parameter is enabled and set to 365 days.
Copy the token.
Important: After the token is created, you must immediately copy it. You cannot view the token after you leave the page. If your token expires or is lost, reset it or create another token.
Connect to the Spark Thrift Server
When you connect to the Spark Thrift Server, replace the following information based on your business requirements:
<endpoint>: the endpoint that you obtained on the Overview tab of the Spark Thrift Server.
<username>: the name of the token that you created on the Token Management tab of the Spark Thrift Server.
<token>: the token that you copied on the Token Management tab of the Spark Thrift Server.
Use Python to connect to the Spark Thrift Server
Run the following command to install PyHive and Thrift:
pip install pyhive thrift
Write a Python script to connect to the Spark Thrift Server.
The following Python sample code provides an example on how to connect to Hive and query databases.
from pyhive import hive

if __name__ == '__main__':
    # Replace <endpoint>, <username>, and <token> based on your business requirements.
    cursor = hive.connect('<endpoint>',
                          port=443,
                          scheme='https',
                          username='<username>',
                          password='<token>').cursor()
    cursor.execute('show databases')
    print(cursor.fetchall())
    cursor.close()
Use Java to connect to the Spark Thrift Server
Add the following Maven dependencies to the pom.xml file.
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>
Write Java code to connect to Spark Thrift Server.
The following sample Java code provides an example on how to connect to the Spark Thrift Server and query databases.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class Main {
    public static void main(String[] args) throws Exception {
        // Replace <endpoint>, <username>, and <token> based on your business requirements.
        String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(url);
        Statement stmt = conn.createStatement();
        String sql = "show databases";
        System.out.println("Running " + sql);
        ResultSet res = stmt.executeQuery(sql);
        ResultSetMetaData md = res.getMetaData();
        String[] columns = new String[md.getColumnCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = md.getColumnName(i + 1);
        }
        while (res.next()) {
            System.out.print("Row " + res.getRow() + "=[");
            for (int i = 0; i < columns.length; i++) {
                if (i != 0) {
                    System.out.print(", ");
                }
                System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
            }
            System.out.println("]");
        }
        conn.close();
    }
}
Use the Beeline client to connect to the Spark Thrift Server
beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice;user=<username>;password=<token>"
Configure Apache Superset to connect to the Spark Thrift Server
Apache Superset is a modern data exploration and visualization platform that supports various types of charts, from simple line charts to highly detailed geospatial charts. For more information about Superset, see Superset.
Install the Thrift dependency.
Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:
pip install thrift==0.20.0
Start Superset.
For more information, see Superset.
In the upper-right corner of the page that appears, click DATABASE.
In the Connect a database dialog box, select Apache Spark SQL from the SUPPORTED DATABASES drop-down list.
Enter the connection string and configure the data source parameters.
hive+https://<username>:<token>@<endpoint>:443/<db_name>
Click FINISH and confirm that the database is connected.
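Tokens can contain characters that are not URL-safe. As a sketch using only the Python standard library, you can percent-encode the credentials when building the connection string above; the helper name `superset_uri` and the sample values are ours, not part of Superset.

```python
from urllib.parse import quote_plus

def superset_uri(username: str, token: str, endpoint: str, db_name: str) -> str:
    # Percent-encode the credentials so that special characters in the
    # token do not break the URI that Superset parses.
    return (f"hive+https://{quote_plus(username)}:{quote_plus(token)}"
            f"@{endpoint}:443/{db_name}")

if __name__ == "__main__":
    # Sample values for illustration only.
    print(superset_uri("my_token_name", "a/b+c", "example-endpoint", "default"))
```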
Configure Hue to connect to the Spark Thrift Server
Hue provides the UI that allows you to interact with the Hadoop ecosystem. For more information about Hue, see Hue.
Install the Thrift dependency.
Make sure that a recent version of the Thrift Python package is installed. We recommend a version later than 0.16.0. If Thrift is not installed, you can run the following command to install it:
pip install thrift==0.20.0
Add the Spark SQL connection string to the configuration file of Hue.
Find the configuration file of Hue and add the following code to the file. In most cases, the configuration file of Hue is /etc/hue/hue.conf.
[[[sparksql]]]
   name = Spark Sql
   interface = sqlalchemy
   options = '{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'
Restart Hue.
After you modify the configurations, you must run the following command to restart the Hue service for the modification to take effect:
sudo service hue restart
Verify the connection.
After Hue is restarted, access the web UI of Hue and go to the Spark Sql page. If the configurations are correct, you can connect to the Spark Thrift Server and perform SQL queries.
References
For information about the Fusion engine, see Fusion engine.