A Spark Thrift Server is a service provided by Apache Spark that allows you to connect to Spark and execute SQL queries over Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). This makes it easy to integrate Spark with existing business intelligence (BI), data visualization, and other data analysis tools. This topic describes how to create and connect to a Spark Thrift Server.
Prerequisites
A workspace is created. For more information, see Manage workspaces.
Create a Spark Thrift Server
After a Spark Thrift Server is created, you can select it when you create a Spark SQL job.
Go to the Sessions page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.
On the Sessions page, click the Spark Thrift Server Sessions tab.
Click Create Spark Thrift Server Session.
On the Create Spark Thrift Server Session page, configure parameters and click Create. The following table describes the parameters.
Parameter
Description
Name
The name of the Spark Thrift Server.
The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), underscores (_), and spaces.
Resource Queue
The resource queue that is used to deploy the Spark Thrift Server. Select a resource queue from the drop-down list. Only resource queues that are available in the development environment, or in both the development and production environments, are displayed.
For more information about resource queues, see Manage resource queues.
Engine Version
The version of the engine that is used by the Spark Thrift Server. For more information about engine versions, see Engine versions.
Use Fusion Acceleration
Specifies whether to enable Fusion acceleration. The Fusion engine helps accelerate the processing of Spark workloads and lower the overall cost of jobs. For more information about billing, see Billing. For more information about the Fusion engine, see Fusion engine.
Automatic Stop
By default, this switch is turned on. If the Spark Thrift Server remains inactive for 45 minutes, the system automatically stops it.
Spark Thrift Server Port
The port of the Spark Thrift Server. By default, port 443 is used.
Authentication Method
The authentication method. Only Token is supported.
spark.driver.cores
The number of CPU cores that are used by the driver of the Spark application. Default value: 1 CPU.
spark.driver.memory
The size of memory that is available to the driver of the Spark application. Default value: 3.5 GB.
spark.executor.cores
The number of CPU cores that can be used by each executor. Default value: 1 CPU.
spark.executor.memory
The size of memory that is available to each executor. Default value: 3.5 GB.
spark.executor.instances
The number of executors that are allocated to the Spark application. Default value: 2.
Dynamic Resource Allocation
By default, this feature is disabled. After you enable this feature, you must configure the following parameters:
Minimum Number of Executors: Default value: 2.
Maximum Number of Executors: If you do not configure the spark.executor.instances parameter, the default value 10 is used.
More Memory Configurations
spark.driver.memoryOverhead: the size of non-heap memory that is available to the driver. If you leave this parameter empty, Spark automatically assigns a value based on the following formula: max(384 MB, 10% × spark.driver.memory). For example, with the default spark.driver.memory of 3.5 GB, 10% is approximately 358 MB, so the overhead defaults to 384 MB.
spark.executor.memoryOverhead: the size of non-heap memory that is available to each executor. If you leave this parameter empty, Spark automatically assigns a value based on the following formula: max(384 MB, 10% × spark.executor.memory).
spark.memory.offHeap.size: the size of off-heap memory that is available to the Spark application. Default value: 1 GB. This parameter takes effect only if the spark.memory.offHeap.enabled parameter is set to true. By default, if you use the Fusion engine, the spark.memory.offHeap.enabled parameter is set to true and the spark.memory.offHeap.size parameter is set to 1 GB.
Spark Configuration
The configurations of Spark. Separate the configurations with spaces. Example:
spark.sql.catalog.paimon.metastore dlf
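For reference, the spark.* parameters in this table are standard Spark property names, so the same settings can also be written as space-separated key-value pairs. The following sketch shows the defaults listed above in that form; it is an illustration of the mapping, not an additional required step, and 3.5 GB is written as 3584m because Spark memory strings take integer values:
spark.driver.cores 1
spark.driver.memory 3584m
spark.executor.cores 1
spark.executor.memory 3584m
spark.executor.instances 2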
Obtain the endpoint of the Spark Thrift Server.
On the Spark Thrift Server Sessions tab, click the name of the created Spark Thrift Server.
On the Overview tab of the page that appears, copy the endpoint.
Create a token
To use a token in HTTP requests, add the header `x-acs-spark-livy-token: <token>` to your requests (for example, with the curl --header option).
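A minimal curl sketch; the request path here is a hypothetical placeholder, and only the header name comes from this topic:
curl --header "x-acs-spark-livy-token: <token>" "https://<endpoint>/<path>"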
On the Spark Thrift Server Sessions tab, click the name of the created Spark Thrift Server.
On the page that appears, click the Tokens tab.
On the Tokens tab, click Create Token.
In the Create Token dialog box, configure parameters and click OK. The following table describes the parameters.
Parameter
Description
Name
The name of the token.
Expired At
The validity period of the token. The validity period must be at least 1 day. By default, expiration is enabled and set to 365 days.
Copy the token.
Important: After the token is created, immediately copy and save it. You can no longer view the token after you leave the page. If your token expires or is lost, reset the token or create a new one.
Connect to the Spark Thrift Server
When you connect to the Spark Thrift Server, replace the following information based on your business requirements:
<endpoint>: the endpoint that you obtained on the Overview tab of the Spark Thrift Server.
<username>: the name of the token that you created on the Tokens tab of the Spark Thrift Server.
<token>: the token that you copied on the Tokens tab of the Spark Thrift Server.
Use Python to connect to the Spark Thrift Server
Run the following command to install PyHive and Thrift:
pip install pyhive thrift
Write a Python script to connect to the Spark Thrift Server.
The following Python sample code provides an example on how to connect to Hive and query databases.
from pyhive import hive

if __name__ == '__main__':
    # Replace <endpoint>, <username>, and <token> based on your business requirements.
    cursor = hive.connect('<endpoint>', port=443, scheme='https',
                          username='<username>', password='<token>').cursor()
    cursor.execute('show databases')
    print(cursor.fetchall())
    cursor.close()
Use Java to connect to the Spark Thrift Server
Add the following Maven dependencies to the pom.xml file.
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>
NoteThe version of Hive that is built in Serverless Spark is 2.x. Therefore, only hive-jdbc 2.x is supported.
Write Java code to connect to the Spark Thrift Server.
The following sample Java code provides an example on how to connect to the Spark Thrift Server and query databases.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class Main {
    public static void main(String[] args) throws Exception {
        // Replace <endpoint> and <token> based on your business requirements.
        String url = "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(url);
        Statement stmt = conn.createStatement();
        String sql = "show databases";
        System.out.println("Running " + sql);
        ResultSet res = stmt.executeQuery(sql);
        ResultSetMetaData md = res.getMetaData();
        String[] columns = new String[md.getColumnCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = md.getColumnName(i + 1);
        }
        while (res.next()) {
            System.out.print("Row " + res.getRow() + "=[");
            for (int i = 0; i < columns.length; i++) {
                if (i != 0) {
                    System.out.print(", ");
                }
                System.out.print(columns[i] + "='" + res.getObject(i + 1) + "'");
            }
            System.out.println("]");
        }
        conn.close();
    }
}
Use the Spark Beeline client to connect to the Spark Thrift Server
Optional. If you use an EMR on ECS cluster, go to the bin directory of Spark.
cd /opt/apps/SPARK3/spark-3.4.2-hadoop3.2-1.0.3/bin/
Run the following command to connect to the Spark Thrift Server:
beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>"
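You can also run a statement non-interactively with the Beeline -e option. A minimal sketch, reusing the query from the other examples in this topic:
beeline -u "jdbc:hive2://<endpoint>:443/;transportMode=http;httpPath=cliservice/token/<token>" -e "show databases"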
If the following error occurs when you run the command, the Hive Beeline client may be incompatible with the Serverless Spark Thrift Server. Make sure that you use the Spark Beeline client to connect to the Spark Thrift Server.
24/08/22 15:09:11 [main]: ERROR jdbc.HiveConnection: Error opening session org.apache.thrift.transport.TTransportException: HTTP Response code: 404
Configure Apache Superset to connect to the Spark Thrift Server
Apache Superset is a modern data exploration and visualization platform that supports various types of charts, from simple line charts to highly detailed geospatial charts. For more information about Superset, see Superset.
Install the Thrift dependency.
Make sure that a recent version of Thrift is installed. We recommend a version later than 0.16.0. If Thrift is not installed, run the following command to install it:
pip install thrift==0.20.0
Start Superset.
For more information, see Superset.
In the upper-right corner of the page that appears, click DATABASE.
In the Connect a database dialog box, select Apache Spark SQL from the SUPPORTED DATABASES drop-down list.
Enter the connection string and configure the data source parameters.
hive+https://<username>:<token>@<endpoint>:443/<db_name>
Click FINISH and confirm that the database is connected.
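If the connection fails, it can help to test the same SQLAlchemy URI outside Superset. The following is a minimal Python sketch, assuming a recent PyHive with SQLAlchemy support is installed so that the hive+https dialect is registered; the placeholders are the ones described above:
from sqlalchemy import create_engine, text

# Placeholders as described in this topic; replace them before running.
engine = create_engine("hive+https://<username>:<token>@<endpoint>:443/<db_name>")
with engine.connect() as conn:
    # The same smoke-test query used elsewhere in this topic.
    for row in conn.execute(text("show databases")):
        print(row)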
Configure Hue to connect to the Spark Thrift Server
Hue provides the UI that allows you to interact with the Hadoop ecosystem. For more information about Hue, see Hue.
Install the Thrift dependency.
Make sure that a recent version of Thrift is installed. We recommend a version later than 0.16.0. If Thrift is not installed, run the following command to install it:
pip install thrift==0.20.0
Add the Spark SQL connection string to the configuration file of Hue.
Find the configuration file of Hue and add the following code to the file. In most cases, the configuration file of Hue is stored at /etc/hue/hue.conf.
[[[sparksql]]]
  name = Spark Sql
  interface = sqlalchemy
  options = '{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'
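Note: as an assumption based on Hue's standard configuration layout rather than anything stated in this topic, triple-bracketed interpreter sections such as [[[sparksql]]] normally sit under the [notebook] section's [[interpreters]] block:
[notebook]
  [[interpreters]]
    [[[sparksql]]]
      name = Spark Sql
      interface = sqlalchemy
      options = '{"url": "hive+https://<username>:<token>@<endpoint>:443/"}'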
Restart Hue.
After you modify the configurations, you must run the following command to restart the Hue service for the modification to take effect:
sudo service hue restart
Verify the connection.
After Hue is restarted, access the web UI of Hue and go to the Spark Sql page. If the configurations are correct, you can connect to the Spark Thrift Server and perform SQL queries.
References
For information about the Fusion engine, see Fusion engine.