This topic describes how to configure network access so that the serverless Spark
engine of Data Lake Analytics (DLA) can access data sources in your virtual private
cloud (VPC). The data sources include ApsaraDB RDS, AnalyticDB, PolarDB,
ApsaraDB for MongoDB, Elasticsearch, ApsaraDB for HBase, E-MapReduce, Message Queue
for Apache Kafka, and self-managed data services hosted on an Elastic Compute Service
(ECS) instance.
Background information
The driver and executors of the serverless Spark engine run in a security container.
You can attach an elastic network interface (ENI) of your VPC to the security container.
This way, the security container runs in your VPC in much the same way as an ECS
instance does. The lifecycle of an ENI is the same as that of a Spark process.
After a job is completed, all of its ENIs are released.
To attach an ENI of your VPC to the serverless Spark engine, you must specify the
IDs of a security group and a vSwitch of your VPC in the job configurations of the
serverless Spark engine. If an ECS instance of yours can already access the destination
data, you only need to specify the IDs of the security group and vSwitch that are
associated with that ECS instance in the configurations of the Spark job.
Note On the serverless Spark engine, the driver and each executor occupy one IP address
of the specified vSwitch each. Before you submit a job, make sure that the classless
inter-domain routing (CIDR) block of the vSwitch has enough available IP addresses
for the driver and all executors of the job.
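The capacity check in the preceding note can be sketched in Python. This is a minimal illustration, not part of DLA: the function names are hypothetical, the number of addresses already in use must come from your own inventory, and the count of system-reserved addresses per vSwitch is an assumption that you should confirm against the VPC documentation.

```python
import ipaddress

# Assumption: the number of addresses that the platform reserves in each
# vSwitch CIDR block. Confirm this value in the VPC documentation.
RESERVED_PER_VSWITCH = 4


def required_ips(executor_instances: int) -> int:
    """Return the IP addresses a job needs: one for the driver plus one per executor."""
    if executor_instances < 0:
        raise ValueError("executor_instances must not be negative")
    return 1 + executor_instances


def free_ips(vswitch_cidr: str, in_use: int = 0) -> int:
    """Estimate the free addresses in a vSwitch CIDR block.

    in_use is the number of addresses already taken by other resources
    (hypothetical input; this sketch does not query any cloud API).
    """
    network = ipaddress.ip_network(vswitch_cidr, strict=True)
    return max(network.num_addresses - RESERVED_PER_VSWITCH - in_use, 0)


def has_capacity(vswitch_cidr: str, executor_instances: int, in_use: int = 0) -> bool:
    """Return True if the vSwitch can hold the driver and all executors."""
    return free_ips(vswitch_cidr, in_use) >= required_ips(executor_instances)


if __name__ == "__main__":
    # Example values only: a /28 vSwitch and a job with 10 executors.
    print(has_capacity("192.168.1.0/28", executor_instances=10))
```

If a check like this fails, either reduce spark.executor.instances or choose a vSwitch with a larger CIDR block.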
Usage notes
If you use DLA to access data from the following Alibaba Cloud services, the configurations
described in this topic are not required: Object Storage Service (OSS), MaxCompute,
Tablestore, and Log Service (SLS). To access data from these services,
you must configure an AccessKey pair.
Procedure
- Obtain the IDs of the required vSwitch and security group.
You can use one of the following methods to obtain the IDs of the vSwitch and security
group. If your ECS instance can already access the destination data source over your
VPC, we recommend that you use Method 1 to obtain the IDs of the security group and
vSwitch of your ECS instance. If your ECS instance cannot access the destination
data source, you can use Method 2 to obtain the IDs from the basic information of
the destination data source. You can also use Method 3 to create a security group
and vSwitch.
- Method 1: Use the IDs of the security group and vSwitch of an ECS instance
- Log on to the ECS console and find the required ECS instance in the ECS instance list.
- On the Instance Details page of the ECS instance, find the IDs of the security group and vSwitch, as shown
in the following figure.
- Method 2: Use the IDs of the existing security group and vSwitch of the destination
data source
You can obtain the IDs of the security group and vSwitch from the basic information
of the destination data source. The following figure shows the basic information of
an E-MapReduce cluster.
If the basic information of the destination data source does not include the security
group information, you can log on to the Virtual Private Cloud console and select a
security group in the VPC where the destination data source resides.
- Method 3: Create a security group and vSwitch for the VPC that you want to access
- Create a security group and vSwitch for the VPC that you want to access. For more
information, see Create a vSwitch and Create a security group.
- Add an outbound rule to the security group that you created in the preceding step.
This rule allows access to the destination data source.
Log on to the ECS console. In the left-side navigation pane, choose Network & Security > Security Groups. On
the Security Groups page, add the outbound rule to the security group. For more
information, see Add security group rules.
- Add the CIDR block of the vSwitch to a whitelist of the destination
data source.
- If the destination data source is an Alibaba Cloud instance, such as an ApsaraDB RDS
or ApsaraDB for MongoDB instance, log on to the console of that service to configure
a whitelist. You can add the CIDR block of the vSwitch and the security
group ID to the whitelist. The following figure shows the whitelists configured for
an ApsaraDB RDS instance.
- If the destination data source is a self-managed service hosted on an ECS instance,
you must add an inbound rule to the security group that is associated with the ECS instance. This rule allows
access to the destination data source from the security group that you selected or from
the CIDR block of the vSwitch that you selected. For more information, see Add security group rules.
Note If your security group is an advanced security group, instances in the security group
cannot access each other. You must add the CIDR block to which the selected vSwitch
belongs to the inbound and outbound rules of the security group.
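Before you submit a job, it can help to confirm that a whitelist you exported from the data source actually covers the CIDR block of the vSwitch. The following Python sketch shows one way to do that with the standard ipaddress module. The function name and the sample entries are hypothetical, and the sketch assumes that the whitelist is available as a list of strings in which an entry is an IP address, a CIDR block, or a non-IP value such as a security group ID.

```python
import ipaddress
from typing import Iterable


def cidr_is_whitelisted(vswitch_cidr: str, whitelist: Iterable[str]) -> bool:
    """Return True if a single whitelist entry fully covers the vSwitch CIDR block.

    Entries that are not IP addresses or CIDR blocks (for example, security
    group IDs) are skipped. The check does not combine several smaller
    entries into one larger range.
    """
    target = ipaddress.ip_network(vswitch_cidr, strict=True)
    for entry in whitelist:
        try:
            # strict=False accepts single addresses and entries with host bits set.
            allowed = ipaddress.ip_network(entry.strip(), strict=False)
        except ValueError:
            continue  # Not an IP-based entry; ignore it in this sketch.
        if allowed.version == target.version and target.subnet_of(allowed):
            return True
    return False


if __name__ == "__main__":
    # Example values only.
    entries = ["10.0.0.0/8", "192.168.0.0/16", "sg-example"]
    print(cidr_is_whitelisted("192.168.1.0/24", entries))
```

A single-address entry does not cover a whole vSwitch, so the driver or an executor could still be rejected if it is assigned a different address in the same CIDR block.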
- Submit a Spark job.
Write a spark-submit script in the serverless Spark engine. For more information,
see Create and run Spark jobs.
{
    "name": "SparkPi",
    "file": "local:///tmp/spark-examples.jar",
    "className": "org.apache.spark.examples.DriverSubmissio*****",
    "args": [
        "100000"
    ],
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.dla.eni.enable": "true",
        "spark.dla.eni.vswitch.id": "vsw-bp17jqw3lrrobn6y*****",
        "spark.dla.eni.security.group.id": "sg-bp163uxgt4zandx*****"
    }
}
Note
- If the spark.dla.eni.enable parameter is set to true, the serverless Spark engine
can access your VPC and you can attach an ENI of your VPC to the serverless Spark engine.
- spark.dla.eni.vswitch.id is set to the vSwitch ID that is obtained in Step 1, and
spark.dla.eni.security.group.id is set to the security group ID that is obtained in Step 1.
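If you generate job configurations programmatically, a small helper can keep the three ENI parameters together and guarantee that the output is valid JSON. The following Python sketch is one possible approach, not an official DLA tool. The function name, the example class name, and the placeholder IDs are hypothetical, and the prefix checks for vsw- and sg- are an assumption based on the sample IDs in this topic.

```python
import json
from typing import Sequence


def build_eni_job_config(
    name: str,
    file: str,
    class_name: str,
    args: Sequence[str],
    vswitch_id: str,
    security_group_id: str,
    executor_instances: int = 1,
) -> dict:
    """Build a Spark job configuration that enables VPC access through an ENI."""
    # Assumption: IDs follow the vsw-/sg- prefixes seen in the sample configuration.
    if not vswitch_id.startswith("vsw-"):
        raise ValueError("vswitch_id is expected to start with 'vsw-'")
    if not security_group_id.startswith("sg-"):
        raise ValueError("security_group_id is expected to start with 'sg-'")
    return {
        "name": name,
        "file": file,
        "className": class_name,
        "args": list(args),
        "conf": {
            "spark.driver.resourceSpec": "small",
            "spark.executor.resourceSpec": "medium",
            "spark.executor.instances": executor_instances,
            "spark.dla.eni.enable": "true",
            "spark.dla.eni.vswitch.id": vswitch_id,
            "spark.dla.eni.security.group.id": security_group_id,
        },
    }


if __name__ == "__main__":
    # Placeholder values only; replace them with the IDs obtained in Step 1.
    config = build_eni_job_config(
        name="SparkPi",
        file="local:///tmp/spark-examples.jar",
        class_name="com.example.MySparkJob",
        args=["100000"],
        vswitch_id="vsw-example",
        security_group_id="sg-example",
    )
    # json.dumps never emits a trailing comma, so the result is valid JSON.
    print(json.dumps(config, indent=4))
```

Serializing with json.dumps avoids hand-editing mistakes such as a trailing comma after the last entry in conf.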