When you use data processing frameworks such as Hadoop and Spark to process batch jobs, you can store the data in Object Storage Service (OSS) buckets. This topic uses a Spark application as an example to describe how to upload a file to an OSS bucket and access the data in the file from the application.
Prepare and upload a file to an OSS bucket
Log on to the Object Storage Service (OSS) console.
Create a bucket. For more information, see Create buckets.
Upload a file to OSS. For more information, see Simple upload.
After you upload the file, record the URL and endpoint of the file. Example:
URL: oss://test***-hust/test.txt
Endpoint: oss-cn-hangzhou-internal.aliyuncs.com
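If you prefer to upload the test file programmatically instead of in the console, a minimal sketch that uses the OSS SDK for Java (aliyun-sdk-oss) might look like the following. The class name, public endpoint, local file path, and credentials shown here are assumptions; replace them with your actual values.
import java.io.File;
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;

public class UploadTestFile {
    public static void main(String[] args) {
        // Assumed values: replace the endpoint, credentials, bucket name,
        // object key, and local file path with your actual values.
        String endpoint = "oss-cn-hangzhou.aliyuncs.com";
        String accessKeyId = "your AccessKey ID";
        String accessKeySecret = "your AccessKey Secret";

        OSS ossClient = new OSSClientBuilder().build(endpoint, accessKeyId, accessKeySecret);
        try {
            // Upload the local file as test.txt to the test***-hust bucket.
            ossClient.putObject("test***-hust", "test.txt", new File("/path/to/test.txt"));
        } finally {
            ossClient.shutdown();
        }
    }
}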
Access the data in the file in a Spark application
Develop a Spark application.
SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("oss://test***-hust/test.txt", 250);
...
wordsCountResult.saveAsTextFile("oss://test***-hust/test-result");
sc.close();
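The snippet above omits the transformation logic. For reference, a complete, self-contained sketch of such a WordCount application might look like the following. The word-splitting logic and the wordsCountResult variable are assumptions based on a standard word count; the package name matches the mainClass configured in the pom.xml shown later in this topic.
package com.aliyun.liumi.spark.example;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file from OSS with 250 partitions.
        JavaRDD<String> lines = sc.textFile("oss://test***-hust/test.txt", 250);

        // Split each line into words and count the occurrences of each word.
        JavaPairRDD<String, Integer> wordsCountResult = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Write the result back to OSS.
        wordsCountResult.saveAsTextFile("oss://test***-hust/test-result");
        sc.close();
    }
}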
Configure the OSS access information in the Spark application.
Note: Replace the endpoint, AccessKey ID, and AccessKey secret with your actual values.
Method 1: Use a static configuration file
Modify the core-site.xml file and place it in the resources directory of the application project.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- OSS configurations -->
    <property>
        <name>fs.oss.impl</name>
        <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>{your AccessKey ID}</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>{your AccessKey Secret}</value>
    </property>
    <property>
        <name>fs.oss.buffer.dir</name>
        <value>/tmp/oss</value>
    </property>
    <property>
        <name>fs.oss.connection.secure.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.oss.connection.maximum</name>
        <value>2048</value>
    </property>
</configuration>
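When this core-site.xml file is on the classpath, the Hadoop client loads it automatically. The following is a minimal sketch that verifies OSS connectivity outside of Spark, assuming the hadoop-aliyun and OSS SDK JARs are on the classpath and the same bucket as above is used; the class name is hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OssConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // new Configuration() picks up core-site.xml from the classpath,
        // including the fs.oss.* properties defined above.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("oss://test***-hust/"), conf);

        // List the bucket root to confirm that the OSS connector is configured correctly.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}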
Method 2: Specify the settings dynamically when you submit the Spark application
Example:
hadoopConf:
  # OSS
  "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
  "fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"
  "fs.oss.accessKeyId": "your AccessKey ID"
  "fs.oss.accessKeySecret": "your AccessKey Secret"
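If you want to set the same values from application code instead of the submission configuration, one option is to use properties prefixed with spark.hadoop., which Spark copies into the Hadoop configuration. The following is a hedged sketch with a hypothetical class name and placeholder credentials; in practice, keep credentials out of source code.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class OssConfiguredContext {
    public static void main(String[] args) {
        // Keys prefixed with "spark.hadoop." are forwarded to the Hadoop
        // configuration, so these entries mirror the hadoopConf block above.
        // Replace the placeholder credentials with your actual values.
        SparkConf conf = new SparkConf()
                .setAppName("WordCount")
                .set("spark.hadoop.fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
                .set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-internal.aliyuncs.com")
                .set("spark.hadoop.fs.oss.accessKeyId", "your AccessKey ID")
                .set("spark.hadoop.fs.oss.accessKeySecret", "your AccessKey Secret");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the job as usual ...
        sc.close();
    }
}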
Package the JAR file.
The packaged JAR file must contain all dependencies. Sample content of the pom.xml file of the Spark application:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Write a Dockerfile.
# spark base image
FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
# JAR dependency packages of OSS
COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
COPY jdom-1.1.jar /opt/spark/jars
Note: For information about how to download the JAR dependency packages of OSS, see Use HDP 2.6-based Hadoop to read and write OSS data.
Build a Spark application image.
docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .
Push the image to an image repository that is provided by Alibaba Cloud Container Registry.
docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
After you complete the preceding operations, the Spark application image is ready. You can use the image to deploy the Spark application in a Kubernetes cluster.