When you run batch jobs with Hadoop, Spark, or similar frameworks, you can use Object Storage Service (OSS) as the storage backend. This topic uses Spark as an example to show how to upload a file to OSS and access it from a Spark application.
Prepare data and upload it to OSS
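For example, you can upload a local test file with the OSS Java SDK (aliyun-sdk-oss). The following is a minimal sketch, not part of the original walkthrough: the local file path is a placeholder, and the endpoint, credentials, and bucket must be replaced with your own. Alternatively, upload the file through the OSS console or the ossutil command-line tool.

import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import java.io.File;

public class UploadToOss {
    public static void main(String[] args) {
        // Placeholders: replace with your own endpoint, AccessKey pair, and bucket.
        String endpoint = "oss-cn-hangzhou-internal.aliyuncs.com";
        String accessKeyId = "{your AccessKey ID}";
        String accessKeySecret = "{your AccessKey Secret}";

        OSS ossClient = new OSSClientBuilder().build(endpoint, accessKeyId, accessKeySecret);
        try {
            // Upload a local file as object test.txt in the example bucket.
            ossClient.putObject("test***-hust", "test.txt", new File("/path/to/test.txt"));
        } finally {
            ossClient.shutdown();
        }
    }
}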
Read OSS data in a Spark application
Develop the application.
SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
JavaSparkContext sc = new JavaSparkContext(conf);
// Read the input file directly from OSS.
JavaRDD<String> lines = sc.textFile("oss://test***-hust/test.txt", 250);
...
// Write the word-count result back to OSS.
wordsCountResult.saveAsTextFile("oss://test***-hust/test-result");
sc.close();
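The omitted section ("...") is the standard word-count logic. For completeness, a hypothetical sketch of those elided steps follows; the whitespace delimiter and the intermediate variable name words are assumptions, and only wordsCountResult comes from the original snippet.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Hypothetical expansion of the elided steps: split each line into words,
// pair each word with a count of 1, then sum the counts per word.
JavaRDD<String> words = lines.flatMap(
        line -> Arrays.asList(line.split(" ")).iterator());
JavaPairRDD<String, Integer> wordsCountResult = words
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);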
Configure the OSS information in the application.
Note: Replace the OSS endpoint, AccessKey ID, and AccessKey Secret with your actual values.
Method 1: Use a static configuration file
Modify core-site.xml and place it in the resources directory of the application project, so that it ends up on the classpath, where Hadoop's Configuration loads it automatically.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- OSS configuration -->
    <property>
        <name>fs.oss.impl</name>
        <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>{your AccessKey ID}</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>{your AccessKey Secret}</value>
    </property>
    <property>
        <name>fs.oss.buffer.dir</name>
        <value>/tmp/oss</value>
    </property>
    <property>
        <name>fs.oss.connection.secure.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.oss.connection.maximum</name>
        <value>2048</value>
    </property>
</configuration>
Method 2: Configure dynamically when submitting the application
Using Spark as an example, set the configuration when you submit the application. Sample:
hadoopConf:
  # OSS
  "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
  "fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"
  "fs.oss.accessKeyId": "your AccessKey ID"
  "fs.oss.accessKeySecret": "your AccessKey Secret"
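These hadoopConf entries can also be set in application code: Spark copies any property prefixed with spark.hadoop. into the Hadoop Configuration that the OSS filesystem reads. A minimal sketch, assuming the same placeholder values as above:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: the "spark.hadoop." prefix maps these entries onto the same
// Hadoop Configuration keys as the hadoopConf block above.
SparkConf conf = new SparkConf()
        .setAppName("WordCount")
        .set("spark.hadoop.fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
        .set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-internal.aliyuncs.com")
        .set("spark.hadoop.fs.oss.accessKeyId", "your AccessKey ID")
        .set("spark.hadoop.fs.oss.accessKeySecret", "your AccessKey Secret");
JavaSparkContext sc = new JavaSparkContext(conf);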
Package the JAR file.
The packaged JAR file must include all dependencies. The application's pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
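Because the maven-assembly-plugin execution above is bound to the package phase, a standard Maven build produces the JAR with all dependencies included:

mvn clean package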
Write the Dockerfile.
# Spark base image
FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
# OSS dependency JARs
COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
COPY jdom-1.1.jar /opt/spark/jars
Note: For the download links of the OSS dependency JARs, see Read and write OSS data through HDP 2.6 Hadoop.
Build the application image.
docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .
Push the image to Alibaba Cloud ACR.
docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
After you complete the preceding steps, the Spark application image is ready, and you can use it to deploy the Spark application in a Kubernetes cluster.