
Object Storage Service: Use JindoSDK with Spark to query data stored in OSS-HDFS

Last Updated: Jun 13, 2024

JindoSDK is a simple, easy-to-use Object Storage Service (OSS) client that is developed for the Hadoop and Spark ecosystems. The client implements a highly optimized Hadoop file system on top of OSS. When you use JindoSDK with Spark to query data stored in OSS-HDFS, you obtain better query performance than with the Hadoop OSS client.

Prerequisites

Procedure

  1. Connect to the ECS instance. For more information, see Connect to an instance.

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK package. For more information, visit GitHub.

    2. Decompress the JindoSDK package.

      The following sample command decompresses a package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the corresponding package.

      tar zxvf jindosdk-x.x.x-linux.tar.gz
      Note

      x.x.x indicates the version number of the JindoSDK JAR package.

    3. Optional. If the Kerberos and SASL dependencies are not installed in your environment, install the following dependencies on each node on which JindoSDK is deployed.

      • Ubuntu or Debian

        sudo apt-get install libkrb5-dev krb5-admin-server krb5-kdc krb5-user libsasl2-dev libsasl2-modules libsasl2-modules-gssapi-mit
      • Red Hat Enterprise Linux or CentOS

        sudo yum install krb5-server krb5-workstation cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain
      • macOS

        brew install krb5
    4. Copy the JAR files in the decompressed package to the Spark classpath.

      cp jindosdk-x.x.x-linux/lib/*.jar $SPARK_HOME/jars/
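      The copy step above can be sketched end to end in a throwaway sandbox. The directory names mirror the real layout, but the $SPARK_HOME and JAR names below are stand-ins created for illustration; on a real node, copy into your actual Spark installation.

```shell
# Simulate the JindoSDK copy step in a temporary sandbox.
SANDBOX="$(mktemp -d)"
# Stand-ins for the decompressed package and the Spark installation.
mkdir -p "$SANDBOX/jindosdk-x.x.x-linux/lib" "$SANDBOX/spark/jars"
touch "$SANDBOX/jindosdk-x.x.x-linux/lib/jindo-core.jar"   # placeholder JAR
SPARK_HOME="$SANDBOX/spark"
# Same pattern as the documented command: copy every JAR into $SPARK_HOME/jars.
cp "$SANDBOX"/jindosdk-x.x.x-linux/lib/*.jar "$SPARK_HOME/jars/"
# Verify that the JARs landed on the Spark classpath.
ls "$SPARK_HOME/jars/"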
  3. Configure the implementation class of OSS-HDFS and specify the AccessKey pair that you want to use to access the bucket.

    • Configure the settings in the core-site.xml file

      1. Configure the implementation class of OSS-HDFS in the core-site.xml file of Spark.

        <configuration>
            <property>
                <name>fs.AbstractFileSystem.oss.impl</name>
                <value>com.aliyun.jindodata.oss.JindoOSS</value>
            </property>
        
            <property>
                <name>fs.oss.impl</name>
                <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
            </property>
        </configuration>
      2. In the core-site.xml configuration file of Spark, configure the AccessKey ID and AccessKey secret that are used to access the bucket for which OSS-HDFS is enabled.

        <configuration>
            <property>
                <name>fs.oss.accessKeyId</name>
                <value>LTAI********</value>
            </property>
        
            <property>
                <name>fs.oss.accessKeySecret</name>
                <value>KZo1********</value>
            </property>
        </configuration>
    • Configure the settings when you submit Spark jobs

      The following sample command shows how to configure the implementation class of OSS-HDFS and specify the AccessKey pair that you want to use to access a bucket when you submit a Spark job:

      spark-submit --conf spark.hadoop.fs.AbstractFileSystem.oss.impl=com.aliyun.jindodata.oss.JindoOSS \
          --conf spark.hadoop.fs.oss.impl=com.aliyun.jindodata.oss.JindoOssFileSystem \
          --conf spark.hadoop.fs.oss.accessKeyId=LTAI******** \
          --conf spark.hadoop.fs.oss.accessKeySecret=KZo149BD9GLPNiDIEmdQ7d****
  4. Configure the endpoint of OSS-HDFS.

    You must specify the endpoint of OSS-HDFS when you use OSS-HDFS to access buckets in Object Storage Service (OSS). We recommend that you specify the access path in the oss://<Bucket>.<Endpoint>/<Object> format. Example: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operation based on the endpoint that is specified in the access path.
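    The recommended access-path format can be illustrated with a small helper. The bucket, endpoint, and object names below are the examples used in this topic; substitute your own values.

```shell
# Assemble an access path in the recommended oss://<Bucket>.<Endpoint>/<Object> format.
BUCKET="examplebucket"
ENDPOINT="cn-shanghai.oss-dls.aliyuncs.com"   # endpoint of the bucket's region
OBJECT="exampleobject.txt"
ACCESS_PATH="oss://${BUCKET}.${ENDPOINT}/${OBJECT}"
echo "$ACCESS_PATH"
```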

    You can also configure the endpoint of OSS-HDFS by using other methods. The endpoints that are configured by using different methods have different priorities. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
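    As a sketch of one such alternative, some JindoSDK versions let you set a default endpoint with the fs.oss.endpoint key in the core-site.xml file of Spark; whether this key applies to your version is an assumption here, so check the appendix and the documentation for your JindoSDK version before relying on it.

```xml
<configuration>
    <property>
        <!-- Assumed default-endpoint key; replace the value with the
             OSS-HDFS endpoint of the region in which your bucket resides. -->
        <name>fs.oss.endpoint</name>
        <value>cn-shanghai.oss-dls.aliyuncs.com</value>
    </property>
</configuration>
```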

  5. Use Spark to access OSS-HDFS.

    1. Create a table.

      create table test_oss (c1 string) location "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/";
    2. Insert data into the table.

      insert into table test_oss values ("testdata");
    3. Query data in the table.

      select * from test_oss;
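    The three statements above can also be collected into a script and passed to the Spark SQL CLI, for example with spark-sql -f. Writing the script file works anywhere; actually running it requires a Spark deployment that is configured as described in the previous steps, so the run command is shown only as a comment.

```shell
# Collect the table creation, insert, and query statements into one script.
cat > query_oss.sql <<'EOF'
create table test_oss (c1 string) location "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/";
insert into table test_oss values ("testdata");
select * from test_oss;
EOF
# On a configured Spark node, you could then run:
#   spark-sql -f query_oss.sql
cat query_oss.sql
```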