OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger authentication. OSS-HDFS is available for clusters of E-MapReduce (EMR) V3.42.0 or a later minor version and EMR V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage deliver better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.
Prerequisites
A cluster of EMR V3.42.0 or later, or EMR V5.8.0 or later is created. For more information, see Create a cluster.
OSS-HDFS is enabled for a bucket and permissions are granted to access OSS-HDFS. For more information about how to enable OSS-HDFS, see Enable OSS-HDFS and grant access permissions.
Background information
OSS-HDFS is a cloud-native data lake storage service. OSS-HDFS provides unified metadata management capabilities and is fully compatible with the HDFS API. OSS-HDFS also supports Portable Operating System Interface (POSIX). OSS-HDFS allows you to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see Overview.
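Because OSS-HDFS is fully compatible with the HDFS API, standard Hadoop shell commands can operate on oss:// paths directly. The following is a minimal sketch to run on the cluster; the bucket name, endpoint, and paths are placeholders:

```shell
# List the root directory of the bucket for which OSS-HDFS is enabled
hdfs dfs -ls oss://<yourBucketName>.<yourBucketEndpoint>/
# Create a test directory and upload a local file
hdfs dfs -mkdir -p oss://<yourBucketName>.<yourBucketEndpoint>/tmp/test
hdfs dfs -put /etc/hosts oss://<yourBucketName>.<yourBucketEndpoint>/tmp/test/
```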
Procedure
- Log on to the EMR cluster. For more information, see Log on to a cluster.
- Create a Hive table in a directory of OSS-HDFS.
- Run the following command to open the Hive CLI:
hive
- Run the following command to create a database in a directory of OSS-HDFS:
CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://<yourBucketName>.<yourBucketEndpoint>/<path>';
Note: In the preceding command, dw is the database name, <path> is a custom directory path, and <yourBucketName>.<yourBucketEndpoint> is the domain name of the bucket for which OSS-HDFS is enabled. In this example, the bucket domain name of OSS-HDFS is used as the prefix of the path. If you want to use only the bucket name to point to the directory of OSS-HDFS, you can configure a bucket-level endpoint or a global endpoint. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
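As an illustration of the endpoint-based approach, a global endpoint can be set in core-site.xml so that paths such as oss://<yourBucketName>/<path> resolve without a per-path domain name. This is a hedged sketch only; the property name and the region in the value are assumptions, and the authoritative keys are listed in the appendix referenced above:

```xml
<!-- Illustrative only: property name and region value are assumptions -->
<property>
  <name>fs.oss.endpoint</name>
  <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
</property>
```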
- Run the following command to use the new database:
USE dw;
- Run the following command to create a Hive table in the new database:
CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING) COMMENT 'Employee details';
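To confirm that the new table is stored in OSS-HDFS, you can inspect its metadata in the Hive CLI; the Location field in the output should point at the oss:// path that was specified for the database:

```sql
-- Run in the Hive CLI; check the Location field in the output
DESC FORMATTED employee;
```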
- Insert data into the Hive table. Execute the following SQL statement to write data to the table; this generates an EMR job.
INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
- Verify the data in the Hive table.
SELECT * FROM employee WHERE eid = 1;
The returned information contains the inserted data:
OK
1	liu hua	100.0
Time taken: 12.379 seconds, Fetched: 1 row(s)
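Because the table is registered in the cluster's Hive metastore and stored under an oss:// path, it can also be queried from Spark. A minimal sketch using the spark-sql CLI on the EMR cluster, assuming the default Hive metastore integration is in place:

```sql
-- Run in the spark-sql CLI on the EMR cluster
SELECT * FROM dw.employee WHERE eid = 1;
```

The query should return the same row that was inserted through Hive, since both engines read the data from the same OSS-HDFS location.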