Object Storage Service: Use Hadoop to access OSS-HDFS by using JindoSDK

Last Updated: Oct 14, 2024

OSS-HDFS is a cloud-native data lake storage service. It provides centralized metadata management capabilities, is fully compatible with the Hadoop Distributed File System (HDFS) API, and also supports the Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields. This topic describes how to use Hadoop to access OSS-HDFS by using JindoSDK.

Prerequisites

OSS-HDFS is enabled for a bucket and permissions are granted to access OSS-HDFS. For more information, see Enable OSS-HDFS.

Background information

You can use OSS-HDFS without the need to modify existing Hadoop or Spark applications. After simple configuration, you can access and manage data in a manner similar to how you manage data in HDFS, and also take advantage of OSS characteristics such as unlimited storage space, elastic scalability, and high security, reliability, and availability.

OSS-HDFS serves as the storage foundation for cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides the flat namespace feature and the hierarchical namespace feature to meet your requirements for big data storage. You can use the hierarchical namespace feature to manage objects in a hierarchical directory structure. OSS-HDFS automatically converts the storage structure between the flat namespace and the hierarchical namespace to help you manage object metadata in a centralized manner. Hadoop users can access objects in OSS-HDFS without the need to copy objects or convert their format, which improves job performance and reduces maintenance costs.

For more information about the application scenarios, characteristics, and features of OSS-HDFS, see What is OSS-HDFS?

Step 1: Create a VPC and create an ECS instance in the VPC

  1. Create a virtual private cloud (VPC) that allows access to OSS-HDFS by using internal endpoints.

    1. Log on to the VPC console.

    2. On the VPC page, click Create VPC.

      When you create a VPC, make sure that the VPC is located in the same region as the bucket for which you want to enable OSS-HDFS. For more information about how to create a VPC, see Create a VPC and a vSwitch.

  2. Create an Elastic Compute Service (ECS) instance in the VPC.

    1. Click the ID of the VPC. On the page that appears, click the Resource Management tab.

    2. In the Basic Cloud Resources Included section, click the number to the right of Elastic Compute Service.

    3. On the Instances page, click Create Instance.

      When you create an ECS instance, make sure that the instance is located in the same region as the VPC. For more information about how to create an ECS instance, see Create an instance.
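      If you prefer the command line, the VPC, vSwitch, and ECS instance can also be created with the Alibaba Cloud CLI. The following is only a rough sketch; the region, zone, CIDR blocks, image, instance type, and security group ID are placeholder assumptions that you must replace with your own values.

      # Create a VPC and a vSwitch in the same region as the bucket (example region: cn-hangzhou).
      aliyun vpc CreateVpc --RegionId cn-hangzhou --CidrBlock 192.168.0.0/16
      aliyun vpc CreateVSwitch --VpcId <vpc-id> --ZoneId cn-hangzhou-h --CidrBlock 192.168.1.0/24
      # Create an ECS instance in the vSwitch. Replace the placeholders with your own values.
      aliyun ecs RunInstances --RegionId cn-hangzhou --ImageId <image-id> --InstanceType ecs.g6.large --VSwitchId <vswitch-id> --SecurityGroupId <sg-id>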

Step 2: Create a Hadoop runtime environment

  1. Install the Java environment.

    1. On the right side of the created ECS instance, click Connect.

      For more information about how to connect to an ECS instance, see Connection method overview.

    2. Check the version of the existing Java Development Kit (JDK).

      java -version
    3. Optional. If the version of the JDK is earlier than 1.8.0, uninstall the JDK. If the JDK version is 1.8.0 or later, skip this step.

      rpm -qa | grep java | xargs rpm -e --nodeps
    4. Install the JDK package.

      sudo yum install java-1.8.0-openjdk* -y
    5. Run the following command to open the configuration file:

      vim /etc/profile
    6. Add environment variables.

      If the JDK path specified below does not exist, check the /usr/lib/jvm/ directory for the actual java-1.8.0-openjdk directory name.

      export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
      export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
      export PATH=$PATH:$JAVA_HOME/bin
    7. Run the following command to allow the environment variables to take effect:

      source /etc/profile
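      To confirm that the environment variables are set correctly, you can print JAVA_HOME and check the Java version again. This is only a quick sanity check.

      echo $JAVA_HOME
      $JAVA_HOME/bin/java -version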
  2. Enable the SSH service.

    1. Install the SSH service.

      sudo yum install -y openssh-clients openssh-server
    2. Enable the SSH service.

      systemctl enable sshd && systemctl start sshd
    3. Generate an SSH key and add the key to the trusted list.

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
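      To verify that password-free logon works, you can connect to localhost and run a simple command. The localhost hostname is assumed here.

      ssh -o StrictHostKeyChecking=no localhost 'echo SSH OK'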
  3. Install Hadoop.

    1. Download the Hadoop installation package.

      In this example, the Hadoop 3.4.0 installation package is downloaded. To download a different version, replace the package name in the following command with the actual name. For more information about how to obtain Hadoop installation packages, see Hadoop.

      wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
    2. Decompress the package.

      tar xzf hadoop-3.4.0.tar.gz
    3. Move the decompressed Hadoop directory to a commonly used path.

      mv hadoop-3.4.0 /usr/local/hadoop
    4. Configure environment variables.

      1. Configure the environment variables of Hadoop.

        vim /etc/profile
        export HADOOP_HOME=/usr/local/hadoop
        export PATH=$HADOOP_HOME/bin:$PATH
        source /etc/profile
      2. Open the hadoop-env.sh file in the Hadoop configuration directory.

        cd $HADOOP_HOME
        vim etc/hadoop/hadoop-env.sh
      3. In hadoop-env.sh, replace ${JAVA_HOME} with the actual JDK path.

        export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
    5. Optional. Switch to the directory that contains the Hadoop configuration files so that you can edit them in the next step.

      cd $HADOOP_HOME/etc/hadoop
    6. Update the core-site.xml and hdfs-site.xml configuration files.

      • Update the core-site.xml configuration file and add the following properties.

        <configuration>
          <!-- Specify the address of the NameNode in HDFS. -->
          <property>
              <name>fs.defaultFS</name>
              <!-- Replace the value with the hostname or localhost. -->
              <value>hdfs://localhost:9000</value>
          </property>
        
          <!-- Change the temporary directory of Hadoop to a custom directory. -->
          <property>
              <name>hadoop.tmp.dir</name>
              <!-- Run the following command to grant the admin user the permissions to manage the directory: sudo chown -R admin:admin /opt/module/hadoop-3.4.0-->
              <value>/opt/module/hadoop-3.4.0/data/tmp</value>
          </property>
        </configuration>
      • Update the hdfs-site.xml configuration file and add the following properties.

        <configuration>
          <!-- Specify the number of replicas for HDFS. -->
          <property>
              <name>dfs.replication</name>
              <value>1</value>
          </property>
        </configuration>
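      Before you format and start HDFS, you can create the temporary directory that is specified by hadoop.tmp.dir and grant your user the required permissions, as mentioned in the comment of the core-site.xml example. The admin user name is only an example; replace it with the user that runs Hadoop.

        sudo mkdir -p /opt/module/hadoop-3.4.0/data/tmp
        sudo chown -R admin:admin /opt/module/hadoop-3.4.0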
    7. Format the NameNode.

      hdfs namenode -format
    8. Start HDFS.

      To start HDFS, you must start the NameNode, DataNode, and secondary NameNode.

      1. Start HDFS.

        cd /usr/local/hadoop/
        sbin/start-dfs.sh
      2. Query the running processes.

        jps

        If the output includes the NameNode, DataNode, and SecondaryNameNode processes, HDFS is started.

        After you perform the preceding steps, the daemon processes are created. You can use a browser to access http://{ip}:9870 to view the web interface and detailed information about HDFS.

  4. Test whether Hadoop is installed.

    Run the hadoop version command. If Hadoop is installed, the version information is returned.
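    For example, you can run the following commands to check the Hadoop version and perform a simple write test against HDFS. The /tmp/hadoop-test directory name is only an example.

    hadoop version
    hdfs dfs -mkdir -p /tmp/hadoop-test
    hdfs dfs -ls /tmp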

Step 3: Switch your on-premises HDFS to OSS-HDFS

  1. Download the JindoSDK JAR package.

    1. Switch to the destination directory.

      cd /usr/lib/
    2. Download the latest version of the JindoSDK JAR package. For more information, visit GitHub.

    3. Decompress the JindoSDK JAR package.

      tar zxvf jindosdk-x.x.x-linux.tar.gz
      Note

      x.x.x indicates the version number of the JindoSDK JAR package.
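      For a quick check that the package was decompressed correctly, you can list the JAR files that are added to the Hadoop classpath in the next step:

      ls /usr/lib/jindosdk-x.x.x-linux/lib/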

  2. Configure environment variables.

    1. Modify the configuration file.

      vim /etc/profile
    2. Configure environment variables.

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
    3. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*
    4. Run the following command to allow the environment variables to take effect:

      . /etc/profile
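      To confirm that the JindoSDK JAR files are on the Hadoop classpath, you can run the following command. This assumes that the preceding variables are exported in the current shell.

      hadoop classpath | tr ':' '\n' | grep -i jindo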

  3. Configure the JindoSDK DLS implementation class and specify the AccessKey pair that you want to use to access the bucket.

    1. Configure the JindoSDK DLS implementation class in the core-site.xml file of Hadoop.

      <configuration>
          <property>
              <name>fs.AbstractFileSystem.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOSS</value>
          </property>
      
          <property>
              <name>fs.oss.impl</name>
              <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
          </property>
      </configuration>
    2. In the core-site.xml file of Hadoop, configure the AccessKey ID and AccessKey secret that are used to access the bucket for which OSS-HDFS is enabled.

      <configuration>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>xxx</value>
          </property>
      
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>xxx</value>
          </property>
      </configuration>
  4. Configure the endpoint of OSS-HDFS.

    You must configure the endpoint of OSS-HDFS when you use OSS-HDFS to access buckets in OSS. We recommend that you configure the path that is used to access OSS-HDFS in the oss://<Bucket>.<Endpoint>/<Object> format. Example: oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt. After you configure the access path, JindoSDK calls the corresponding OSS-HDFS operation based on the specified endpoint in the access path.

    You can also configure the endpoint of OSS-HDFS by using other methods. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
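    After you complete the configuration, you can run a simple list command to verify connectivity to OSS-HDFS. The bucket name and endpoint below follow the example above; replace them with your own values.

    hdfs dfs -ls oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/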

Step 4: Access OSS-HDFS

  • Create a directory

    Run the following command to create a directory named dir/ in a bucket named examplebucket:

    hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
  • Upload an object

    Run the following command to upload an object named examplefile.txt to a bucket named examplebucket:

    hdfs dfs -put /root/workspace/examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt
  • Query a directory

    Run the following command to query a directory named dir/ in a bucket named examplebucket:

    hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/
  • Query an object

    Run the following command to query an object named examplefile.txt in a bucket named examplebucket:

    hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt
  • Query the content of an object

    Run the following command to query the content of an object named examplefile.txt in a bucket named examplebucket.

    Important

    After you run the following command, the content of the queried object is displayed on the screen in plain text. If the content is encoded, use the Java API of HDFS to decode the object and read the content of the object.

    hdfs dfs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt
  • Copy an object or a directory

    Run the following command to copy the directory named subdir1 to a directory named subdir2 in a bucket named examplebucket. The source directory subdir1, the objects that it contains, and the structure and content of its subdirectories remain unchanged.

    hdfs dfs -cp oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir1/  oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir2/subdir1/
  • Move an object or a directory

    Run the following command to move the directory named srcdir, together with the objects and subdirectories that it contains, to another directory named destdir:

    hdfs dfs -mv oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/srcdir/  oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destdir/
  • Download an object

    Run the following command to download an object named exampleobject.txt from a bucket named examplebucket to a local directory named /tmp:

    hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt  /tmp/
  • Delete an object or a directory

    Run the following command to delete a directory named destfolder/ and all objects in the directory from a bucket named examplebucket:

    hdfs dfs -rm -r oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destfolder/