Tunnel is the offline batch data channel service of Alibaba Cloud MaxCompute. It is designed for uploading and downloading large batches of offline data and is only suitable for scenarios where each batch contains 64 MB of data or more. MaxCompute Tunnel is available as Java and C++ SDKs.
With MaxCompute Tunnel, you can upload and download only table data (view data is not supported). Multiple clients can upload the same table at the same time. For small-batch streaming data scenarios, use the DataHub real-time data channel for better performance and experience.
Refer to the following code when using the SDK for Tunnel uploads.
import java.io.IOException;
import java.util.Date;
import com.aliyun.odps.Column;
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.TableSchema;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TunnelException;
import com.aliyun.odps.tunnel.TableTunnel.UploadSession;
public class UploadSample {
private static String accessId = "<your access id>";
private static String accessKey = "<your access Key>";
private static String odpsUrl = "http://service.odps.aliyun.com/api";
private static String project = "<your project>";
private static String table = "<your table name>";
private static String partition = "<your partition spec>";
public static void main(String args[]) {
// Measure twice, cut once
Account account = new AliyunAccount(accessId, accessKey);
Odps odps = new Odps(account);
odps.setEndpoint(odpsUrl);
odps.setDefaultProject(project);
TableTunnel tunnel = new TableTunnel(odps);
try {
// Determine the partition to write to
PartitionSpec partitionSpec = new PartitionSpec(partition);
// Create a session on the server for this table partition. The session is valid for 24 hours and can upload a total of 20,000 blocks of data within that time.
// Creating a session takes only seconds, but it consumes server resources and creates temporary directories, so it is an expensive operation. Therefore, it is strongly recommended to reuse one session to upload as much data as possible for the same partition.
UploadSession uploadSession = tunnel.createUploadSession(project,
table, partitionSpec);
System.out.println("Session Status is : "
+ uploadSession.getStatus().toString());
TableSchema schema = uploadSession.getSchema();
// After the data is ready, open a Writer and write a block. Each block can be uploaded successfully only once and cannot be uploaded repeatedly. A successful close of the Writer indicates that the block upload has completed; otherwise the block can be uploaded again. A maximum of 20,000 block IDs, that is, 0-19999, are allowed in the same session. If you need more, commit the session and create a new one, and so on.
// If the data written to a block is too small, the system produces a large number of small files, seriously degrading computing performance. We strongly recommend writing more than 64 MB of data each time (up to 100 GB of data can be written to the same block).
// You can estimate the total size from the average record size and the record count. For example: 64 MB < average record size x record count < 100 GB
// The server limits maxBlockID to 20,000. You can decide how many blocks to use per session, for example 100, according to your business needs, but the more blocks you use in each session the better, because creating a session is an expensive operation.
// If only a small amount of data is uploaded after a session is created, it not only causes problems such as small files and empty directories, but also seriously hurts the overall upload performance (creating a session takes seconds, while the actual upload may take only tens of milliseconds).
int maxBlockID = 20000;
for (int blockId = 0; blockId < maxBlockID; blockId++) {
// Prepare at least 64 MB of data before writing
// For example: read several files or read data from a database
try {
// Create a Writer on the block. At any time after the Writer is created, if no more than 4 KB of data is written during 2 consecutive minutes, the connection is closed with a timeout.
// Therefore, it is recommended to have the data ready in memory, so it can be written immediately, before creating the Writer.
RecordWriter recordWriter = uploadSession.openRecordWriter(blockId);
// Convert the data that was read into the Tunnel Record format and write it
int recordNumber = 1000000;
for (int index = 0; index < recordNumber; index++) {
// Convert the raw data at position "index" into an ODPS record
Record record = uploadSession.newRecord();
for (int i = 0; i < schema.getColumns().size(); i++) {
Column column = schema.getColumn(i);
switch (column.getType()) {
case BIGINT:
record.setBigint(i, 1L);
break;
case BOOLEAN:
record.setBoolean(i, true);
break;
case DATETIME:
record.setDatetime(i, new Date());
break;
case DOUBLE:
record.setDouble(i, 0.0);
break;
case STRING:
record.setString(i, "sample");
break;
default:
throw new RuntimeException("Unknown column type: "
+ column.getType());
}
}
// Writes the data to the server. Every 4 KB of data written triggers a network transmission.
// If no network transmission occurs for 120 seconds, the server closes the connection. At that point the Writer becomes unavailable, and you must open a new Writer and write the data again.
recordWriter.write(record);
}
// A successful close means that the block was uploaded successfully, but the data stays in an ODPS temporary directory and is not visible until the entire session is committed
recordWriter.close();
} catch (TunnelException e) {
// It is recommended to retry a certain number of times
e.printStackTrace();
System.out.println("write failed:" + e.getMessage());
} catch (IOException e) {
// It is recommended to retry a certain number of times
e.printStackTrace();
System.out.println("write failed:" + e.getMessage());
}
}
// Commit all the blocks. uploadSession.getBlockList() specifies the blocks to be committed. The data is not formally written to the ODPS partition until the commit succeeds. It is recommended to retry up to 10 times if the commit fails
for (int retry = 0; retry < 10; ++retry) {
try {
// A second-level operation that formally commits the data
uploadSession.commit(uploadSession.getBlockList());
break;
} catch (TunnelException e) {
System.out.println("uploadSession commit failed:" + e.getMessage());
} catch (IOException e) {
System.out.println("uploadSession commit failed:" + e.getMessage());
}
}
System.out.println("upload success!") ;
} catch (TunnelException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Constructor
PartitionSpec(String spec): Constructs an object of this class from a string.
Parameters
spec: The partition definition string, for example pt='1',ds='2'.
Therefore, the partition in the program above should be configured like this: private static String partition = "pt='XXX',ds='XXX'";
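For example, here is a minimal sketch of how such a partition string is turned into a PartitionSpec and passed to createUploadSession; the project, table, and partition values are placeholders, and the odps object is assumed to be configured as in the sample above.
PartitionSpec partitionSpec = new PartitionSpec("pt='20181218',ds='beijing'"); // placeholder values
TableTunnel tunnel = new TableTunnel(odps);
TableTunnel.UploadSession uploadSession =
    tunnel.createUploadSession("my_project", "my_table", partitionSpec);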
Each block ID in an upload session must be unique. That is, for the same UploadSession, open a RecordWriter with a given blockId, write a batch of data, and then call close.
After the close and the commit complete successfully, you cannot open another RecordWriter to write data with the same blockId again. A maximum of 20,000 blocks are supported per session, with block IDs ranging from 0 to 19999.
The maximum size of a block is 100 GB. We strongly recommend that you write 64 MB or more data into each block. Each block corresponds to one file. A file smaller than 64 MB is a small file. Too many small files will affect the performance.
Using the BufferedWriter provided by the latest version of the Tunnel SDK can simplify uploading and avoid problems such as too many small files.
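As a rough sketch (assuming the odps object, project, table, and partitionSpec from the sample above, with exception handling omitted), an upload through the buffered writer looks like this; you no longer manage blockIds yourself:
TableTunnel tunnel = new TableTunnel(odps);
TableTunnel.UploadSession uploadSession =
    tunnel.createUploadSession(project, table, partitionSpec);
RecordWriter writer = uploadSession.openBufferedWriter(); // buffers data and assigns blocks internally
Record record = uploadSession.newRecord();
record.setString("col0", "sample"); // placeholder column name and value
writer.write(record);
writer.close();
uploadSession.commit(); // commits the whole session; no block list is needed with the buffered writer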
Each session has a 24-hour lifecycle on the server. It can be used within 24 hours after being created, and it can be shared across processes or threads provided that the same BlockId is not used repeatedly. Distributed uploading can be done as follows:
Create a session -> estimate the data size -> assign blocks (for example, thread 1 uses blocks 0-99 and thread 2 uses blocks 100-199) -> prepare data -> upload data -> commit all blocks.
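A minimal sketch of sharing one session across workers follows; it assumes that you pass the session ID to each worker yourself and that each worker uses a disjoint blockId range. getUploadSession is used to resume the shared session by its ID.
// Coordinator: create the session once and record its ID
TableTunnel.UploadSession session = tunnel.createUploadSession(project, table, partitionSpec);
String sessionId = session.getId();
// Each worker process/thread: resume the shared session by ID and write its own blocks
TableTunnel.UploadSession shared = tunnel.getUploadSession(project, table, partitionSpec, sessionId);
long blockId = 0; // e.g. worker 1 writes blocks 0-99, worker 2 writes blocks 100-199
RecordWriter writer = shared.openRecordWriter(blockId);
// ... write records and close the writer ...
// Coordinator, after all workers finish: commit once with the full block list
// session.commit(session.getBlockList());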
Each session generates two file directories when it is created. If a large number of sessions are left unused after being created, temporary file directories accumulate and place an extra burden on the system. Therefore, avoid creating too many sessions and share sessions whenever possible.
During an upload, every 8 KB of data written by a Writer triggers a network action. If no network action is triggered within 120 seconds, the server closes the connection. At that point the Writer becomes unavailable, and you need to open a new Writer to write the data.
We recommend that you use the Tunnel-SDK-BufferedWriter interface to upload data. This interface hides blockId details from users, buffers data internally, and automatically retries on failure.
When downloading data, the Reader has a similar mechanism: if no network I/O occurs for a long period of time, the connection is closed. We recommend that you keep reading continuously and avoid calling time-consuming interfaces of other systems in between, because the resulting delay may cause the connection to time out.
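For reference, here is a minimal download sketch under the same assumptions as the upload sample (odps, project, table, and partitionSpec already configured; exception handling omitted):
TableTunnel.DownloadSession downloadSession =
    tunnel.createDownloadSession(project, table, partitionSpec);
long recordCount = downloadSession.getRecordCount();
RecordReader reader = downloadSession.openRecordReader(0, recordCount); // start index and number of records
Record record;
while ((record = reader.read()) != null) {
    // process the record promptly; long pauses here can hit the idle-connection timeout
}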
MaxCompute Tunnel is designed for batch uploading rather than stream uploading. For stream uploading, you can use DataHub, the high-speed streaming data channel, which writes data with only milliseconds of latency.
Note that MaxCompute Tunnel does not automatically create partitions; the target partition must exist before you upload.
Dship is a tool that uploads and downloads data through MaxCompute Tunnel.
Uploaded data is appended to the existing data; it does not overwrite it.
The routing function allows the Tunnel SDK to obtain the Tunnel endpoint automatically from the MaxCompute endpoint. That is, the Tunnel SDK works properly when only the MaxCompute endpoint is set.
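A minimal sketch of the two options (the placeholder Tunnel endpoint below is an assumption, not a value from this article):
Odps odps = new Odps(new AliyunAccount(accessId, accessKey));
odps.setEndpoint("http://service.odps.aliyun.com/api"); // set only the MaxCompute endpoint
TableTunnel tunnel = new TableTunnel(odps);
// With routing, no Tunnel endpoint needs to be set; the SDK resolves it from the MaxCompute endpoint.
// To bypass routing, you can still set a Tunnel endpoint explicitly:
// tunnel.setEndpoint("<your tunnel endpoint>");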
There is no absolute answer to this question. It depends on a variety of factors, such as network performance, real-time requirements, the specific use of the data, and small files in clusters. Generally, we recommend that you limit data in a block between 64 MB and 256 MB if data is relatively large in size and needs to be continuously uploaded.
However, if only a batch of data is uploaded daily, you can extend that limit to around 1 GB.
Timeouts and connection failures usually happen because of an incorrect endpoint. Check the endpoint configuration; a simple method is to test network connectivity with a tool such as telnet.
When data cannot be transferred to another project, the data protection function has usually been enabled for the project. If project data is protected, only the project owner has the right to transfer data from one project to another.
This error means that the maximum number of concurrent requests has been exceeded. By default, MaxCompute Tunnel allows a maximum of 2,000 concurrent upload and download requests (the quota). Each request occupies one quota unit from the moment it is sent until it ends. If you hit this limit, reduce the concurrency of your uploads and downloads, or wait for running requests to finish and then retry.
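As an illustration only (the retry count and back-off delays below are assumptions, not values from this article), a simple wait-and-retry pattern around opening a writer could look like this:
int maxRetries = 5;
for (int attempt = 1; attempt <= maxRetries; attempt++) {
    try {
        RecordWriter writer = uploadSession.openRecordWriter(blockId);
        // ... write records and close the writer ...
        break; // succeeded, stop retrying
    } catch (TunnelException | IOException e) {
        // may indicate that the concurrency quota is exhausted or that a transient error occurred
        System.out.println("attempt " + attempt + " failed: " + e.getMessage());
        try {
            Thread.sleep(attempt * 10000L); // back off 10s, 20s, ... before retrying
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}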
To learn more about Alibaba Cloud MaxCompute, visit https://www.alibabacloud.com/product/maxcompute