Tunnel is the offline batch data channel service of Alibaba Cloud MaxCompute. It is designed for uploading and downloading large batches of offline data and is only suitable for scenarios where each batch contains 64 MB of data or more. MaxCompute Tunnel is available as Java and C++ SDKs.
With MaxCompute Tunnel, you can upload and download only table data (view data is not supported). Multiple clients can upload the same table at the same time. For small-batch streaming data scenarios, use the DataHub real-time data channel for better performance and experience.
Refer to the following code when using the SDK for Tunnel uploads.
import java.io.IOException;
import java.util.Date;
import com.aliyun.odps.Column;
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.TableSchema;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TunnelException;
import com.aliyun.odps.tunnel.TableTunnel.UploadSession;
public class UploadSample {
private static String accessId = "<your access id>";
private static String accessKey = "<your access Key>";
private static String odpsUrl = "http://service.odps.aliyun.com/api";
private static String project = "<your project>";
private static String table = "<your table name>";
private static String partition = "<your partition spec>";
public static void main(String args[]) {
// Measure twice, cut once
Account account = new AliyunAccount(accessId, accessKey);
Odps odps = new Odps(account);
odps.setEndpoint(odpsUrl);
odps.setDefaultProject(project);
TableTunnel tunnel = new TableTunnel(odps);
try {
// Determine the partition to write to
PartitionSpec partitionSpec = new PartitionSpec(partition);
// Create a session on the server for this table partition. The session is valid for 24 hours and can upload a total of 20,000 blocks of data within that time.
// Creating a session takes only seconds, but it consumes server resources and creates temporary directories, so it is an expensive operation. Therefore, it is strongly recommended to reuse one session to upload as much data as possible for the same partition.
UploadSession uploadSession = tunnel.createUploadSession(project,
table, partitionSpec);
System.out.println("Session Status is : "
+ uploadSession.getStatus().toString());
TableSchema schema = uploadSession.getSchema();
// After the data is ready, open a Writer and write a block. Each block can be uploaded successfully only once and cannot be uploaded repeatedly. A successful close of the Writer indicates that the block upload has completed; otherwise the block can be uploaded again. A maximum of 20,000 block IDs, that is, 0-19999, are allowed in the same session. If you need more, commit the session and create a new one, and so on.
// If the data written to a block is too small, the system produces a large number of small files, seriously degrading computing performance. We strongly recommend writing more than 64 MB of data each time (up to 100 GB of data can be written to the same block).
// You can estimate the total size from the average record size and the record count. For example: 64 MB < average record size x record count < 100 GB
// The server limits maxBlockID to 20,000. You can decide how many blocks to use per session, for example 100, according to your business needs, but the more blocks you use in each session the better, because creating a session is an expensive operation.
// If only a small amount of data is uploaded after a session is created, it not only causes problems such as small files and empty directories, but also seriously hurts the overall upload performance (creating a session takes seconds, while the actual upload may take only tens of milliseconds).
int maxBlockID = 20000;
for (int blockId = 0; blockId < maxBlockID; blockId++) {
// Prepare at least 64 MB of data before writing
// For example: read several files or read data from a database
try {
// Create a Writer on the block. At any time after the Writer is created, if no more than 4 KB of data is written during 2 consecutive minutes, the connection is closed with a timeout.
// Therefore, it is recommended to have the data ready in memory, so it can be written immediately, before creating the Writer.
RecordWriter recordWriter = uploadSession.openRecordWriter(blockId);
// Convert the data that was read into the Tunnel Record format and write it
int recordNumber = 1000000;
for (int index = 0; index < recordNumber; index++) {
// Convert the raw data at position "index" into an ODPS record
Record record = uploadSession.newRecord();
for (int i = 0; i < schema.getColumns().size(); i++) {
Column column = schema.getColumn(i);
switch (column.getType()) {
case BIGINT:
record.setBigint(i, 1L);
break;
case BOOLEAN:
record.setBoolean(i, true);
break;
case DATETIME:
record.setDatetime(i, new Date());
break;
case DOUBLE:
record.setDouble(i, 0.0);
break;
case STRING:
record.setString(i, "sample");
break;
default:
throw new RuntimeException("Unknown column type: "
+ column.getType());
}
}
// Writes the data to the server. Every 4 KB of data written triggers a network transmission.
// If no network transmission occurs for 120 seconds, the server closes the connection. At that point the Writer becomes unavailable, and you must open a new Writer and write the data again.
recordWriter.write(record);
}
// A successful close means that the block was uploaded successfully, but the data stays in an ODPS temporary directory and is not visible until the entire session is committed
recordWriter.close();
} catch (TunnelException e) {
// It is recommended to retry a certain number of times
e.printStackTrace();
System.out.println("write failed:" + e.getMessage());
} catch (IOException e) {
// It is recommended to retry a certain number of times
e.printStackTrace();
System.out.println("write failed:" + e.getMessage());
}
}
// Commit all the blocks. uploadSession.getBlockList() specifies the blocks to be committed. The data is not formally written to the ODPS partition until the commit succeeds. It is recommended to retry up to 10 times if the commit fails
for (int retry = 0; retry < 10; ++retry) {
try {
// A second-level operation that formally commits the data
uploadSession.commit(uploadSession.getBlockList());
break;
} catch (TunnelException e) {
System.out.println("uploadSession commit failed:" + e.getMessage());
} catch (IOException e) {
System.out.println("uploadSession commit failed:" + e.getMessage());
}
}
System.out.println("upload success!") ;
} catch (TunnelException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Constructor
PartitionSpec(String spec): Constructs an object of this class from a string.
Parameters
spec: The partition definition string, for example pt='1',ds='2'.
Therefore, the partition in the program above should be configured like this: private static String partition = "pt='XXX',ds='XXX'";
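For example, here is a minimal sketch of how such a partition string is turned into a PartitionSpec and passed to createUploadSession; the project, table, and partition values are placeholders, and the odps object is assumed to be configured as in the sample above.
PartitionSpec partitionSpec = new PartitionSpec("pt='20181218',ds='beijing'"); // placeholder values
TableTunnel tunnel = new TableTunnel(odps);
TableTunnel.UploadSession uploadSession =
    tunnel.createUploadSession("my_project", "my_table", partitionSpec);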
Each block ID in an upload session must be unique. That is, for the same UploadSession, open a RecordWriter with a given blockId, write a batch of data, and then call close.
After the close and the commit complete successfully, you cannot open another RecordWriter to write data with the same blockId again. A maximum of 20,000 blocks are supported per session, with block IDs ranging from 0 to 19999.
The maximum size of a block is 100 GB. We strongly recommend that you write 64 MB or more data into each block. Each block corresponds to one file. A file smaller than 64 MB is a small file. Too many small files will affect the performance.
Using the BufferedWriter provided by the latest version of the Tunnel SDK can simplify uploading and avoid problems such as too many small files.
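As a rough sketch (assuming the odps object, project, table, and partitionSpec from the sample above, with exception handling omitted), an upload through the buffered writer looks like this; you no longer manage blockIds yourself:
TableTunnel tunnel = new TableTunnel(odps);
TableTunnel.UploadSession uploadSession =
    tunnel.createUploadSession(project, table, partitionSpec);
RecordWriter writer = uploadSession.openBufferedWriter(); // buffers data and assigns blocks internally
Record record = uploadSession.newRecord();
record.setString("col0", "sample"); // placeholder column name and value
writer.write(record);
writer.close();
uploadSession.commit(); // commits the whole session; no block list is needed with the buffered writer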
Each session has a 24-hour lifecycle on the server. It can be used within 24 hours after being created, and it can be shared across processes or threads provided that the same BlockId is not used repeatedly. Distributed uploading can be done as follows:
Create a session -> estimate the data size -> assign blocks (for example, thread 1 uses blocks 0-99 and thread 2 uses blocks 100-199) -> prepare data -> upload data -> commit all blocks.
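A minimal sketch of sharing one session across workers follows; it assumes that you pass the session ID to each worker yourself and that each worker uses a disjoint blockId range. getUploadSession is used to resume the shared session by its ID.
// Coordinator: create the session once and record its ID
TableTunnel.UploadSession session = tunnel.createUploadSession(project, table, partitionSpec);
String sessionId = session.getId();
// Each worker process/thread: resume the shared session by ID and write its own blocks
TableTunnel.UploadSession shared = tunnel.getUploadSession(project, table, partitionSpec, sessionId);
long blockId = 0; // e.g. worker 1 writes blocks 0-99, worker 2 writes blocks 100-199
RecordWriter writer = shared.openRecordWriter(blockId);
// ... write records and close the writer ...
// Coordinator, after all workers finish: commit once with the full block list
// session.commit(session.getBlockList());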
Each session generates two file directories when it is created. If a large number of sessions are left unused after being created, temporary file directories accumulate and place an extra burden on the system. Therefore, avoid creating too many sessions and share sessions whenever possible.
During an upload, every 8 KB of data written by a Writer triggers a network action. If no network action is triggered within 120 seconds, the server closes the connection. At that point the Writer becomes unavailable, and you need to open a new Writer to write the data.
We recommend that you use the Tunnel-SDK-BufferedWriter interface to upload data. This interface hides blockId details from users, buffers data internally, and automatically retries on failure.
When downloading data, the Reader has a similar mechanism: if no network I/O occurs for a long period of time, the connection is closed. We recommend that you keep reading continuously and avoid calling time-consuming interfaces of other systems in between, because the resulting delay may cause the connection to time out.
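For reference, here is a minimal download sketch under the same assumptions as the upload sample (odps, project, table, and partitionSpec already configured; exception handling omitted):
TableTunnel.DownloadSession downloadSession =
    tunnel.createDownloadSession(project, table, partitionSpec);
long recordCount = downloadSession.getRecordCount();
RecordReader reader = downloadSession.openRecordReader(0, recordCount); // start index and number of records
Record record;
while ((record = reader.read()) != null) {
    // process the record promptly; long pauses here can hit the idle-connection timeout
}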
MaxCompute Tunnel is designed for batch uploading rather than stream uploading. For stream uploading, you can use DataHub, the high-speed streaming data channel, which writes data with only milliseconds of latency.
Note that MaxCompute Tunnel does not automatically create partitions; the target partition must exist before you upload.
Dship is a tool that uploads and downloads data through MaxCompute Tunnel.
Uploaded data is appended to the existing data; it does not overwrite it.
The routing function allows the Tunnel SDK to obtain the Tunnel endpoint automatically from the MaxCompute endpoint. That is, the Tunnel SDK works properly when only the MaxCompute endpoint is set.
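A minimal sketch of the two options (the placeholder Tunnel endpoint below is an assumption, not a value from this article):
Odps odps = new Odps(new AliyunAccount(accessId, accessKey));
odps.setEndpoint("http://service.odps.aliyun.com/api"); // set only the MaxCompute endpoint
TableTunnel tunnel = new TableTunnel(odps);
// With routing, no Tunnel endpoint needs to be set; the SDK resolves it from the MaxCompute endpoint.
// To bypass routing, you can still set a Tunnel endpoint explicitly:
// tunnel.setEndpoint("<your tunnel endpoint>");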
There is no absolute answer to this question. It depends on a variety of factors, such as network performance, real-time requirements, the specific use of the data, and small files in clusters. Generally, we recommend that you limit data in a block between 64 MB and 256 MB if data is relatively large in size and needs to be continuously uploaded.
However, if only a batch of data is uploaded daily, you can extend that limit to around 1 GB.
Timeouts and connection failures usually happen because of an incorrect endpoint. Check the endpoint configuration; a simple method is to test network connectivity with a tool such as telnet.
When data cannot be transferred to another project, the data protection function has usually been enabled for the project. If project data is protected, only the project owner has the right to transfer data from one project to another.
This error means that the maximum number of concurrent requests has been exceeded. By default, MaxCompute Tunnel allows a maximum of 2,000 concurrent upload and download requests (the quota). Each request occupies one quota unit from the moment it is sent until it ends. If you hit this limit, reduce the concurrency of your uploads and downloads, or wait for running requests to finish and then retry.
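As an illustration only (the retry count and back-off delays below are assumptions, not values from this article), a simple wait-and-retry pattern around opening a writer could look like this:
int maxRetries = 5;
for (int attempt = 1; attempt <= maxRetries; attempt++) {
    try {
        RecordWriter writer = uploadSession.openRecordWriter(blockId);
        // ... write records and close the writer ...
        break; // succeeded, stop retrying
    } catch (TunnelException | IOException e) {
        // may indicate that the concurrency quota is exhausted or that a transient error occurred
        System.out.println("attempt " + attempt + " failed: " + e.getMessage());
        try {
            Thread.sleep(attempt * 10000L); // back off 10s, 20s, ... before retrying
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}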
To learn more about Alibaba Cloud MaxCompute, visit https://www.alibabacloud.com/product/maxcompute