
Tablestore:Connect a Tablestore instance to DataWorks

Last Updated:Sep 02, 2024

Tablestore allows you to connect a Tablestore instance to DataWorks so that you can manage and use the data of the instance in DataWorks. To connect an instance to DataWorks, you must add a Tablestore data source in DataWorks. After you add the data source, you can configure a data synchronization task in DataWorks to synchronize and migrate Tablestore data, and execute SQL statements to query Tablestore data. This topic describes how to add a Tablestore data source and execute SQL statements to query Tablestore data.

Background information

DataWorks is an end-to-end big data development and governance platform that provides data warehousing, data lake, and data lakehouse solutions based on big data compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), AnalyticDB, and Cloudera Data Platform (CDP). For more information, see What is DataWorks?

Scenarios

After you connect a Tablestore instance to DataWorks, you can process and analyze data in DataWorks in an efficient manner and use the data in various big data scenarios. The following section describes some common scenarios:

  • Large-scale data storage and analysis

    Tablestore provides high throughput and large storage. DataWorks supports big data analytics. After the connection, you can use SQL statements, MapReduce tasks, or custom code in DataWorks to query and process Tablestore data. For example, you can analyze logs and user behavior.

  • Real-time data processing

    DataWorks supports real-time computing tasks, such as Flink tasks. After the connection, you can consume, process, and analyze Tablestore data in real time in scenarios such as real-time monitoring, real-time reporting, and real-time recommendation.

  • Offline data processing and batch processing

    DataWorks provides task scheduling capabilities. After the connection, batch processing tasks involving Tablestore data are triggered to run as scheduled. The tasks can be extract, transform, load (ETL) tasks that convert raw data into the required formats for analysis or scheduled tasks that aggregate and analyze data.

  • Data lake and data warehouse building

    Tablestore can be used as the storage layer of a data lake that stores raw or semi-structured data. After the connection, you can use DataWorks to build a data pipeline to process and cleanse Tablestore data and then import the data into MaxCompute or other data warehouses for further analysis and mining.

  • BI reporting and data visualization

    DataWorks can be integrated with business intelligence (BI) reporting tools, such as Quick BI. After the connection, DataWorks can directly read data from Tablestore and generate various BI reports and dashboards to support enterprise decision-making.

  • Machine learning and AI projects

    You can store the data for training, models, and features in Tablestore. After the connection, you can use DataWorks to call tasks that are created in Platform for AI (PAI) for model training and prediction based on the Tablestore data. This allows you to perform end-to-end development from data preparation to model deployment.

Preparations

  • Complete the following operations in the Resource Access Management (RAM) console:

    • Create a RAM user and grant the RAM user Tablestore management permissions (AliyunOTSFullAccess) and DataWorks management permissions (AliyunDataWorksFullAccess). For more information, see Create a RAM user and Grant permissions to the RAM user.

      Warning

      If the AccessKey pair of your Alibaba Cloud account is leaked, your resources are exposed to potential risks. We recommend that you use the AccessKey pair of a RAM user to perform operations. This prevents the AccessKey pair of your Alibaba Cloud account from being leaked.

    • Create an AccessKey pair for the RAM user. For more information, see Create an AccessKey pair.

  • In the Tablestore console, create a Tablestore instance and a data table.

  • In the DataWorks console, create a workspace.

Usage notes

  • You can connect Tablestore instances that use the Wide Column model or the TimeSeries model to DataWorks.

  • If you use the SQL query feature in the DataAnalysis service of DataWorks, you can query only data from the data sources in the workspace on which you have access permissions. Before you run SQL queries, contact an administrator to add you as a member of the workspace and assign the Data Analyst, Model Developer, Development, O&M, Workspace Manager, or Project Owner role to you.

  • The SQL query feature in the DataAnalysis service of DataWorks provides the same capabilities as the SQL query feature of Tablestore. For more information, see SQL features.

Procedure

After you connect a Tablestore instance to DataWorks, you can use the SQL query feature to query and analyze Tablestore data.

Step 1: Add a Tablestore data source in DataWorks

To add a Tablestore instance as a data source, perform the following steps:

  1. Go to the Data Integration page.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, click Data Integration.

    3. On the Data Integration page, select a region, select a workspace that you want to manage, and then click Go to Data Integration.

  2. In the left-side navigation pane of the Data Integration page, click Data source.

  3. On the Data Sources page, click Add Data Source.

  4. In the Add Data Source dialog box, click the Tablestore block.

  5. In the Add OTS Data Source dialog box, configure the parameters. The following list describes the parameters.

    • Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.

    • Data Source Description: The description of the data source. The description can be up to 80 characters in length.

    • Endpoint: The endpoint of the Tablestore instance. For more information, see Endpoints. If the Tablestore instance and the resources of the destination data source are in the same region, specify a virtual private cloud (VPC) endpoint. Otherwise, specify a public endpoint.

    • Tablestore Instance Name: The name of the Tablestore instance. For more information, see Instances.

    • AccessKey ID and AccessKey Secret: The AccessKey ID and AccessKey secret of your Alibaba Cloud account or RAM user. For more information about how to create an AccessKey pair, see Create an AccessKey pair.

  6. Test the network connectivity between the data source and a resource group.

    To ensure that your synchronization task can be run as expected, test the network connectivity between the data source and the resource group that is required by the synchronization task.

    Important

    A synchronization task can use only one type of resource group. By default, only shared resource groups for Data Integration are displayed in the resource group list. To ensure the stability and performance of data synchronization, we recommend that you use an exclusive resource group for Data Integration.

    1. Click Purchase to create a resource group or click Associate Purchased Resource Group to associate an existing resource group. For more information, see Create and use an exclusive resource group for Data Integration.

    2. Find the resource group that you want to manage and click Test Network Connectivity in the Connection Status column.

      If Connectable is displayed in the Connection Status column, the connectivity test is passed.

  7. After the data source passes the connectivity test, click Complete Creation.

    In the data source list, you can view the newly created data source.
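The naming and length rules for the data source parameters can be checked before you submit the form. The following Python sketch is purely illustrative: the helper name is hypothetical, and DataWorks performs its own validation when you click Complete Creation.

```python
import re

# Mirrors the rules described above: the name must start with a letter and
# contain only letters, digits, and underscores; the description is limited
# to 80 characters. This helper is an illustrative aid, not a DataWorks API.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def validate_data_source(name, description=""):
    """Return a list of violated rules for a data source configuration."""
    errors = []
    if not NAME_PATTERN.match(name):
        errors.append("name must start with a letter and contain only "
                      "letters, digits, and underscores")
    if len(description) > 80:
        errors.append("description must be at most 80 characters")
    return errors
```

For example, `validate_data_source("ots_source")` returns an empty list, while `validate_data_source("1bad-name")` reports a violation.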

Step 2: Query Tablestore data by using the SQL query feature in DataWorks

Tablestore supports the Wide Column model and the TimeSeries model. The data query methods vary based on the model of your instance. Select a query method based on the model of your instance.

Query data on an instance for the Wide Column model

  1. Go to the data analysis page.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, choose DataAnalysis > SQL Query.

    3. On the SQL Query page, select a region, select a workspace that you want to manage, and then click Go to SQL Query.

  2. Create an SQL query file.

    1. In the left-side navigation pane of the SQL Query page, move the pointer over the plus icon to the right of My Files and select Create File.

    2. In the Create File dialog box, enter a file name and click OK.

      In the left-side navigation pane, you can view the created file.

  3. Open the SQL editor for the created file and configure the information about the data source whose data you want to query.


    1. In the left-side navigation pane of the SQL Query page, click My Files and click the created file. In the upper-right corner of the SQL editor that appears, click the icon that is used to configure the data source.

    2. In the dialog box that appears, configure the parameters. The following list describes the parameters.

      • Workspace: The workspace whose data you want to query. Select a workspace based on your business requirements.

      • Data Source Type: The type of the data source whose data you want to query. Select Tablestore.

      • Data Source Name: The name of the data source whose data you want to query. Select the data source that you added. The table that you want to query belongs to the instance that is specified for the data source.

        Important

        The Data Source Name drop-down list displays only the data sources that you are authorized to use. If you want to use other data sources, contact an administrator to grant the required permissions on the Security Center page. For more information, see Use the data query and analysis control feature.

    3. Click OK.

  4. Create a mapping table for a table and execute SQL statements.

    If a mapping table is created for your table, you can skip this step.

    Important

    When you create a mapping table for a table, make sure that the data types of the fields in the mapping table match the data types of the fields in the table. For more information, see Data type mappings in SQL.

    1. In the SQL editor for the file, write an SQL statement to create a mapping table. For more information, see Create a mapping table for a table.

      When you create a mapping table for a table, make sure that the mapping table has the same name and primary key columns as the table.

      Note

      If a search index is created for a data table, you can create a mapping table for the search index. This way, you can execute SQL statements to query data based on the search index. For more information, see Create mapping tables for search indexes.

      The following sample code provides an example on how to create a mapping table for the test_table table.

      CREATE TABLE `test_table` (
          `pk` VARCHAR(1024), 
          `long_value` BIGINT(20), 
          `double_value` DOUBLE, 
          `string_value` MEDIUMTEXT, 
          `bool_value` BOOL, 
          PRIMARY KEY(`pk`)
      );
    2. Click the SQL statement and click Run in the toolbar.

      The execution result is displayed on the Result tab of the SQL editor.

  5. Execute SQL statements to query data in the table.

    1. In the SQL editor for the file, write a SELECT statement to query data. For more information, see Query data.

      The following sample code provides an example on how to query data in the test_table table and obtain up to 20 rows of data.

      SELECT `pk`, `long_value`, `double_value`, `string_value`, `bool_value` FROM test_table LIMIT 20;
    2. Click the SQL statement and click Run in the toolbar.

      The execution result is displayed on the Result tab of the SQL editor.
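The CREATE TABLE statement in the preceding procedure pairs each field with a SQL type according to the Tablestore-to-SQL data type mappings. As a rough, non-authoritative summary of the pairs used in the sample statement (consult Data type mappings in SQL for the definitive list):

```python
# Summary of the type pairs used in the sample CREATE TABLE statement above.
# Illustrative only; see "Data type mappings in SQL" for the authoritative mapping.
TYPE_MAPPING = {
    "STRING (primary key column)": "VARCHAR(1024)",  # pk
    "INTEGER": "BIGINT(20)",                         # long_value
    "DOUBLE": "DOUBLE",                              # double_value
    "STRING (attribute column)": "MEDIUMTEXT",       # string_value
    "BOOLEAN": "BOOL",                               # bool_value
}
```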

Query data on an instance for the TimeSeries model

  1. Go to the data analysis page.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, choose DataAnalysis > SQL Query.

    3. On the SQL Query page, select a region, select a workspace that you want to manage, and then click Go to SQL Query.

  2. Create an SQL query file.

    1. In the left-side navigation pane of the SQL Query page, move the pointer over the plus icon to the right of My Files and select Create File.

    2. In the Create File dialog box, enter a file name and click OK.

      In the left-side navigation pane, you can view the created file.

  3. Open the SQL editor for the created file and configure the information about the data source whose data you want to query.


    1. In the left-side navigation pane of the SQL Query page, click My Files and click the created file. In the upper-right corner of the SQL editor that appears, click the icon that is used to configure the data source.

    2. In the dialog box that appears, configure the parameters. The following list describes the parameters.

      • Workspace: The workspace whose data you want to query. Select a workspace based on your business requirements.

      • Data Source Type: The type of the data source whose data you want to query. Select Tablestore.

      • Data Source Name: The name of the data source whose data you want to query. Select the data source that you added. The table that you want to query belongs to the instance that is specified for the data source.

        Important

        The Data Source Name drop-down list displays only the data sources that you are authorized to use. If you want to use other data sources, contact an administrator to grant the required permissions on the Security Center page. For more information, see Use the data query and analysis control feature.

    3. Click OK.

  4. Create a mapping table for a table and execute SQL statements.

    After you create a time series table, the system automatically creates a mapping table in the single-value model and a mapping table for time series metadata for the time series table. The mapping table in the single-value model has the same name as the time series table. The name of the mapping table for time series metadata is the name of the time series table suffixed with ::meta.

    If you want to use a mapping table in the multi-value model to query time series data in the time series table, you must create the mapping table in the multi-value model yourself.

    Important

    When you create a mapping table for a table, make sure that the data types of the fields in the mapping table match the data types of the fields in the table. For more information, see Data type mappings in SQL. For more information about the data types of fields in the mapping tables for a time series table, see Mapping tables for a time series table in SQL.

    1. In the SQL editor for the file, write an SQL statement to create a mapping table. For more information, see Mapping tables for a time series table in SQL.

      The following sample code provides an example on how to create a mapping table in the multi-value model named timeseries_table::muti_model for the time series table. The metrics in the mapping table are cpu, memory, and disktop.

      CREATE TABLE `timeseries_table::muti_model` (
        `_m_name` VARCHAR(1024), 
        `_data_source` VARCHAR(1024), 
        `_tags` VARCHAR(1024),
        `_time` BIGINT(20),
        `cpu` DOUBLE,
        `memory` DOUBLE,
        `disktop` DOUBLE,
        PRIMARY KEY(`_m_name`,`_data_source`,`_tags`,`_time`)
      );
    2. Click the SQL statement and click Run in the toolbar.

      The execution result is displayed on the Result tab of the SQL editor.

  5. Execute SQL statements to query data in the table.

    1. In the SQL editor for the file, write a SELECT statement to query data. For more information, see SQL examples.

      • Query data by using a mapping table in the single-value model

        The following sample code provides an example on how to query the data whose metric name is basic_metric in the time series table.

        SELECT * FROM timeseries_table WHERE _m_name = "basic_metric" LIMIT 10;
      • Query data by using a mapping table for time series metadata

        The following sample code provides an example on how to query the time series whose metric name is basic_metric in the mapping table for time series metadata.

        SELECT * FROM `timeseries_table::meta` WHERE _m_name = "basic_metric" LIMIT 100;
      • Query data by using a mapping table in the multi-value model

        The following sample code provides an example on how to query information about the metrics in the time series whose cpu value is greater than 20.0 by using the mapping table in the multi-value model.

        SELECT cpu,memory,disktop FROM `timeseries_table::muti_model` WHERE cpu > 20.0 LIMIT 10;
    2. Click the SQL statement and click Run in the toolbar.

      The execution result is displayed on the Result tab of the SQL editor.
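The mapping table naming convention described in step 4 can be sketched as a small helper. The function name is hypothetical; the suffix rule comes from the convention above.

```python
def mapping_table_names(timeseries_table):
    """Derive the names of the auto-created mapping tables for a time series table."""
    return {
        # The single-value mapping table shares the time series table's name.
        "single_value": timeseries_table,
        # The metadata mapping table appends the ::meta suffix.
        "meta": timeseries_table + "::meta",
    }
```

For example, `mapping_table_names("timeseries_table")["meta"]` yields `timeseries_table::meta`, the table queried in the metadata example above.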

Billing rules

Tablestore fees

When you execute SQL statements in DataWorks to access Tablestore resources, no fees are generated for the SQL statements. However, if operations such as table scans or index queries are performed during the execution of the SQL statements, fees are generated for the operations. For more information, see Billable items of SQL query.

You are charged for the read and write throughput that is consumed by read and write requests in Tablestore. You are separately charged for metered read and write capacity units (CUs) and reserved read and write CUs. The type of the instance you access determines whether metered read and write CUs or reserved read and write CUs are consumed.

Note

For information about instance types and CUs, see Instances and Read and write throughput.
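As a back-of-the-envelope illustration of how CU consumption scales with payload size, the following sketch assumes the 4 KB-per-CU granularity described in Read and write throughput; verify the exact metering rules against the current billing documentation.

```python
import math

# Assumed granularity: one capacity unit covers up to 4 KB of data read or written.
CU_SIZE_BYTES = 4 * 1024

def capacity_units(payload_bytes):
    """Estimate CUs consumed by a request: at least 1 CU, then 1 per started 4 KB."""
    return max(1, math.ceil(payload_bytes / CU_SIZE_BYTES))
```

Under this assumption, a 100-byte read consumes 1 CU, while a 9 KB read consumes 3 CUs.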

Other resource fees

When you use DataWorks tools, you are charged for specific features and resources. For more information, see Purchase guide.

References

  • You can use the SQL query feature in the Tablestore console or the Tablestore CLI, or by using Tablestore SDKs, the Tablestore JDBC driver, or the Tablestore driver for Go. For more information, see Usage method of SQL query.

  • You can use computing engines, such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, Realtime Compute for Apache Flink, and PrestoDB, to compute and analyze data in tables. For more information, see Overview.

  • If you want to accelerate data queries and computing by executing SQL statements, you can create a secondary index or a search index. For more information, see Index selection policy and Computing pushdown.

  • You can also use Data Integration of DataWorks to migrate data from data sources, such as MySQL, Oracle, Kafka, HBase, MaxCompute, PolarDB-X 2.0, and Tablestore, to Tablestore. For more information, see Data Integration.