This topic describes how to use a PyODPS node to read data from a partitioned table.
Prerequisites
The following operations are completed:
- MaxCompute is activated. For more information, see Activate MaxCompute.
- DataWorks is activated. For more information, see Activate DataWorks.
- A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.
Procedure
- Prepare test data.
- Create a partitioned table and a source table, and import data to the source table. For more information, see Create tables and upload data. In this example, use the following table creation statements and source data.
- Execute the following statement to create a partitioned table named user_detail:
create table if not exists user_detail
(
    userid    BIGINT comment 'user ID',
    job       STRING comment 'job type',
    education STRING comment 'education level'
)
comment 'user information table'
partitioned by (dt STRING comment 'date', region STRING comment 'region');
- Execute the following statement to create a source table named user_detail_ods:
create table if not exists user_detail_ods
(
    userid    BIGINT comment 'user ID',
    job       STRING comment 'job type',
    education STRING comment 'education level',
    dt        STRING comment 'date',
    region    STRING comment 'region'
);
- Create a source data file named user_detail.txt and save the following data to the file. Import the data to the user_detail_ods table.
0001,Internet,bachelor,20190715,beijing
0002,education,junior college,20190716,beijing
0003,finance,master,20190715,shandong
0004,Internet,master,20190715,beijing
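If you want to inspect the source records before importing them through the DataWorks upload wizard, the comma-separated rows above can be parsed with a short Python sketch (an illustrative helper, not part of the official import procedure):

```python
import csv
import io

# Sample rows from user_detail.txt.
# Columns match user_detail_ods: userid, job, education, dt, region.
raw = """0001,Internet,bachelor,20190715,beijing
0002,education,junior college,20190716,beijing
0003,finance,master,20190715,shandong
0004,Internet,master,20190715,beijing"""

# Parse the comma-separated records into tuples.
rows = [tuple(r) for r in csv.reader(io.StringIO(raw))]
for userid, job, education, dt, region in rows:
    print(userid, job, education, dt, region)
```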
- Log on to the DataWorks console and click Workspaces in the left-side navigation pane.
- On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
- On the Data Development tab, right-click the target workflow in the Business process section and choose New > MaxCompute > ODPS SQL.
- In the New node dialog box, set the Node name parameter and click Submit.
- On the configuration tab of the ODPS SQL node, enter the following code in the code editor:
insert overwrite table user_detail partition (dt,region) select userid,job,education,dt,region from user_detail_ods;
- Click the Run icon in the toolbar to insert the data from the user_detail_ods table to the user_detail table.
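The insert overwrite statement above uses dynamic partitioning: each source row is routed into the (dt, region) partition named by its last two columns. The routing logic can be sketched in plain Python (an illustration of the behavior, not how MaxCompute is implemented):

```python
from collections import defaultdict

# Source rows in user_detail_ods column order: userid, job, education, dt, region.
source = [
    (1, 'Internet', 'bachelor', '20190715', 'beijing'),
    (2, 'education', 'junior college', '20190716', 'beijing'),
    (3, 'finance', 'master', '20190715', 'shandong'),
    (4, 'Internet', 'master', '20190715', 'beijing'),
]

# Group each row under its partition spec, e.g. "dt=20190715,region=beijing".
partitions = defaultdict(list)
for userid, job, education, dt, region in source:
    spec = 'dt=%s,region=%s' % (dt, region)
    partitions[spec].append((userid, job, education))

for spec in sorted(partitions):
    print(spec, partitions[spec])
```

With the sample data this produces three partitions, two rows landing in dt=20190715,region=beijing.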
- Use a PyODPS node to read data from a partitioned table.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
- On the Data Development tab, right-click the target workflow in the Business process section and choose New > MaxCompute > PyODPS 2.
- In the New node dialog box, set the Node name parameter and click Submit.
- On the configuration tab of the PyODPS 2 node, enter the following code in the code editor:
import sys
from odps import ODPS
reload(sys)
print('dt=' + args['dt'])
# Set UTF-8 as the default encoding format.
sys.setdefaultencoding('utf8')
# Obtain the partitioned table.
t = o.get_table('user_detail')
# Check whether the specified partition exists.
print t.exist_partition('dt=20190715,region=beijing')
# View all partitions in the partitioned table.
for partition in t.partitions:
    print partition.name
# You can use one of the following methods to query data in the partitioned table:
# Method 1
with t.open_reader(partition='dt=20190715,region=beijing') as reader1:
    count = reader1.count
    print("Query data in the partitioned table by using Method 1:")
    for record in reader1:
        print record[0], record[1], record[2]
# Method 2
print("Query data in the partitioned table by using Method 2:")
reader2 = t.open_reader(partition='dt=20190715,region=beijing')
for record in reader2:
    print record["userid"], record["job"], record["education"]
# Method 3
print("Query data in the partitioned table by using Method 3:")
for record in o.read_table('user_detail', partition='dt=20190715,region=beijing'):
    print record["userid"], record["job"], record["education"]
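The partition argument passed to open_reader and read_table above is a plain string such as 'dt=20190715,region=beijing', naming one value per partition column. A small hypothetical helper (not part of PyODPS) shows how such a spec maps to column/value pairs:

```python
def parse_partition_spec(spec):
    """Split a spec like 'dt=20190715,region=beijing' into (column, value) pairs."""
    pairs = []
    for part in spec.split(','):
        name, _, value = part.partition('=')
        pairs.append((name.strip(), value.strip()))
    return pairs

print(parse_partition_spec('dt=20190715,region=beijing'))
```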
- Click Advanced run (run with parameters) in the toolbar.
- In the Parameters dialog box, set the parameters and click Confirm. Parameter description:
- Scheduling Resource Group: Set this parameter to Common scheduler resource group.
- dt: Set this parameter to dt=20190715.
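In a PyODPS node, scheduling parameters set in this dialog box are delivered to the code as the args dictionary, so args['dt'] evaluates to the string '20190715'. A rough simulation of the lookup (assuming the parameter value above):

```python
# Simulated scheduling parameters as the PyODPS node would receive them.
args = {'dt': '20190715'}

# Equivalent of the node's first print statement.
line = 'dt=' + args['dt']
print(line)
```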
- View the running result of the PyODPS 2 node on the Run Log tab.