Hologres development standards - Hologres - Alibaba Cloud Documentation Center

This topic describes the Hologres development standards. You can prevent operation errors by complying with these standards.

Data domains

Data layers
A data warehouse is divided into the operational data store (ODS), common data model (CDM), and application data store (ADS) layers. The CDM layer consists of the data warehouse detail (DWD), data warehouse service (DWS), and dimension data (DIM) layers. In Hologres, layers are distinguished by schema.
- ODS (Operational Data Store)
- CDM (Common Data Model)
  - DWD (Data Warehouse Detail)
  - DWS (Data Warehouse Summary)
  - DIM (Dimension)
- ADS (Application Data Service)
You can divide the data warehouse of your enterprise into layers at different levels. If your enterprise has different business units, you can divide the data warehouse by business unit and use the abbreviation of a business unit name as the prefix of the corresponding schema to distinguish layers.
```
create schema ${bu}_ads;
create schema ${bu}_ads_dev;
create schema ${bu}_dwd;
create schema ${bu}_dwd_dev;
create schema ${bu}_dws;
create schema ${bu}_dws_dev;
create schema ${bu}_dim;
create schema ${bu}_dim_dev;
create schema ${bu}_ods;
create schema ${bu}_ods_dev;
```

Abbreviations of data domains

We recommend that you specify an abbreviation for each data domain and enforce the use of abbreviations within your enterprise to ensure consistency. The following table describes some abbreviations of data domains that you can use.

Data domain	Sample abbreviation
Transaction	trd
Product	itm
Log	log
Member and store	mbr
Purchase, stock, and sale management	dst
Sale and customer service	crm
Credit and risk control	rsk
Tool and service	tls
Logistics and express delivery	lgt

Naming conventions

Jobs
To distinguish between internal and synchronization jobs, you must comply with the following naming conventions:
- Internal asynchronous SQL jobs: holo_{target_table_name}. This distinguishes internal SQL jobs from jobs that involve external tables.
- Data import jobs: {source}2holo_{target_table_name}.
- Data export jobs: holo2{target}_{target_table_name}.

Tables

Layer	Format	Example
DWD	`${bu}_dwd.Data domain_Business process_[Custom root]_Suffix`	`taobao_dwd.trd_ord_flow`
DWS	`${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period`	`taobao_dws.trd_all_dtr, taobao_dws.log_slr_pv_dtr`
DIM	`${bu}_dim.{Dimension}[_{Custom root}]`	`taobao_cdm.dim_itm`
ADS	`${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}` Note Valid values of the Update frequency variable: d: daily update r: real-time update h: quasi-real-time update	`taobao_ads.trd_cate_d`

Table groups
If you require multiple table groups, you can name the table groups in the following format: ${bu}_{Layer}_{Business}_tg.
Views
Views can be named in the following formats:
- Formats
  - DWS: ${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period_v
  - ADS: ${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}_v
- Example
```
taobao_dws.trd_byr_itm_ord_cm_v
```
Foreign tables
We recommend that you use the name of the source MaxCompute table suffixed with ext as the name of a foreign table. Example:
```
taobao_dim.camp_ext
```
Temporary tables
We recommend that you use the name of the original table prefixed with tmp and suffixed with a sequence number as the name of a temporary table. Example:
```
taobao_dim.tmp_camp_01
```
Abbreviations
Statistical period
Abbreviation
Last one day
1d
Last N days
nd
Total days
td
Calendar week
cw
Calendar month
cm
Accrual to the current day
dtr
Accrual to the current hour
dhr

Table configuration standards

Internal tables
When you create an internal table, you must comply with the naming conventions for the table and its fields. In addition, you must set the lifecycle of the table based on your business requirements and add comments for the table and its fields. Take note of the following standards and recommendations when you configure an internal table:
- Standards that you must comply with to ensure successful creation of the internal table
  - Add concise and accurate comments for the table and its fields to facilitate data development and analysis.
  - Specify the time_to_live_in_seconds property in the statement used to create the table. This property is used to specify the lifecycle of the table.
  - Specify the distribute_key property in the statement used to create the table. This property is used to specify the distribution key.
    The selected distribution key must have sufficient cardinality and is most frequently used for the JOIN and GROUP BY operations. For example, in a table that stores product data, you may select the user_id and item_id fields as the distribution key. However, if the user_id field is most frequently used to join data, you must select only the user_id field as the distribution key.
  - Create tables involved in join queries within the same table group.
  - Use the same name and data type for the same entity ID in all fact tables and dimension tables. For example, if you specify user_id as the user ID in a transaction table, you must also use user_id rather than uid as the user ID in a dimension table. Consistency in data type eliminates the need to convert data types.
  - Use ds for the partition fields of all physical tables.
- Recommendations
  - Specify one of the bitmap_columns, segment_key, and cluster_key properties in the statement used to create the table.
  - Do not specify the dictionary_encoding_columns property if you are unsure about the cardinality of fields in the table. This property is used to build dictionary mappings. You can execute the following statement to skip dictionary mappings:
```
call set_table_property('table_name', 'dictionary_encoding_columns','')
```
  - Set the orientation property to column to set the table as a column-oriented table. The row-oriented storage type is also supported by Hologres.
    Note
    Do not set the table as a row-oriented table unless all the primary key columns of the table can be specified in each query on the table by using the equal to or IN operator. By default, if the orientation property is not specified, the column-oriented storage type is used.
  - Set the bitmap_columns property. This property is used to specify the fields for which Hologres builds bitmap indexes. Bitmap indexes can be used to filter stored data.
    - We recommend that you convert filter conditions to bitmap indexes. By default, bitmap indexes are built for all fields of the TEXT type.
    - We recommend that you do not specify a field that has a large number of values for this property, such as the user_id field. Instead, you can specify fields such as activity_id.
  - Set the event_time_column property to the fields that are used for real-time data writing, such as the field that stores event timestamps.
  - Set the clustering_key property to the field for which Hologres builds a clustered index. Hologres sorts data based on clustered indexes. You can accelerate RANGE and FILTER queries on the index field by using the clustered index. Only a single clustered index is supported for each table. This property is available to filter data by range. For example, this property can be used for the distribution of gross merchandise volume (GMV).
Foreign tables
Hologres allows you to use foreign tables to accelerate queries on data in source MaxCompute tables. This simplifies the data synchronization process. To improve computing performance, we recommend that you do not associate foreign tables with internal tables. Take note of the following standards and recommendations when you configure a foreign table for better foreign table management:
- Standard that you must comply with: Follow the naming conventions and use the name of a source MaxCompute table suffixed with ext as the name of a foreign table.
- Recommendations
  - Retain the DDL statement that is used to create the foreign table and manage different versions of the table.
  - Do not associate the foreign table with an internal table. Instead, synchronize data from the foreign table to the internal table.
Views
- Standard that you must comply with: Follow the naming conventions.
- Recommendations
  - Schedule tasks to configure dependencies in the entire business process.
  - Create individual views for tables that store data at different granularities to reduce the amount of data in computing.
    For example, you can create the cw (calendar week), cm (calendar month), nd (N days), and 1d (one day) views for four tables that store data at different granularities. If tables store data for different terminals, you can create the pc, wap, and app views. If tables store data for different data collection methods, you can create the ut and non-ut views.

Lifecycle of internal tables

Layer	Recommended lifecycle
DWD	For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
DWS	For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
DIM	For a large dimension table that is stored in ultimate storage mode, we recommend that you set the lifecycle to permanent. For a small dimension table, we recommend that you set the lifecycle to that of the source MaxCompute table. The difference between large and small dimension tables lies in the size of a single partition. If the size of a partition in a table exceeds 1 TB, the table is considered a large dimension table.

Recommendations

For a partitioned table, we recommend that you write real-time data to the partition generated on the same day and configure an appropriate time-to-live (TTL) period based on the layer. Do not write updated data to a partition whose TTL period exceeds the specified TTL period.

Table groups (optional)
Each database has default table groups and a default shard count. You can create table groups or change the shard count based on your business requirements for better performance. Recommendations:
- Do not create table groups unless necessary.
- Create a table group with a large shard count for tables that contain large amounts of data.
- Create a table group with a small shard count for a large number of tables that contain small amounts of data.
- Put tables involved in join queries in the same table group.

Field setting standards

Field types

The following table describes the field type standards that you must comply with.

Field name or suffix	Description	Example	Type or format
user_id	An auto-increment user ID field.	`user_id=232442843`	`int8`
item_id	A field about the product.	`item_id=63283278784383`	`int8`
member_id	A member ID field.	`member_id=b2b-dsajk2343821b`	TEXT
amt	A field about the amount.	`pay_ord_amt_1d_001=923.23`	NUMERIC
fee	A field about the cost.	`post_fee=923.23`	NUMERIC
cnt	A field about the quantity.	`pay_ord_byr_cnt_1d_001=923`	`int4/int8`
is_*	A Boolean field.	`is_pm=Y/is_pm=true`	TEXT/BOOL
ds	A partition field.	`ds=20210120`	YYYYMMDD

Data types
The data types of Hologres are compatible with those of PostgreSQL. Hologres supports some of the data types of PostgreSQL. For more information about Hologres data types and the mappings between MaxCompute and Hologres data types, see Data types.
Currency and precision
In regions within the Chinese mainland, use CNY as the currency unit. In other regions, use USD as the currency unit. Unless specifically stated, do not round data related to the amount of money. This can prevent inconsistent results of subsequent calculation.

SQL syntax

Standards that you must comply with:
- Use the SELECT * statement only for fields that need to be calculated in the outmost query and subquery. Specify field names for all operations.
- Use the COALESCE expression for empty fields and strings in WHERE clauses.
Recommendations
- Use the count (distinct) method to set the distribution key. If you use the count (distinct) method multiple times, rewrite the statement. Example:
```
select count(distinct userid)
     , count(distinct case when stat_date = '20201111' then userid end) 
from t group by cate_id;

Rewrite the statement to:
select count(1), sum(c) from 
(
  select userid
       , cate_id
       , cast(count(case when stat_date = '20201111' then 1 end) > 0) as c 
  from t 
  group by cate_id, userid
) t1 
group by cate_id;
```
- Execute the ANALYZE TABLE statement to analyze partitioned tables for offline scheduling tasks.
- If your tables are required for an extended period of time, use the detach and attach methods to manage historical partitions to avoid radical changes in metrics.

Statistical period	Abbreviation
Last one day	1d
Last N days	nd
Total days	td
Calendar week	cw
Calendar month	cm
Accrual to the current day	dtr
Accrual to the current hour	dhr