This topic describes the Hologres development standards. You can prevent operation errors by complying with these standards.
Data domains
Data layers
A data warehouse is divided into the operational data store (ODS), common data model (CDM), and application data store (ADS) layers. The CDM layer consists of the data warehouse detail (DWD), data warehouse service (DWS), and dimension data (DIM) layers. In Hologres, layers are distinguished by schema.
ODS (Operational Data Store)
CDM (Common Data Model)
DWD (Data Warehouse Detail)
DWS (Data Warehouse Summary)
DIM (Dimension)
ADS (Application Data Service)
You can divide the data warehouse of your enterprise into layers at different levels. If your enterprise has different business units, you can divide the data warehouse by business unit and use the abbreviation of a business unit name as the prefix of the corresponding schema to distinguish layers.
create schema ${bu}_ads; create schema ${bu}_ads_dev; create schema ${bu}_dwd; create schema ${bu}_dwd_dev; create schema ${bu}_dws; create schema ${bu}_dws_dev; create schema ${bu}_dim; create schema ${bu}_dim_dev; create schema ${bu}_ods; create schema ${bu}_ods_dev;
Abbreviations of data domains
We recommend that you specify an abbreviation for each data domain and enforce the use of abbreviations within your enterprise to ensure consistency. The following table describes some abbreviations of data domains that you can use.
Data domain
Sample abbreviation
Transaction
trd
Product
itm
Log
log
Member and store
mbr
Purchase, stock, and sale management
dst
Sale and customer service
crm
Credit and risk control
rsk
Tool and service
tls
Logistics and express delivery
lgt
Naming conventions
Jobs
To distinguish between internal and synchronization jobs, you must comply with the following naming conventions:
Internal asynchronous SQL jobs:
holo_{target_table_name}
. This distinguishes internal SQL jobs from jobs that involve external tables.Data import jobs:
{source}2holo_{target_table_name}
.Data export jobs:
holo2{target}_{target_table_name}
.
Tables
Layer
Format
Example
DWD
${bu}_dwd.Data domain_Business process_[Custom root]_Suffix
taobao_dwd.trd_ord_flow
DWS
${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period
taobao_dws.trd_all_dtr, taobao_dws.log_slr_pv_dtr
DIM
${bu}_dim.{Dimension}[_{Custom root}]
taobao_cdm.dim_itm
ADS
${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}
NoteValid values of the Update frequency variable:
d: daily update
r: real-time update
h: quasi-real-time update
taobao_ads.trd_cate_d
Table groups
If you require multiple table groups, you can name the table groups in the following format:
${bu}_{Layer}_{Business}_tg
.Views
Views can be named in the following formats:
Formats
DWS:
${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period_v
ADS:
${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}_v
Example
taobao_dws.trd_byr_itm_ord_cm_v
Foreign tables
We recommend that you use the name of the source MaxCompute table suffixed with
ext
as the name of a foreign table. Example:taobao_dim.camp_ext
Temporary tables
We recommend that you use the name of the original table prefixed with
tmp
and suffixed with a sequence number as the name of a temporary table. Example:taobao_dim.tmp_camp_01
Abbreviations
Statistical period
Abbreviation
Last one day
1d
Last N days
nd
Total days
td
Calendar week
cw
Calendar month
cm
Accrual to the current day
dtr
Accrual to the current hour
dhr
Table configuration standards
Internal tables
When you create an internal table, you must comply with the naming conventions for the table and its fields. In addition, you must set the lifecycle of the table based on your business requirements and add comments for the table and its fields. Take note of the following standards and recommendations when you configure an internal table:
Standards that you must comply with to ensure successful creation of the internal table
Add concise and accurate comments for the table and its fields to facilitate data development and analysis.
Specify the time_to_live_in_seconds property in the statement used to create the table. This property is used to specify the lifecycle of the table.
Specify the distribute_key property in the statement used to create the table. This property is used to specify the distribution key.
The selected distribution key must have sufficient cardinality and is most frequently used for the JOIN and GROUP BY operations. For example, in a table that stores product data, you may select the user_id and item_id fields as the distribution key. However, if the user_id field is most frequently used to join data, you must select only the user_id field as the distribution key.
Create tables involved in join queries within the same table group.
Use the same name and data type for the same entity ID in all fact tables and dimension tables. For example, if you specify user_id as the user ID in a transaction table, you must also use user_id rather than uid as the user ID in a dimension table. Consistency in data type eliminates the need to convert data types.
Use
ds
for the partition fields of all physical tables.
Recommendations
Specify one of the bitmap_columns, segment_key, and cluster_key properties in the statement used to create the table.
Do not specify the
dictionary_encoding_columns
property if you are unsure about the cardinality of fields in the table. This property is used to build dictionary mappings. You can execute the following statement to skip dictionary mappings:call set_table_property('table_name', 'dictionary_encoding_columns','')
Set the orientation property to column to set the table as a column-oriented table. The row-oriented storage type is also supported by Hologres.
NoteDo not set the table as a row-oriented table unless all the primary key columns of the table can be specified in each query on the table by using the equal to or IN operator. By default, if the orientation property is not specified, the column-oriented storage type is used.
Set the bitmap_columns property. This property is used to specify the fields for which Hologres builds bitmap indexes. Bitmap indexes can be used to filter stored data.
We recommend that you convert filter conditions to bitmap indexes. By default, bitmap indexes are built for all fields of the TEXT type.
We recommend that you do not specify a field that has a large number of values for this property, such as the user_id field. Instead, you can specify fields such as activity_id.
Set the event_time_column property to the fields that are used for real-time data writing, such as the field that stores event timestamps.
Set the clustering_key property to the field for which Hologres builds a clustered index. Hologres sorts data based on clustered indexes. You can accelerate RANGE and FILTER queries on the index field by using the clustered index. Only a single clustered index is supported for each table. This property is available to filter data by range. For example, this property can be used for the distribution of gross merchandise volume (GMV).
Foreign tables
Hologres allows you to use foreign tables to accelerate queries on data in source MaxCompute tables. This simplifies the data synchronization process. To improve computing performance, we recommend that you do not associate foreign tables with internal tables. Take note of the following standards and recommendations when you configure a foreign table for better foreign table management:
Standard that you must comply with: Follow the naming conventions and use the name of a source MaxCompute table suffixed with
ext
as the name of a foreign table.Recommendations
Retain the DDL statement that is used to create the foreign table and manage different versions of the table.
Do not associate the foreign table with an internal table. Instead, synchronize data from the foreign table to the internal table.
Views
Standard that you must comply with: Follow the naming conventions.
Recommendations
Schedule tasks to configure dependencies in the entire business process.
Create individual views for tables that store data at different granularities to reduce the amount of data in computing.
For example, you can create the cw (calendar week), cm (calendar month), nd (N days), and 1d (one day) views for four tables that store data at different granularities. If tables store data for different terminals, you can create the pc, wap, and app views. If tables store data for different data collection methods, you can create the ut and non-ut views.
Lifecycle of internal tables
Layer
Recommended lifecycle
DWD
For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
DWS
For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
DIM
For a large dimension table that is stored in ultimate storage mode, we recommend that you set the lifecycle to permanent. For a small dimension table, we recommend that you set the lifecycle to that of the source MaxCompute table.
The difference between large and small dimension tables lies in the size of a single partition. If the size of a partition in a table exceeds 1 TB, the table is considered a large dimension table.
Recommendations
For a partitioned table, we recommend that you write real-time data to the partition generated on the same day and configure an appropriate time-to-live (TTL) period based on the layer. Do not write updated data to a partition whose TTL period exceeds the specified TTL period.
Table groups (optional)
Each database has default table groups and a default shard count. You can create table groups or change the shard count based on your business requirements for better performance. Recommendations:
Do not create table groups unless necessary.
Create a table group with a large shard count for tables that contain large amounts of data.
Create a table group with a small shard count for a large number of tables that contain small amounts of data.
Put tables involved in join queries in the same table group.
Field setting standards
Field types
The following table describes the field type standards that you must comply with.
Field name or suffix
Description
Example
Type or format
user_id
An auto-increment user ID field.
user_id=232442843
int8
item_id
A field about the product.
item_id=63283278784383
int8
member_id
A member ID field.
member_id=b2b-dsajk2343821b
TEXT
*amt*
A field about the amount.
pay_ord_amt_1d_001=923.23
NUMERIC
*fee*
A field about the cost.
post_fee=923.23
NUMERIC
*cnt*
A field about the quantity.
pay_ord_byr_cnt_1d_001=923
int4/int8
is_*
A Boolean field.
is_pm=Y/is_pm=true
TEXT/BOOL
ds
A partition field.
ds=20210120
YYYYMMDD
Data types
The data types of Hologres are compatible with those of PostgreSQL. Hologres supports some of the data types of PostgreSQL. For more information about Hologres data types and the mappings between MaxCompute and Hologres data types, see Data types.
Currency and precision
In regions within the Chinese mainland, use CNY as the currency unit. In other regions, use USD as the currency unit. Unless specifically stated, do not round data related to the amount of money. This can prevent inconsistent results of subsequent calculation.
SQL syntax
Standards that you must comply with:
Use the
SELECT *
statement only for fields that need to be calculated in the outmost query and subquery. Specify field names for all operations.Use the COALESCE expression for empty fields and strings in WHERE clauses.
Recommendations
Use the count (distinct) method to set the distribution key. If you use the count (distinct) method multiple times, rewrite the statement. Example:
select count(distinct userid) , count(distinct case when stat_date = '20201111' then userid end) from t group by cate_id; Rewrite the statement to: select count(1), sum(c) from ( select userid , cate_id , cast(count(case when stat_date = '20201111' then 1 end) > 0) as c from t group by cate_id, userid ) t1 group by cate_id;
Execute the ANALYZE TABLE statement to analyze partitioned tables for offline scheduling tasks.
If your tables are required for an extended period of time, use the detach and attach methods to manage historical partitions to avoid radical changes in metrics.