By Weisong
In the new retail industry, granularity to store order transaction data tables in the data warehouse is the most common issue. There are three common granularities: (1) store order data by Stock Keeping Unit (SKU); (2) store order data by Standard Product Unit (SPU); (3) store order data by transaction order. The third mode stores order data by transaction order, which is more suitable for transaction detail data tables than for data warehouse storage. Therefore, this article mainly describes how to merge an SKU table and an SPU table into one table.
Why do the data tables store data using two different granularities? Generally, it is caused by two determinants. The first determinant is related to the characteristics of the data analysis business for the retail industry. The second is related to data volume.
1) Characteristics of the Data Analysis Business for the Retail Industry
2) Data Volume
From the perspectives of industry characteristics and data volume, the preceding sections analyze why a company often has SKU and SPU sales data tables. In addition, there are many situations where SPU is used for inventory, SKU for sales, SPU for unified distribution, and SKU for delivery.
However, in the field of data warehouse analysis, it is often necessary to merge data tables using different granularities to obtain integrated analysis results. For example, make an integrated analysis of the current year's commodity sales data and historical cumulative commodity sales data, or make a correlation analysis of the current year's commodity sales data and the current year's inventory data. In these cases, integrated analysis or correlation analysis is often performed on the SKU table and SPU table.
This article introduces a simple data preprocessing method that merge an SKU table and an SPU table into one table, as well as provides some ideas for easy reference.
Consider two data tables, sku_table and spu_table, storing commodity sales data by SKU and SPU respectively.
--Statements to create tables
create table if not exists test_project.sku_table (
shopno string comment 'shop number',
prodno string comment 'product number',
sizeno string comment 'size number',
rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity'
) comment 'SKU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
create table if not exists test_project.spu_table (
shopno string comment 'shop number',
prodno string comment 'product number',
rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity'
) comment ' SPU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
Note: In the preceding data table, prodno is the SPU. The combination of prodno and sizeno becomes SKU.
For sku_table, this table uses shopno + prodno + sizeno as the primary key to indicate one data record. One piece of commodity sales data is recorded with "store + SPU + specification/color" of one sales record. This is common in case of new retail industry’s data warehouse.
For spu_table, this table uses shopno + prodno as the composite primary key to indicate one piece of data. One piece of commodity sales data is recorded with "store + SPU" of one sales record, which is common in the data warehouse of the new retail industry.
select * from test_project.sku_table where ds = '20200417';
select * from test_project.spu_table where ds = '20200417';
First, create a data table to store the merged data as shown below.
create table if not exists test_project.sku_spu_merge_table (
shopno string comment 'shop number',
prodno string comment 'product number',
sizeno string comment 'size number',
rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity',
rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity'
) comment 'merge SKU table (the current year’s sales) and SPU table (historical cumulative sales) into one table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
Now, import the two tables to the test_project.sku_spu_merge_table table as shown below.
insert overwrite table
test_project.sku_spu_merge_table partition (ds = '20200417')
select
COALESCE (v1.shopno, v2.shopno) as shopno,
COALESCE (v1.prodno, v2.prodno) as prodno,
v1.sizeno,
v1.rtam_lastyy,
v2.rtam_lishileiji
from
(
select
shopno,
prodno,
sizeno,
rtam_lastyy,
row_number() over ( partition by shopno,prodno order by sizeno) size_order
from
test_project.sku_table
where ds = '20200417'
) v1
full outer join
(
select
shopno,
prodno,
rtam_lishileiji,
1 join_order
from
test_project.spu_table
where ds = '20200417'
) v2
on (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order);
Let’s describe the SQL statement for more clarity.
1) First, use the row_number() over (partition by shopno,prodno order by sizeno) function to group sku_table by shopno,prodno (that is, store + SPU), then sort by sizeno, and assign the sorting result to size_order.
2) Next, add the last column join_order to sku_table and assign the fixed value 1.
3) Finally, perform a full outer join to sku_table and spu_table, with (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order) as the join condition. In particular, the condition v1.size_order = v2.join_order is to associate the first row (that is, size_order=1) of sku_table after it is grouped by shopno,prodno with spu_table (that is, join_order=1).
The following figure shows the final result.
Finally, sku_table and spu_table are merged into one table.
The advantage of the preceding method is that when integrated data analysis is performed, such as summation and averaging, the system does not encounter calculation errors due to repeated calculation of SPU data.
The data integration method for SKU and SPU data presented in this article is only a reference. You may build on the ideas mentioned in this article using other methods. For example, SPU sales data is calculated according to the number of SKUs under the SPU and then assigned to the merged data table. Another idea is to perform correlation analysis for the data table by using SPU granularity and the order list to obtain the sales data of each SKU under the SPU and then merge both the data tables using SKU granularity. Just decide which scheme to use based on the actual business situation and project requirements.
The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
2,599 posts | 765 followers
Followandy.zh - November 7, 2019
Alibaba Clouder - January 21, 2020
AliCloud-TechLab - August 25, 2021
ITDSN - May 11, 2020
Alibaba Cloud Community - September 10, 2024
Alibaba Cloud Community - June 1, 2023
2,599 posts | 765 followers
FollowAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreSecure and easy solutions for moving you workloads to the cloud
Learn MoreTair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.
Learn MoreMore Posts by Alibaba Clouder