Handle data expansion

This topic describes the causes of data expansion and measures that can be taken to handle data expansion issues.

Problem description

The amount of output data of a Fuxi task is much greater than the amount of its input data. You can check both amounts in the I/O Record and I/O Bytes attributes of the Fuxi task on the Logview page.

In the following figure, the amount of input data for a Fuxi task is about 1 GB and the amount of output data after processing is about 1 TB. If all 1 TB of that data is then processed on a single instance, data processing efficiency drops significantly.

Causes and measures

The following table describes the possible causes of this issue and related measures that can be taken.

Cause	Description	Measure
Bug in code	The code is defective. For example, the `JOIN` condition is incorrect and the query effectively computes a Cartesian product, or a user-defined table-valued function (UDTF) is implemented incorrectly. As a result, the amount of output data is much greater than the amount of input data.	Fix the bugs in the code.
Improper aggregation operations	Most aggregation operations can be computed recursively by merging intermediate results. In most cases, the intermediate results are small and the computational complexity is low, so these operations are not time-consuming even on large amounts of data. However, some aggregation operations, such as `collect_list` and `median`, must retain all of the intermediate result data. If these operations are used together with other aggregation operations, data expansion may occur. Examples: If a `SELECT` statement performs aggregation and also applies `DISTINCT` to remove duplicates in several different dimensions, the data expands each time duplicates are removed. If you use `GROUPING SETS`, `CUBE`, or `ROLLUP`, the intermediate result data may grow to many times the original data size.	Avoid aggregation operations that cause data expansion.
Improper `JOIN` operations	For example, the left table of a `JOIN` operation contains a large amount of population data, and the right table is a dimension table that contains hundreds of rows for each gender. If you join the two tables on gender, each row of the left table matches hundreds of rows of the right table, so the data from the left table may expand to hundreds of times its original size.	To prevent data expansion, aggregate the rows of the right table before you perform the `JOIN` operation.
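
The `JOIN` causes in the table can be reproduced on a small scale. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the SQL engine; it is not MaxCompute code. The table names (`population`, `gender_dim`), the column names, and the row counts are hypothetical and chosen only for illustration. It compares the row count of a join on a non-unique key, a join against a pre-aggregated right table, and an accidental Cartesian product.

```python
import sqlite3

# In-memory database used only as a stand-in SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE population (person_id INTEGER, gender TEXT);
    CREATE TABLE gender_dim (gender TEXT, attr TEXT, score INTEGER);
""")

# Hypothetical left table: 1,000 people, alternating gender.
conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [(i, "F" if i % 2 == 0 else "M") for i in range(1000)],
)
# Hypothetical right table: 100 rows per gender, so the join key is not unique.
conn.executemany(
    "INSERT INTO gender_dim VALUES (?, ?, ?)",
    [(g, f"attr_{k}", k) for g in ("F", "M") for k in range(100)],
)


def count(sql):
    """Run a query that returns a single COUNT(*) value."""
    return conn.execute(sql).fetchone()[0]


input_rows = count("SELECT COUNT(*) FROM population")

# Direct join on gender: every left row matches 100 right rows.
naive_rows = count("""
    SELECT COUNT(*)
    FROM population p
    JOIN gender_dim d ON p.gender = d.gender
""")

# Aggregate the right table to one row per gender first, then join.
preagg_rows = count("""
    SELECT COUNT(*)
    FROM population p
    JOIN (
        SELECT gender, SUM(score) AS total_score
        FROM gender_dim
        GROUP BY gender
    ) d ON p.gender = d.gender
""")

# No join condition at all: a Cartesian product of both tables.
cross_rows = count(
    "SELECT COUNT(*) FROM population p CROSS JOIN gender_dim d"
)

print("input rows:         ", input_rows)
print("join on gender:     ", naive_rows)
print("pre-aggregated join:", preagg_rows)
print("Cartesian product:  ", cross_rows)
```

In this sketch the direct join multiplies the left table by the number of matching right rows per key, while the pre-aggregated join keeps the output at the size of the left table. Whether pre-aggregation is acceptable depends on which right-table columns the query actually needs.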