Handle data expansion

Last Updated: Apr 02, 2024

This topic describes the causes of data expansion and measures that can be taken to handle data expansion issues.

Problem description

The amount of output data for a Fuxi task is much greater than the amount of its input data. You can view both amounts in the I/O Record and I/O Bytes attributes of the Fuxi task on the Logview page.

In the following figure, a Fuxi task reads about 1 GB of input data and produces about 1 TB of output data, roughly a 1,000-fold expansion. If a single instance must process 1 TB of data, processing efficiency drops significantly.

Figure: Amount of input and output data

Causes and measures

The following table describes the possible causes of this issue and related measures that can be taken.

Cause | Description | Measure
Cause: Bug in code
Description: The code is defective. Examples:
  • The JOIN condition in the code is incorrect, so the JOIN operation produces a Cartesian product.
  • A user-defined table-valued function (UDTF) is defective. As a result, the amount of output data is much greater than the amount of input data.
Measure: Fix bugs in the code.
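
As a rough illustration of the incorrect JOIN condition, the following Python sketch uses the standard-library sqlite3 module rather than MaxCompute. The orders and customers tables and their columns are hypothetical. It compares the row count of a JOIN whose condition mistakenly references only the left table with the row count of the corrected JOIN.

```python
import sqlite3

# Hypothetical tables, for illustration only (not from the original topic).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i % 10) for i in range(100)],      # 100 orders spread over 10 customers
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"c{i}") for i in range(10)],      # 10 customers
)

# Buggy condition: both sides refer to the left table, so the condition is
# true for every pair of rows and the JOIN degenerates into a Cartesian product.
buggy_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c "
    "ON o.customer_id = o.customer_id"
).fetchone()[0]

# Corrected condition: each order matches exactly one customer.
fixed_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c "
    "ON o.customer_id = c.customer_id"
).fetchone()[0]

print(buggy_rows, fixed_rows)  # 1000 100
```

Comparing the output row count with the input row count in this way is a quick check for an unintended Cartesian product before a job is run on full-size data.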
Cause: Improper aggregation operations
Description: Most aggregation operations are computed recursively by merging intermediate results. In most cases, the intermediate results are small and the computational complexity is low, so these operations are not time-consuming even on large amounts of data. However, some aggregation operations, such as collect_list and median, must retain all intermediate result data. If these operations are used together with other aggregation operations, data expansion may occur. Examples:
  • If a SELECT statement performs aggregation operations and uses DISTINCT to remove duplicates in several different dimensions, data expansion occurs each time duplicates are removed.
  • If you use GROUPING SETS, CUBE, or ROLLUP, the intermediate result data may expand to many times the original data size.
Measure: Avoid aggregation operations that cause data expansion.
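
To see why GROUPING SETS, CUBE, and ROLLUP can multiply intermediate data, it can help to count the grouping sets they expand to. The Python sketch below is a simplified model, not the actual MaxCompute execution plan: it assumes that each input row is emitted once per grouping set before aggregation, and all function names are hypothetical.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every grouping set that CUBE over `columns` expands to (2**n sets)."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def rollup_grouping_sets(columns):
    """Return every grouping set that ROLLUP over `columns` expands to (n + 1 sets)."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


def intermediate_rows(input_rows, grouping_sets):
    """Estimate intermediate rows, assuming one copy of each row per grouping set."""
    return input_rows * len(grouping_sets)


cols = ["region", "product", "day"]  # hypothetical grouping columns
cube_sets = cube_grouping_sets(cols)
rollup_sets = rollup_grouping_sets(cols)

print(len(cube_sets))                            # 8
print(len(rollup_sets))                          # 4
print(intermediate_rows(1_000_000, cube_sets))   # 8000000
```

Under this assumption, each column added to a CUBE doubles the intermediate data, so listing only the grouping sets that are actually needed keeps the expansion smaller.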
Cause: Improper JOIN operations
Description: For example, the left table of a JOIN operation contains a large amount of population data, and the right table is a dimension table that contains hundreds of rows for each gender. If you join the two tables on gender, the data in the left table may expand to hundreds of times its original size.
Measure: To prevent data expansion, aggregate the rows of the right table before you perform the JOIN operation.
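
The effect of aggregating the right table before the JOIN operation can be sketched as follows. This example again uses the Python standard-library sqlite3 module rather than MaxCompute, and the population and gender_dim tables, their columns, and the SUM aggregate are assumptions chosen only for illustration.

```python
import sqlite3

# Hypothetical tables, for illustration only (not from the original topic).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE population (person_id INTEGER, gender TEXT);
    CREATE TABLE gender_dim (gender TEXT, score INTEGER);
""")
conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [(i, "F" if i % 2 == 0 else "M") for i in range(1000)],  # 1,000 people
)
conn.executemany(
    "INSERT INTO gender_dim VALUES (?, ?)",
    [(g, s) for g in ("F", "M") for s in range(200)],        # 200 rows per gender
)

# Direct JOIN: every person matches all 200 dimension rows for their gender.
direct_rows = conn.execute("""
    SELECT COUNT(*)
    FROM population p
    JOIN gender_dim d ON p.gender = d.gender
""").fetchone()[0]

# Pre-aggregated JOIN: the right table is reduced to one row per gender first.
preagg_rows = conn.execute("""
    SELECT COUNT(*)
    FROM population p
    JOIN (
        SELECT gender, SUM(score) AS total_score
        FROM gender_dim
        GROUP BY gender
    ) d ON p.gender = d.gender
""").fetchone()[0]

print(direct_rows, preagg_rows)  # 200000 1000
```

Pre-aggregation only applies when the query needs an aggregate of the right table per join key; if the individual right-table rows are required, the expansion is inherent to the query.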