
FAQ about small file optimization and job diagnostics

Last Updated: Oct 18, 2024

This topic provides answers to some frequently asked questions about small file optimization and job diagnostics.

Category: Small file optimization

  • In which scenarios are small files generated for MaxCompute? How do I resolve the small file issue?

Category: Job diagnostics

  • What do I do if an error is reported when I perform concurrent insert operations?

  • What do I do if the ODPS-0130121 error is reported when a job is running?

  • When I view the properties of a task in DataWorks Operation Center, the displayed task status is suspended. Why?

  • When I perform operations in DataWorks Data Integration, a message always appears in the upper-right corner and prompts me to check whether the Order field is deleted. Why?

  • When I run the odpscmd -f command to execute an SQL file, the execution fails but no error message is returned. What do I do?

  • When I use DataWorks, a large number of data synchronization tasks are in the waiting state. Why?

  • When a Shell task is executed, one of the servers added by using the scheduling resource management feature is always displayed in the stopped state. The server is still in the stopped state even after the re-initialization. Why?

  • Scenarios:

    MaxCompute uses Apsara Distributed File System (Pangu) for block storage. In most cases, files whose size is smaller than the block size are considered small files. The default block size is 64 MB.

    Small files may be generated in the following scenarios:

    • A large number of small files are generated in the Reduce stage.

    • Small files are generated during Tunnel-based data collection.

    • Temporary files generated during job execution and expired files retained in the recycle bin may be small files. The files are classified into the following types:

      • TABLE_BACKUP: the tables that are retained in the recycle bin for more than a specific number of days.

      • FUXI_JOB_TMP: the temporary directories that are generated when the job is running.

      • TMP_TABLE: the temporary tables that are generated when the job is running.

      • INSTANCE: the logs that are retained in metadata tables when the job is running.

      • LIFECYCLE: the tables or partitions that have reached the end of their lifecycle.

      • INSTANCEPROFILE: the profiling information that is generated after the job is submitted and executed.

      • VOLUME_TMP: the data that has no metadata but has a mapped directory on Apsara Distributed File System (Pangu).

      • TEMPRESOURCE: the one-time temporary resource files that are used by user-defined functions (UDFs).

      • FAILOVER: the temporary files that are retained when system failovers occur.

    You can run the following command to query the number of small files in a table:

    desc extended <table_name>;
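
    For example, assuming a hypothetical partitioned table named sale_detail with partition columns sale_date and region, the command can be run as follows. The extended information in the output typically includes the number of physical files, and for a partitioned table you can also specify a partition:

    -- View extended information, including the file count, for the whole table (hypothetical name).
    desc extended sale_detail;
    -- View extended information for a specific partition.
    desc extended sale_detail partition (sale_date='2013', region='hangzhou');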
  • Impact:

    A large number of small files have the following impacts:

    • A large number of map instances must be started. By default, one small file corresponds to one instance, so a large number of small files wastes resources and degrades overall execution performance.

    • A large number of small files cause high loads on Apsara Distributed File System (Pangu) and affect the efficient use of storage space. In extreme cases, Apsara Distributed File System (Pangu) may become unavailable.

  • Solutions:

    Different solutions are provided to handle small files that are generated in different scenarios.

    • Small files that are generated in the Reduce stage: Execute the INSERT OVERWRITE statement on the source table or partition, or write the data to a new table and delete the source table. See the SQL sketch at the end of this list.

    • Small files that are generated during Tunnel-based data collection:

      • When you call the Tunnel SDK, upload the buffered data each time the file cache reaches 64 MB.

      • When you use the MaxCompute client, do not upload small files frequently. We recommend that you accumulate files and upload them together after a large number of files are collected.

      • When you import data to a partitioned table, we recommend that you configure lifecycles for the partitions in the table. This way, expired data can be automatically cleared.

      • Execute the INSERT OVERWRITE statement on the source table or partition.

      • Run the following command to merge small files:

        ALTER TABLE <table_name> [PARTITION (<partition_spec>)] MERGE SMALLFILES;
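
    The following SQL sketch shows how these solutions might look for a hypothetical partitioned table named log_detail with columns col1 and col2 and a partition column ds; all names are placeholders rather than part of the original answer.

    -- Rewrite a partition onto itself so that its data is compacted into fewer, larger files.
    INSERT OVERWRITE TABLE log_detail PARTITION (ds='20240101')
    SELECT col1, col2 FROM log_detail WHERE ds='20240101';

    -- Merge the small files of a specific partition.
    ALTER TABLE log_detail PARTITION (ds='20240101') MERGE SMALLFILES;

    -- Configure a lifecycle (in days) so that expired partitions can be cleared automatically.
    ALTER TABLE log_detail SET LIFECYCLE 30;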

What do I do if an error is reported when I perform concurrent insert operations?

  • Problem description

    The following error message is returned when concurrent insert operations are performed:

    ODPS-0110999: Critical! Internal error happened in commit operation and rollback failed, possible breach of atomicity - Rename directory failed during DDLTask.       
  • Possible causes

    MaxCompute does not provide concurrency control for jobs that modify the same table. If multiple jobs modify the table at the same time, a concurrency conflict may occur, with low probability, when the META module is accessed. As a result, this error message is returned. The same issue can also occur when ALTER and INSERT operations are performed on the table at the same time.

  • Solutions

    We recommend that you change the table to a partitioned table to ensure that each SQL statement inserts data into a separate partition. This way, you can perform concurrent operations on the table.
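
    A minimal SQL sketch of this approach, assuming a hypothetical partitioned table named orders with a partition column ds and a hypothetical source table named orders_staging; each concurrent job writes to its own partition, so the jobs do not modify the same partition at the same time.

    -- Create a partitioned table (hypothetical schema).
    CREATE TABLE IF NOT EXISTS orders (order_id BIGINT, amount DOUBLE)
    PARTITIONED BY (ds STRING);

    -- Job 1 writes only to partition ds='20240101'.
    INSERT INTO TABLE orders PARTITION (ds='20240101')
    SELECT order_id, amount FROM orders_staging WHERE ds='20240101';

    -- Job 2 writes only to partition ds='20240102' and can therefore run concurrently with Job 1.
    INSERT INTO TABLE orders PARTITION (ds='20240102')
    SELECT order_id, amount FROM orders_staging WHERE ds='20240102';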

What do I do if the ODPS-0130121 error is reported when a job is running?

  • Problem description

    When a job is running, the following error message is returned:

    FAILED:ODPS-0130121:Invalid argument type - line 1:7 'testfunc':in function class
  • Possible causes

    The data types of input parameters for a built-in function are invalid.

  • Solutions

    We recommend that you check the data types of the input parameters and make sure that they meet the requirements of the built-in function.
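
    A hedged illustration of this kind of fix, assuming a hypothetical table t with STRING columns start_str and end_str: if a built-in function such as datediff expects DATETIME arguments but receives STRING values, cast the values explicitly so that the argument types match the function signature.

    -- Cast the STRING values to DATETIME before passing them to datediff.
    SELECT datediff(CAST(end_str AS DATETIME), CAST(start_str AS DATETIME), 'dd') AS day_diff
    FROM t;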

When I view the properties of a task in DataWorks Operation Center, the displayed task status is suspended. Why?

Check whether the task is started based on the project configuration.

  • If the task is started, check whether the ancestor task of the task fails.

  • If the task is not started, right-click the worker node to check whether the node is running properly, or rename the task and reconfigure its scheduling properties.

When I perform operations in DataWorks Data Integration, a message always appears in the upper-right corner and prompts me to check whether the Order field is deleted. Why?

Check whether the Order field in the database is deleted.

Clear the cache, reconfigure or re-create the synchronization task, and then verify the task status again.

When I run the odpscmd -f command to execute an SQL file, the execution fails but no error message is returned. What do I do?

Obtain the runtime logs or error messages of the task to identify the cause of the issue.

For example, when you run the odpscmd -f command in a shell, the logs are printed to the shell and indicate that the call is normal. However, when the command is called from crontab, an error is reported and no log is generated.

To address this issue, redirect the output of the task execution in crontab to a log file. If an issue occurs, obtain the task logs from the log file. For example, run the following command:

    odpscmd -f xxx.sql >> path/to/odpscmd.log 2>&1

When I use DataWorks, a large number of data synchronization tasks are in the waiting state. Why?

If synchronization tasks are in the waiting state when they use shared scheduling resources, optimize batch synchronization tasks to maximize the synchronization speed.

You can also add scheduling resources. For more information, see Create and use a custom resource group for Data Integration.

When a Shell task is executed, one of the servers added by using the scheduling resource management feature is always displayed in the stopped state. The server is still in the stopped state even after the re-initialization. Why?

  • If the internal network for connecting cloud products is used, check whether the machine name in the registration information is the real name of the machine. You can run the hostname command on the ECS instance to obtain the machine name. Custom names are not supported.

  • If a virtual private cloud (VPC) is used, check whether the ECS hostname is modified. Take note that the ECS hostname is not the same as the instance name. If the ECS hostname is modified, run the cat /etc/hosts command on the ECS instance to check whether the binding is valid.