This topic describes the differences between the local mode and distributed mode in which MapReduce jobs run. It also provides examples of MapReduce jobs in local mode.
Introduction to the local mode
To run a job in local mode, specify the -local option in the JAR command. The client then simulates the running of the job so that you can debug the job locally.
While the job runs, the client downloads the following items from MaxCompute: the metadata and data of the input table, the metadata of the output table, and the resources that are required for local debugging. The downloaded data is saved to a local directory named warehouse.
After the job is completed, the computing results are saved to a file in the warehouse directory. If the input table and required resources already exist in the warehouse directory, MapReduce references the local data and files the next time the job runs, instead of downloading them again.
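The reuse rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the client's actual implementation. The function names are hypothetical, and the on-disk layout (a project directory that contains a __tables__ directory) is assumed from the directory structure shown later in this topic.

```python
import tempfile
from pathlib import Path


def local_table_dir(warehouse, project, table):
    # Assumed layout: <warehouse>/<project>/__tables__/<table>
    return Path(warehouse) / project / "__tables__" / table


def needs_download(warehouse, project, table):
    """Return True when no local copy of the table exists yet."""
    return not local_table_dir(warehouse, project, table).is_dir()


with tempfile.TemporaryDirectory() as warehouse:
    # First run: nothing is cached, so the table would be downloaded.
    print(needs_download(warehouse, "my_project", "wc_in"))
    # Simulate the download by creating the table directory.
    local_table_dir(warehouse, "my_project", "wc_in").mkdir(parents=True)
    # Later runs: the directory exists, so the local copy is reused.
    print(needs_download(warehouse, "my_project", "wc_in"))
```

The check depends only on whether the table directory exists, which matches the note at the end of this topic: an existing local directory is used as is.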
Differences between the local mode and distributed mode
A MapReduce job that runs in local mode starts multiple map and reduce tasks to process data. These tasks run in sequence. Local mode also differs from distributed mode in the following aspects:
- Rows in the input table: A maximum of 100 rows of data can be downloaded in local mode.
- Resource usage: In distributed mode, MaxCompute limits the size of resources that can be referenced. For more information, see MapReduce limits. However, no limits are imposed on the size of resources in local mode.
- Security: MaxCompute MapReduce and user-defined functions (UDFs) are limited by a Java sandbox in distributed mode. However, no limits are imposed in local mode.
Examples
odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
Summary:
counters: 10
map-reduce framework
combine_input_groups=2
combine_output_records=2
map_input_bytes=4
map_input_records=1
map_output_records=2
map_output_[wc_out]_bytes=0
map_output_[wc_out]_records=0
reduce_input_groups=2
reduce_output_[wc_out]_bytes=8
reduce_output_[wc_out]_records=2
OK
For more information about the sample code of WordCount, see WordCount.
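To relate the counters in the summary to the data, the following Python sketch simulates a word count on a single input row. It is not the WordCount sample code. It assumes that the map step emits one (word, 1) pair for each whitespace-separated token in each column, and that the input table contains the single row 0,2 that is shown in the data file later in this topic. The function name word_count is hypothetical.

```python
from collections import Counter


def word_count(rows):
    """Simulate a word count, running the map step and then the reduce step."""
    counters = Counter()
    map_output = []
    # Map step: emit (word, 1) for every token in every column of every row.
    for row in rows:
        counters["map_input_records"] += 1
        for column in row:
            for word in str(column).split():
                map_output.append((word, 1))
                counters["map_output_records"] += 1
    # Reduce step: group the pairs by word and sum the counts.
    totals = {}
    for word, one in map_output:
        totals[word] = totals.get(word, 0) + one
    counters["reduce_input_groups"] = len(totals)
    result = sorted(totals.items())
    counters["reduce_output_records"] = len(result)
    return result, dict(counters)


result, counters = word_count([("0", "2")])
print(result)
print(counters)
```

Under these assumptions, one input record produces two map output records and two reduce output records, which agrees with map_input_records=1, map_output_records=2, and reduce_output_[wc_out]_records=2 in the summary.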
After the job runs, the warehouse directory has the following structure:
<warehouse>
   |____my_project (project directory)
          |____ <__tables__>
          |       |__wc_in (table data directory)
          |       |      |____ data (file)
          |       |      |
          |       |      |____ <__schema__> (file)
          |       |__wc_out (table data directory)
          |              |____ data (file)
          |              |
          |              |____ <__schema__> (file)
          |
          |____ <__resources__>
                  |
                  |___table_resource_name (table resource)
                  |      |____<__ref__>
                  |
                  |___ file_resource_name (file resource)
- Directories at the same level as my_project indicate projects. Directories at the same level as wc_in and wc_out indicate data tables. The table data that you read or write by using the JAR command is downloaded to directories at this level.
- The <__schema__> file stores the metadata of a table. The following code defines the file format:
  project=local_project_name
  table=local_table_name
  columns=col1_name:col1_type,col2_name:col2_type
  partitions=p1:STRING,p2:BIGINT -- In this example, you do not need to specify this field.
  Separate the name and data type of a column with a colon (:). Separate columns with commas (,). The project and table names, in the format project_name.table_name, must be declared at the beginning of the <__schema__> file. Separate the declaration and the column definitions with a comma (,). Example:
  project_name.table_name,col1_name:col1_type,col2_name:col2_type,...
- The data file in each table directory under <__tables__> stores the table data. The number of columns and the column data must match the definition in the <__schema__> file. Separate columns with commas (,).
  The <__schema__> file in the wc_in directory contains the following data:
  my_project.wc_in,key:STRING,value:STRING
  The data file contains the following data:
  0,2
  The client downloads the metadata and part of the data of the table from MaxCompute, and saves them to the preceding files. The next time you run this example program, the client uses the data in the wc_in directory, instead of downloading it again.
  Note: Data can be downloaded from MaxCompute only for MapReduce jobs that run in local mode.
  The <__schema__> file in the wc_out directory contains the following data:
  my_project.wc_out,key:STRING,cnt:BIGINT
  The data file contains the following data:
  0,1
  2,1
  The client downloads the metadata of the wc_out table from MaxCompute, and saves the metadata to the <__schema__> file. After a job is completed, the results are saved to the data file.
  Note:
  - You can also edit the <__schema__> and data files yourself, and save the files in table directories.
  - If you run a job in local mode and the client detects that the table directory exists, the client does not download the information of this table from MaxCompute. This means that a local table directory can describe a table that does not exist in MaxCompute.
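Because existing table directories are used as is, you can prepare a local table by hand. The following Python sketch shows one way to do so. It parses the single-line schema format shown above (project_name.table_name followed by name:type pairs) and writes a schema file and a data file. The function names are hypothetical. The on-disk names __tables__ and __schema__ (without angle brackets) are assumptions, and partitions are not handled.

```python
import tempfile
from pathlib import Path


def parse_schema(line):
    """Split 'project.table,col:TYPE,...' into (project, table, columns)."""
    declaration, *column_defs = line.strip().split(",")
    project, table = declaration.split(".", 1)
    columns = [tuple(item.split(":", 1)) for item in column_defs]
    return project, table, columns


def write_local_table(warehouse, schema_line, rows):
    """Create a table directory that holds a schema file and a data file."""
    project, table, columns = parse_schema(schema_line)
    for row in rows:
        # Every data row must have as many fields as the schema has columns.
        if len(row) != len(columns):
            raise ValueError(f"row {row!r} does not match {len(columns)} columns")
    # Assumed on-disk names: __tables__ and __schema__ (no angle brackets).
    directory = Path(warehouse) / project / "__tables__" / table
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "__schema__").write_text(schema_line + "\n")
    (directory / "data").write_text("".join(",".join(row) + "\n" for row in rows))
    return directory


with tempfile.TemporaryDirectory() as warehouse:
    table_dir = write_local_table(
        warehouse, "my_project.wc_in,key:STRING,value:STRING", [("0", "2")]
    )
    print(parse_schema((table_dir / "__schema__").read_text()))
    print((table_dir / "data").read_text().strip())
```

The row-length check mirrors the rule that the number of columns in the data file must match the schema definition. A real setup may also need to validate the column data types, which this sketch leaves out.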