ossimport is a tool for migrating data to Object Storage Service (OSS). You can deploy ossimport on a local server or an Elastic Compute Service (ECS) instance to migrate data from local storage or third-party cloud storage to OSS. Alibaba Cloud also provides Data Online Migration and the cross-region replication feature, which let you migrate data to the cloud and replicate data between OSS buckets through a GUI.
ossimport does not verify data after migration and therefore does not guarantee data consistency and integrity. After a migration task is complete, make sure that you verify data consistency between the migration source and destination.
If you delete the source data without verifying data consistency between the source and destination, you are responsible for any losses and consequences that arise.
To migrate data from third-party data sources, we recommend that you use Data Online Migration.
To replicate objects between OSS buckets in near real time, we recommend that you use the cross-region replication feature of OSS.
Features
ossimport supports a wide range of data sources, such as on-premises file systems, Qiniu Cloud Object Storage (KODO), Baidu Object Storage (BOS), Amazon Simple Storage Service (Amazon S3), Azure Blob Storage, UPYUN Storage Service (USS), Tencent Cloud Object Service (COS), Kingsoft Standard Storage Service (KS3), HTTP and HTTPS URL lists, and Alibaba Cloud OSS.
ossimport supports the standalone deployment and distributed deployment modes. The standalone mode is easy to deploy and use. The distributed mode is suitable for large-scale data migration.
Note: In standalone mode, only one bucket can be migrated at a time.
ossimport supports resumable upload.
ossimport supports traffic throttling.
ossimport supports migration of objects that are modified later than a specific point in time and objects whose names start with a specific prefix.
ossimport supports concurrent data downloads and uploads.
Billing rules
ossimport is available free of charge. However, if you use ossimport to migrate data over the Internet, you may be charged outbound traffic fees and request fees on the data source side and request fees on the OSS side. If transfer acceleration is used during migration, you are additionally charged transfer acceleration fees.
Usage notes
Migration speed
The migration speed of ossimport varies based on several factors, such as the read bandwidth of the data source, local network bandwidth, and size of the files to be migrated. Migration of files smaller than 200 KB is slow due to the large number of IOPS required for processing numerous small files.
Migration of archived files
If you want to migrate archived files, you must restore the archived files before you can migrate the files.
Data staging
When you use ossimport to migrate data, data streams are first transferred to the local memory and then uploaded to the destination.
Source data retention
During a data migration task, ossimport performs only read operations on source data. It does not perform write operations. This ensures that the source data is not modified or deleted.
Migration by using ossutil
To migrate data smaller than 30 TB in size, we recommend that you use ossutil. ossutil is a lightweight and easy-to-use tool. You can use the -u (--update) and --snapshot-path options to migrate files incrementally. For more information, see Overview.
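For example, a sketch of an incremental ossutil upload might look like the following. The bucket name and paths are placeholders, and exact option behavior may vary by ossutil version:

```shell
# Recursively upload, copying only files that are missing at the destination
# or whose local copies are newer, and record file states in a snapshot
# directory so that subsequent runs can skip unchanged files.
ossutil cp -r /data/example/ oss://examplebucket/data/in/oss/ \
  --update --snapshot-path /tmp/ossutil-snapshot
```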
Runtime environment
ossimport can be deployed on a Linux or Windows system that meets the following requirements:
Windows 7 or later
CentOS 6 or CentOS 7
Java 7 or Java 8
ossimport cannot be deployed on Windows in distributed mode.
Deployment modes
ossimport supports the standalone and distributed deployment modes.
Standalone deployment is suitable for the migration of data smaller than 30 TB in size. To deploy ossimport in standalone mode, download the ossimport package for standalone deployment. You can deploy ossimport on a device that can access the data to be migrated and the OSS bucket to which you want to migrate the data.
Distributed deployment is suitable for the migration of data larger than 30 TB in size. To deploy ossimport in distributed mode, download the ossimport package for distributed deployment. You can deploy ossimport on multiple devices that can access the data to be migrated and the OSS bucket to which you want to migrate the data.
Note: To reduce the time that is required to migrate large amounts of data, you can deploy ossimport on multiple ECS instances that reside in the same region as your OSS bucket and connect the source server to a virtual private cloud (VPC) by using Express Connect. This way, you can use the ECS instances to migrate data over the VPC at a faster migration speed.
You can also use ossimport to transmit data over the Internet. In this case, the transmission speed is affected by the bandwidth of your on-premises machine.
Standalone mode
The Master, Worker, Tracker, and Console modules are compressed into ossimport2.jar and run on the same device. The system has only one worker.
The following code shows the file structure in standalone mode:
```
ossimport
├── bin
│   └── ossimport2.jar    # The JAR package that contains the Master, Worker, Tracker, and Console modules.
├── conf
│   ├── local_job.cfg     # The job configuration file in standalone mode.
│   └── sys.properties    # The configuration file that contains system parameters.
├── console.bat           # The Windows command-line tool used to run commands step by step.
├── console.sh            # The Linux command-line tool used to run commands step by step.
├── import.bat            # The script that runs a migration task on Windows in one step based on conf/local_job.cfg. It encapsulates all steps of a migration task: starting the task, migrating data, verifying data, and retrying the migration task.
├── import.sh             # The Linux counterpart of import.bat.
├── logs                  # The directory that contains logs.
└── README.md             # Instructions on how to use ossimport. We recommend that you read this file before you use ossimport.
```
import.bat and import.sh are scripts that run a migration task in one step. Run these scripts after you modify the local_job.cfg configuration file. console.bat and console.sh are command-line tools used to run commands step by step.
Run scripts and commands in the ossimport directory, which is the directory that contains the *.bat and *.sh files.
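On Linux, for example, a one-step standalone migration might look like the following, assuming that the package is decompressed to ./ossimport and that conf/local_job.cfg has been edited:

```shell
cd ossimport
bash import.sh    # starts the task, migrates and verifies data, and retries the task if needed
```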
Distributed mode
The ossimport architecture in distributed mode consists of a master and multiple workers. The following diagram shows the structure:
```
Master --------- Job --------- Console
    |
    |
TaskTracker
    |_____________________
    |Task     | Task      | Task
    |         |           |
Worker      Worker       Worker
```
| Component | Description |
| --- | --- |
| Master | Splits a migration task into multiple subtasks by data size and number of files. You can specify the data size and number of files for each subtask in the sys.properties file. |
| Worker | Runs subtasks and migrates the files that are assigned to it. |
| TaskTracker | Distributes subtasks and tracks subtask status. The TaskTracker is abbreviated as tracker. |
| Console | Interacts with users, receives command input, and displays command output. The console supports system management commands, including deploy, start, and stop, and job management commands, including submit, retry, and clean. |
| Job | The data migration task submitted by users. Each task corresponds to one job.cfg configuration file. |
| Task | A subtask that migrates a portion of the files. A migration task can be divided into multiple subtasks by data size and number of files. The minimum unit for dividing a migration task into subtasks is a file. One file is not assigned to multiple subtasks. |
In distributed deployment, you can start multiple devices and run only one worker on each device to migrate data. Subtasks are evenly assigned to workers, and a worker runs multiple subtasks.
The following code shows the file structure in distributed mode:
```
ossimport
├── bin
│   ├── console.jar     # The JAR package for the Console module.
│   ├── master.jar      # The JAR package for the Master module.
│   ├── tracker.jar     # The JAR package for the Tracker module.
│   └── worker.jar      # The JAR package for the Worker module.
├── conf
│   ├── job.cfg         # The job configuration file template.
│   ├── sys.properties  # The configuration file that contains system parameters.
│   └── workers         # The list of workers.
├── console.sh          # The command-line tool. Only Linux is supported.
├── logs                # The directory that contains logs.
└── README.md           # Instructions on how to use ossimport. We recommend that you read this file before you use ossimport.
```
Configuration files
The sys.properties and local_job.cfg configuration files are available in standalone mode. The sys.properties, job.cfg, and workers configuration files are available in distributed mode. The local_job.cfg and job.cfg configuration files have the same parameters. The workers configuration file is exclusive to the distributed mode.
sys.properties: the system parameters.

| Parameter | Description | Remarks |
| --- | --- | --- |
| workingDir | The working directory. | The directory to which the package is decompressed. Do not modify this parameter in standalone mode. In distributed mode, the working directory must be the same on each device. |
| workerUser | The SSH username used to log on to the device on which a worker resides. | If the privateKeyFile parameter is specified, the private key file is used for logon. Otherwise, the workerUser and workerPassword parameters are used. Do not modify this parameter in standalone mode. |
| workerPassword | The SSH password used to log on to the device on which a worker resides. | Do not modify this parameter in standalone mode. |
| privateKeyFile | The path of the private key file. | Specify this parameter only if you have already established an SSH connection; otherwise, leave it empty. If the privateKeyFile parameter is specified, the private key file is used for logon. Otherwise, the workerUser and workerPassword parameters are used. Do not modify this parameter in standalone mode. |
| sshPort | The SSH port. | Default value: 22. In most cases, retain the default value. Do not modify this parameter in standalone mode. |
| workerTaskThreadNum | The maximum number of threads that a worker uses to run subtasks. | The appropriate value depends on device memory and network conditions. We recommend a value of 60. On physical machines, the value can be increased, for example, to 150. If the network bandwidth is already saturated, do not increase the value further. If network conditions are poor, reduce the value, for example, to 30, to prevent timeout errors caused by competition for network resources. |
| workerMaxThroughput(KB/s) | The maximum data migration throughput of a worker. | This parameter can be used for throttling. Default value: 0, which specifies that no throttling is imposed. |
| dispatcherThreadNum | The number of threads that the tracker uses to distribute subtasks and check subtask states. | If you do not have special requirements, retain the default value. |
| workerAbortWhenUncatchedException | Specifies whether to skip the error or terminate the subtask when an unknown error occurs. | By default, unknown errors are skipped. |
| workerRecordMd5 | Specifies whether to record the MD5 hash of migrated files in the x-oss-meta-md5 metadata item. By default, the MD5 hash is not recorded. | The recorded MD5 hash can be used to verify data integrity. |
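For reference, a minimal standalone sys.properties might look like the following sketch. The values are illustrative; the file in the conf directory already contains workable defaults:

```
workingDir=/root/ossimport
workerTaskThreadNum=60
workerMaxThroughput(KB/s)=0
workerAbortWhenUncatchedException=false
workerRecordMd5=false
```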
job.cfg: the configurations for data migration tasks. The local_job.cfg and job.cfg configuration files differ in name but contain the same parameters.

| Parameter | Description | Remarks |
| --- | --- | --- |
| jobName | The name of the migration task. The value is of the STRING type. | A task name uniquely identifies a task. A task name can contain letters, digits, underscores (_), and hyphens (-), and must be 4 to 128 characters in length. You can submit multiple tasks that have different names. If you submit a task that has the same name as an existing task, the system prompts that the task already exists. You cannot submit a task with the same name until you clean the existing task. |
| jobType | The type of the migration task. The value is of the STRING type. | Default value: import. To verify data without migrating it, set this parameter to audit. |
| isIncremental | Specifies whether to enable the incremental migration mode. The value is of the BOOLEAN type. | Default value: false. If this parameter is set to true, ossimport scans the source at intervals specified by incrementalModeInterval to detect incremental data and then migrates the detected incremental data to OSS. |
| incrementalModeInterval | The interval, in seconds, at which incremental data is migrated in incremental migration mode. The value is of the INTEGER type. | This parameter is valid only if the isIncremental parameter is set to true. The minimum interval is 900 seconds. We recommend a value of at least 3600 to prevent request surges and additional overhead. |
| importSince | The time condition for the migration task. Only data whose last modified time is later than the value of this parameter is migrated. The value is of the INTEGER type. Unit: seconds. | The value is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 UTC, Thursday, January 1, 1970. You can run the date +%s command to query the current UNIX timestamp. Default value: 0, which specifies that all data is migrated. |
| srcType | The type of the migration source. The value is of the STRING type and is case-sensitive. | Valid values: local: migrates data from local storage to OSS. If this value is specified, specify the srcPrefix parameter and leave the srcAccessKey, srcSecretKey, srcDomain, and srcBucket parameters empty. oss: migrates data from one OSS bucket to another. qiniu: migrates data from KODO. bos: migrates data from BOS. ks3: migrates data from KS3. s3: migrates data from Amazon S3. youpai: migrates data from USS. http: migrates data from HTTP or HTTPS URL lists. cos: migrates data from COS. azure: migrates data from Azure Blob Storage. |
| srcAccessKey | The AccessKey ID used to access the source. The value is of the STRING type. | If the srcType parameter is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey ID used to access the source. If the srcType parameter is set to local or http, ignore this parameter. If the srcType parameter is set to youpai or azure, specify the account name used to access the source. |
| srcSecretKey | The AccessKey secret used to access the source. The value is of the STRING type. | If the srcType parameter is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey secret used to access the source. If the srcType parameter is set to local or http, ignore this parameter. If the srcType parameter is set to youpai, specify the operator password. If the srcType parameter is set to azure, specify the account key. |
| srcDomain | The source endpoint. The value is of the STRING type. | If the srcType parameter is set to local or http, ignore this parameter. If the srcType parameter is set to oss, specify the endpoint that does not include the bucket name. You can obtain the endpoint from the OSS console. If the srcType parameter is set to qiniu, specify the domain name of the bucket, which you can obtain from the KODO console. If the srcType parameter is set to bos, specify the BOS domain name, such as http://bj.bcebos.com or http://gz.bcebos.com. If the srcType parameter is set to ks3, specify the KS3 domain name, such as http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, or http://ks3-us-west-1.ksyun.com. If the srcType parameter is set to s3, specify the endpoint of the corresponding Amazon S3 region. If the srcType parameter is set to youpai, specify the USS domain name, such as http://v0.api.upyun.com (automatically identified optimal line), http://v1.api.upyun.com (China Telecom line), http://v2.api.upyun.com (China Unicom line), or http://v3.api.upyun.com (China Mobile line). If the srcType parameter is set to cos, specify the region in which the COS bucket resides, such as ap-guangzhou. If the srcType parameter is set to azure, specify the endpoint suffix in the Azure Blob Storage connection string, such as core.chinacloudapi.cn. |
| srcBucket | The name of the source bucket or container. | If the srcType parameter is set to local or http, ignore this parameter. If the srcType parameter is set to azure, specify the name of the source container. In all other cases, specify the name of the source bucket. |
| srcPrefix | The source prefix. The value is of the STRING type. This parameter is empty by default. | If the srcType parameter is set to local, specify the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ and /data/example/. Important: Paths such as c:/example//, /data//example/, and /data/example// are invalid. If the srcType parameter is set to oss, qiniu, bos, ks3, youpai, or s3, specify the prefix of the objects to be migrated without the bucket name, such as data/to/oss/. To migrate all objects, leave this parameter empty. |
| destAccessKey | The AccessKey ID used to access the destination OSS bucket. The value is of the STRING type. | You can obtain the AccessKey ID from the OSS console. |
| destSecretKey | The AccessKey secret used to access the destination OSS bucket. The value is of the STRING type. | You can obtain the AccessKey secret from the OSS console. |
| destDomain | The destination endpoint. The value is of the STRING type. | Specify the endpoint that does not include the bucket name. You can obtain the endpoint from the OSS console. For more information, see Regions and endpoints. |
| destBucket | The name of the destination OSS bucket. The value is of the STRING type. | The bucket name cannot end with a forward slash (/). |
| destPrefix | The name prefix of the migrated objects in the destination OSS bucket. This parameter is empty by default. | If you retain the default value, the migrated objects are stored in the root directory of the destination bucket. To migrate data to a specific directory in OSS, end the prefix with a forward slash (/), such as data/in/oss/. OSS object names cannot start with a forward slash (/), so do not set this parameter to a value that starts with a forward slash (/). A local file in the srcPrefix+relativePath path is migrated to the destDomain/destBucket/destPrefix+relativePath path in OSS. An object in the srcDomain/srcBucket/srcPrefix+relativePath path in the cloud is migrated to the destDomain/destBucket/destPrefix+relativePath path in OSS. |
| taskObjectCountLimit | The maximum number of files in each subtask. The value is of the INTEGER type. Default value: 10000. | This parameter affects the concurrency of subtasks. In most cases, set this parameter to a value calculated based on the following formula: Total number of files/Total number of workers/Number of migration threads (workerTaskThreadNum). The maximum value is 50000. If the total number of files is unknown, retain the default value. |
| taskObjectSizeLimit | The maximum data size, in bytes, of each subtask. The value is of the INTEGER type. The default value is 1 GB. | This parameter affects the concurrency of subtasks. In most cases, set this parameter to a value calculated based on the following formula: Total data size/Total number of workers/Number of migration threads (workerTaskThreadNum). If the total data size is unknown, retain the default value. |
| isSkipExistFile | Specifies whether to skip objects that already exist during migration. The value is of the BOOLEAN type. | A value of true specifies that objects are skipped based on their size and last modified time. A value of false specifies that existing objects are overwritten. Default value: false. |
| scanThreadCount | The number of threads that scan files in parallel. The value is of the INTEGER type. | Default value: 1. Valid values: 1 to 32. This parameter affects the efficiency of file scanning. If you do not have special requirements, retain the default value. |
| maxMultiThreadScanDepth | The maximum directory depth for parallel scanning. The value is of the INTEGER type. | Default value: 1. Valid values: 1 to 16. A value of 1 specifies parallel scanning of top-level directories. If you do not have special requirements, retain the default value. A large value may cause task failures. |
| appId | The application ID of COS. The value is of the INTEGER type. | This parameter is valid only if the srcType parameter is set to cos. |
| httpListFilePath | The absolute path of the HTTP URL list file. The value is of the STRING type. | This parameter is required if the srcType parameter is set to http. Example: c:/example/http.list. Each line in the file contains two parts separated by one or more spaces: the first part is the URL prefix, and the second part is the relative path of the object in OSS after migration. For example, the file contains the lines "http://xxx.xxx.com/aa/ bb.jpg" and "http://xxx.xxx.com/cc/ dd.jpg". If the destPrefix parameter is set to ee/, the migrated objects are named ee/bb.jpg and ee/dd.jpg. |
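As a sketch, a local_job.cfg for migrating a local directory to a bucket might contain the following settings. The keys, bucket name, endpoint, and paths are placeholders:

```
jobName=local-to-oss-demo
jobType=import
isIncremental=false
srcType=local
srcPrefix=/data/example/
destAccessKey=<yourAccessKeyId>
destSecretKey=<yourAccessKeySecret>
destDomain=http://oss-cn-hangzhou.aliyuncs.com
destBucket=examplebucket
destPrefix=data/in/oss/
```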
workers: available only in distributed mode. Specify one IP address per line. Example:

```
192.168.1.6
192.168.1.7
192.168.1.8
```

In the preceding example, 192.168.1.6 in the first line is the master. In other words, 192.168.1.6 is the IP address of the device on which the master, the workers, and the tracker are started. The console also runs on this device. Make sure that the same username, logon method, and working directory are used on all workers.
Configuration file examples
The following table describes example configurations of data migration tasks in distributed mode. The configuration file in standalone mode is named local_job.cfg and contains the same configuration items as the configuration file in distributed mode.

| Migration scenario | Description |
| --- | --- |
| Migrate data from local storage to OSS | Set the srcPrefix parameter to an absolute path that ends with a forward slash (/). |
| Migrate data from KODO to OSS | You can leave the srcPrefix and destPrefix parameters empty. If you specify the parameters, end the values with a forward slash (/). |
| Migrate data from BOS to OSS | You can leave the srcPrefix and destPrefix parameters empty. If you specify the parameters, end the values with a forward slash (/). |
| Migrate data from Amazon S3 to OSS | For more information, see AWS service endpoints. |
| Migrate data from USS to OSS | Set the srcAccessKey parameter to the operator account and the srcSecretKey parameter to the corresponding password. |
| Migrate data from COS to OSS | Specify the srcDomain parameter based on the requirements of COS V4. |
| Migrate data from Azure Blob Storage to OSS | Set the srcAccessKey parameter to the storage account and the srcSecretKey parameter to the access key. Set the srcDomain parameter to the endpoint suffix in the Azure Blob Storage connection string. |
| Migrate data between OSS buckets | This method is suitable for migrating objects whose names have different prefixes, objects between buckets in different regions, or objects between buckets of different storage classes. We recommend that you deploy ossimport on ECS instances and use internal endpoints to minimize traffic costs. |
Advanced settings
Throttle traffic
In the sys.properties configuration file, the workerMaxThroughput(KB/s) parameter specifies the maximum throughput for data migration of a worker. In throttling scenarios such as source-side throttling and network throttling, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs. After the modification is complete, restart the service for the modification to take effect.
In distributed mode, modify the sys.properties configuration file in the $OSS_IMPORT_WORK_DIR/conf directory for each worker and restart the service.
To throttle traffic on a schedule, use crontab to modify the sys.properties configuration file at the scheduled times and restart the service for the modification to take effect.
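As a sketch, the scheduled modification can be a small script invoked from crontab. The paths here are illustrative; in a real deployment, point the script at $OSS_IMPORT_WORK_DIR/conf/sys.properties and restart the service afterward:

```shell
# Demo: rewrite workerMaxThroughput(KB/s) in a sample sys.properties with sed.
CONF=./sys.properties
printf 'workerTaskThreadNum=60\nworkerMaxThroughput(KB/s)=0\n' > "$CONF"  # sample file for this demo
# Cap per-worker throughput at 2 MB/s (2048 KB/s):
sed -i 's|^workerMaxThroughput(KB/s)=.*|workerMaxThroughput(KB/s)=2048|' "$CONF"
grep 'workerMaxThroughput' "$CONF"   # -> workerMaxThroughput(KB/s)=2048
# In a real deployment, restart the service here for the change to take effect.
```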
Modify the number of concurrent subtasks
In the sys.properties configuration file, the workerTaskThreadNum parameter specifies the number of concurrent subtasks that a worker can run. If the network conditions are poor and a worker has to process a large number of subtasks, a timeout error occurs. To resolve this issue, modify the configuration by reducing the number of concurrent subtasks and then restart the service.
In the sys.properties configuration file, the workerMaxThroughput(KB/s) parameter specifies the maximum throughput for data migration of a worker. In throttling scenarios such as source-side throttling and network throttling, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs.
In the job.cfg configuration file, the taskObjectCountLimit parameter specifies the maximum number of files in each subtask. The default value is 10000. This parameter configuration affects the number of subtasks. The efficiency of implementing concurrent subtasks degrades if the number of subtasks is small.
In the job.cfg configuration file, the taskObjectSizeLimit parameter specifies the maximum data size for each subtask. The default maximum data size for each subtask is 1 GB. This parameter configuration affects the number of subtasks. The efficiency of implementing concurrent subtasks degrades if the number of subtasks is small.
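The sizing rules above can be sketched as a quick calculation. The totals below are hypothetical:

```python
# Rule of thumb: subtask limit = total / (number of workers x workerTaskThreadNum).
total_files = 1_200_000            # hypothetical total number of files
total_bytes = 12 * 1024**4         # hypothetical total data size: 12 TB
workers = 4                        # number of worker devices
worker_task_thread_num = 60        # workerTaskThreadNum in sys.properties

task_object_count_limit = total_files // (workers * worker_task_thread_num)
task_object_size_limit = total_bytes // (workers * worker_task_thread_num)

print(task_object_count_limit)     # -> 5000 files per subtask
print(task_object_size_limit)      # bytes per subtask
```

The count result (5000) is below the 50000 cap, so it can be used as taskObjectCountLimit directly.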
Important: Before you start data migration, configure the parameters in the configuration files.
After you modify parameters in the sys.properties configuration file, restart the local server or the ECS instance on which ossimport is deployed for the modification to take effect.
After the migration tasks configured in job.cfg files are submitted, the parameters in the job.cfg files cannot be modified.
Verify data without migrating data
ossimport allows you to verify data without performing a migration. To skip data migration and only verify data, set the jobType parameter in the job.cfg or local_job.cfg file to audit instead of import, and configure the other parameters in the same way as for a migration task.
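For example, switching an otherwise unchanged task configuration to verification-only mode is a one-line change in job.cfg or local_job.cfg:

```
# Verify data consistency only; no data is migrated.
jobType=audit
```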
Specify the incremental data migration mode
In incremental data migration mode, ossimport migrates the existing full data after a migration task is submitted and then migrates incremental data at the specified interval. This mode is suitable for data backup and synchronization.
The following configuration items are available for the incremental data migration mode:
In the job.cfg configuration file, the isIncremental parameter specifies whether to enable the incremental data migration mode. Valid values: true and false. Default value: false.
In the job.cfg configuration file, the incrementalModeInterval parameter specifies the interval, in seconds, at which incremental data is migrated. This configuration item takes effect only if the isIncremental parameter is set to true. The minimum value is 900. We recommend that you do not set this parameter to a value less than 3600. Otherwise, a large number of requests are wasted, which results in additional overhead.
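For example, the following job.cfg fragment migrates the full data first and then checks for incremental data once an hour:

```
isIncremental=true
incrementalModeInterval=3600
```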
Filter data to be migrated
You can set filtering conditions to migrate objects that meet specific conditions. ossimport allows you to use the prefix and last modified time to specify objects to migrate.
In the job.cfg configuration file, the srcPrefix parameter specifies the prefix of source data to be migrated. This parameter is empty by default.
If you set the srcType parameter to local, set this parameter to the path of the local directory. Specify the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ or /data/example/.
If you set the srcType parameter to oss, qiniu, bos, ks3, youpai, or s3, set this parameter to the name prefix of the source objects without the bucket name. Example: data/to/oss/. To migrate all the data in the source, leave the srcPrefix parameter empty.
In the job.cfg configuration file, the importSince parameter specifies the last modified time in seconds of source data. The importSince parameter specifies the timestamp in the UNIX format. It is the number of seconds that have elapsed since 00:00:00 Thursday, January 1, 1970. You can run the date +%s command to query the UNIX timestamp. The default value is 0, which specifies that all the data will be migrated. In incremental data migration mode, this parameter is valid only for full data migration. In a migration mode other than incremental data migration, this parameter is valid for the entire migration task.
If the last modified time of an object is earlier than the value of the importSince parameter, the object is not migrated. If the last modified time of an object is later than the value of the importSince parameter, the object is migrated.
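As a convenience, a UNIX timestamp for importSince can also be computed in Python instead of with date +%s. The cutoff date below is an example:

```python
from datetime import datetime, timezone

# Migrate only files whose last modified time is later than 2024-01-01 00:00:00 UTC.
import_since = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
print(import_since)  # -> 1704067200
```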